Main reference: PyMuPDF
# PDF Structure
In most cases page.rect == page.MediaBox == page.CropBox, and in some cases page.CropBox is a subset of page.MediaBox. Moreover, the origin of page.MediaBox should be (0, 0). If the page is rotated, page.rect might not equal page.CropBox.
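The relationships above can be checked mechanically. The following is a minimal sketch that works on plain `(x0, y0, x1, y1)` tuples; the function names, the tolerance, and the wording of the findings are hypothetical and not part of PyMuPDF.

```python
from typing import List, Sequence

Box = Sequence[float]  # (x0, y0, x1, y1)


def boxes_equal(a: Box, b: Box, tol: float = 1e-3) -> bool:
    """True if every coordinate of a and b matches within tol."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))


def box_contains(outer: Box, inner: Box, tol: float = 1e-3) -> bool:
    """True if inner lies completely inside outer (within tol)."""
    return (
        outer[0] - tol <= inner[0]
        and outer[1] - tol <= inner[1]
        and inner[2] <= outer[2] + tol
        and inner[3] <= outer[3] + tol
    )


def page_box_findings(rect: Box, mediabox: Box, cropbox: Box,
                      rotation: int = 0) -> List[str]:
    """List every way a page deviates from rect == mediabox == cropbox."""
    findings = []
    if not boxes_equal(tuple(mediabox)[:2], (0, 0)):
        findings.append("mediabox origin is not (0, 0)")
    if not box_contains(mediabox, cropbox):
        findings.append("cropbox is not inside mediabox")
    elif not boxes_equal(mediabox, cropbox):
        findings.append("cropbox is a strict subset of mediabox")
    if not boxes_equal(rect, cropbox):
        suffix = " (page is rotated)" if rotation % 360 else ""
        findings.append("rect differs from cropbox" + suffix)
    return findings
```

With a real page, the inputs could be built as `tuple(page.rect)`, `tuple(page.mediabox)`, `tuple(page.cropbox)` and `page.rotation`, assuming a PyMuPDF release that exposes the lower-case attribute names. An empty result means the page matches the common case.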
WARNING
A PDF's origin is at the bottom left, but PyMuPDF transforms it to the top left.
For abnormal PDFs, element positions after opening are based on page.rect as expected, but page.insertText(bottom_left_point, text) (page.insert_text in newer PyMuPDF releases) might not work as expected.
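The flip between the two origins is a single subtraction on the y axis. The sketch below assumes an unrotated page whose MediaBox starts at (0, 0); the helper names are hypothetical and the code does not call PyMuPDF.

```python
from typing import Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def flip_point(x: float, y: float, page_height: float) -> Point:
    """Convert a point between bottom-left and top-left origins.

    The same formula works in both directions because the flip is
    its own inverse.
    """
    return (x, page_height - y)


def flip_rect(box: Box, page_height: float) -> Box:
    """Convert a rectangle between the two origins.

    y0 and y1 swap roles so that y0 <= y1 still holds afterwards.
    """
    x0, y0, x1, y1 = box
    return (x0, page_height - y1, x1, page_height - y0)
```

For an A4 page of height 842, a point 72 units above the bottom edge becomes `flip_point(72, 72, 842)`, that is, 770 units below the top edge. For abnormal PDFs (non-zero MediaBox origin, CropBox offset, rotation) this simple flip is not sufficient.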
materials
# Save Bytes to PDF
method one:
with open('123.pdf', 'wb') as saver:
    saver.write(bytes_input)
method two:
import fitz
doc = fitz.open(stream=bytes_input, filetype='pdf')
# add deflate=True to compress streams and save space
# add no_new_id=True to leave the file ID untouched, so the md5 stays consistent across saves
doc.save('123.pdf')
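To compare the two methods, it can help to wrap each one in a function and hash the result. This is a minimal sketch: the function names are hypothetical, and `fitz` is imported inside `save_with_pymupdf` only so that the other helpers load without PyMuPDF installed.

```python
import hashlib
from pathlib import Path


def save_raw(data: bytes, path: str) -> None:
    """Method one: write the bytes unchanged."""
    Path(path).write_bytes(data)


def save_with_pymupdf(data: bytes, path: str) -> None:
    """Method two: let PyMuPDF parse and re-save the document."""
    import fitz  # PyMuPDF

    doc = fitz.open(stream=data, filetype="pdf")
    try:
        # deflate=True compresses streams; no_new_id=True keeps the file ID
        doc.save(path, deflate=True, no_new_id=True)
    finally:
        doc.close()


def md5_of(path: str) -> str:
    """Hex md5 digest of a file's content."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()
```

Method one reproduces the input byte for byte, so `md5_of` matches the digest of `bytes_input`. Method two rewrites the file, so its digest will generally differ from the input.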
# Extract images
method one:
img_blocks = [b for b in pdf_page.get_text('blocks') if b[6] == 1]  # 'blocks' yields tuples; index 6 is the block type
img_blocks = [b for b in pdf_page.get_text('dict')['blocks'] if b['type'] == 1]  # slower because the image content is included
method two:
# https://pymupdf.readthedocs.io/en/latest/page.html#Page.get_image_info
pdf_page.get_image_info()
pdf_page.get_image_info(xrefs=True)  # takes roughly twice as long
method three:
# https://pymupdf.readthedocs.io/en/latest/document.html#Document.get_page_images
# very little information but the fastest
pdf_page.get_images()
# continue to extract
pdf_page.get_image_bbox(img_tuple)
pdf_doc.extract_image(img_tuple[0])['image']
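Method three and the follow-up calls can be combined into one loop. The sketch below assumes `pdf_doc` and `pdf_page` are an open PyMuPDF document and one of its pages; the function name and the shape of the returned dictionaries are hypothetical.

```python
from typing import Any, Dict, List


def collect_page_images(pdf_doc: Any, pdf_page: Any) -> List[Dict[str, Any]]:
    """Return xref, bbox, file extension and raw bytes for each image."""
    results = []
    for img_tuple in pdf_page.get_images(full=True):
        xref = img_tuple[0]  # first entry of each item is the image xref
        extracted = pdf_doc.extract_image(xref)
        if not extracted:  # skip anything that yields no image data
            continue
        results.append({
            "xref": xref,
            "bbox": pdf_page.get_image_bbox(img_tuple),
            "ext": extracted["ext"],      # e.g. "png" or "jpeg"
            "image": extracted["image"],  # raw image bytes
        })
    return results
```

Each entry can then be written to disk with a name such as `f"img_{entry['xref']}.{entry['ext']}"`, opened in binary write mode.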
# Image to PDF
img_doc = fitz.Document('debug.png')
pdf_bytes = img_doc.convert_to_pdf()  # returns the PDF as bytes, not a Document
with open('debug.pdf', 'wb') as writer:
    writer.write(pdf_bytes)
To convert between image formats instead:
pix = fitz.Pixmap("input.xxx")  # any supported input format
pix.save("output.yyy")  # any supported output format
# Pending
- How to get the image drawing order? partial solution