# PDF

Main reference: PyMuPDF

# PDF Structure

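In most cases page.rect == page.MediaBox == page.CropBox; in some cases page.CropBox is a proper subset of page.MediaBox. The origin of page.MediaBox should be (0, 0). If the page is rotated, page.rect may differ from page.CropBox, because page.rect takes the rotation into account. (Recent PyMuPDF releases spell these properties page.mediabox and page.cropbox.)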

WARNING

The PDF coordinate origin is at the bottom left of the page, but PyMuPDF transforms it to the top left (y grows downward).

For abnormal PDFs, element positions after opening are based on page.rect as expected, but page.insert_text(bottom_left_point, text) (page.insertText in older releases) might not work as expected.
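The origin flip described in the warning above can be sketched in plain Python. This is a minimal illustration, assuming an unrotated page whose MediaBox starts at (0, 0) and equals its CropBox; the function names are hypothetical helpers, not PyMuPDF API. For rotated or otherwise abnormal pages, PyMuPDF's own page.transformation_matrix is probably the safer route.

```python
from typing import Tuple


def pdf_to_top_left_point(x: float, y: float, page_height: float) -> Tuple[float, float]:
    """Map a point from PDF space (origin bottom left, y up)
    to top-left space (origin top left, y down)."""
    return (x, page_height - y)


def pdf_to_top_left_rect(x0: float, y0: float, x1: float, y1: float,
                         page_height: float) -> Tuple[float, float, float, float]:
    """Map a rectangle given as (x0, y0, x1, y1) with y0 < y1 in PDF space.

    After the flip, the PDF top edge (y1) becomes the smaller y value.
    """
    return (x0, page_height - y1, x1, page_height - y0)


# Example: an A4 page is 595 x 842 points.
point = pdf_to_top_left_point(72, 770, 842)
rect = pdf_to_top_left_rect(72, 700, 172, 770, 842)
```

Only y changes; x is the same in both coordinate systems under these assumptions.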

# Save Bytes to PDF

Method one: write the bytes to disk unchanged.

with open('123.pdf', 'wb') as saver:
    saver.write(bytes_input)

Method two: load the bytes with PyMuPDF and save again.

import fitz

doc = fitz.open(stream=bytes_input, filetype='pdf')
# pass deflate=True to compress streams and save space
# pass no_new_id=True to keep the file ID unchanged, so the MD5 stays consistent across saves
doc.save('123.pdf')
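As a small illustration of method one, the sketch below writes the bytes unchanged and confirms that the file on disk has the same MD5 as the input. The helper name write_pdf_bytes and the placeholder payload are assumptions for demonstration only; the check for a leading %PDF- marker is a simple sanity test, not a full validation.

```python
import hashlib
import tempfile
from pathlib import Path


def write_pdf_bytes(data: bytes, path: Path) -> str:
    """Write raw PDF bytes unchanged and return the MD5 of the written file."""
    # Assumption: a well-formed PDF begins with the "%PDF-" header marker.
    if not data.startswith(b"%PDF-"):
        raise ValueError("input does not start with a %PDF- header")
    path.write_bytes(data)
    return hashlib.md5(path.read_bytes()).hexdigest()


# Hypothetical stand-in payload; real input would be a complete PDF file.
sample = b"%PDF-1.7\n% placeholder body\n%%EOF\n"
with tempfile.TemporaryDirectory() as tmp:
    digest = write_pdf_bytes(sample, Path(tmp) / "123.pdf")

# Method one is byte-exact, so the digests match.
same_digest = digest == hashlib.md5(sample).hexdigest()
```

Method two rewrites the file, so its output generally differs byte for byte from the input.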

# Extract images

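Method one:

# get_text('blocks') returns tuples; the last item (index 6) is block_type, 1 = image
img_blocks = [b for b in pdf_page.get_text('blocks') if b[6] == 1]
# get_text('dict') returns dicts; slower because the image content is included
img_blocks = [b for b in pdf_page.get_text('dict')['blocks'] if b['type'] == 1]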
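The 'blocks' output is a list of tuples shaped like (x0, y0, x1, y1, text, block_no, block_type), so filtering works by position rather than by key. The sketch below runs on hand-made sample tuples so it needs no PDF; image_block_bboxes is a hypothetical helper name. Depending on the PyMuPDF version, image blocks may only show up in 'blocks' output when the extraction flags include fitz.TEXT_PRESERVE_IMAGES, which is worth checking if the list comes back empty.

```python
IMAGE_BLOCK = 1  # block_type for image blocks; text blocks use 0


def image_block_bboxes(blocks):
    """Return (x0, y0, x1, y1) for every image block in a 'blocks' list."""
    return [tuple(b[:4]) for b in blocks if b[6] == IMAGE_BLOCK]


# Hand-made sample in the same tuple layout as get_text('blocks').
sample_blocks = [
    (72.0, 72.0, 300.0, 90.0, "Hello\n", 0, 0),
    (72.0, 100.0, 272.0, 250.0, "<image: DeviceRGB, width: 200, height: 150, bpc: 8>", 1, 1),
]
bboxes = image_block_bboxes(sample_blocks)
```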

Method two:

# https://pymupdf.readthedocs.io/en/latest/page.html#Page.get_image_info
pdf_page.get_image_info()
pdf_page.get_image_info(xrefs=True)  # also returns each image's xref, but roughly doubles the run time

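Method three:

# https://pymupdf.readthedocs.io/en/latest/document.html#Document.get_page_images
# returns very little information, but is the fastest
pdf_page.get_images()

# then, for each item (img_tuple) in that list, extract more:
pdf_page.get_image_bbox(img_tuple)  # position of the image on the page
pdf_doc.extract_image(img_tuple[0])['image']  # img_tuple[0] is the xref; 'image' holds the raw image bytes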
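The calls in method three can be combined into one loop. The following is a sketch rather than a definitive recipe: collect_page_images is a hypothetical helper, it takes the document and page as arguments instead of importing fitz itself, and it assumes get_images(full=True) so that each item is in the long form that get_image_bbox accepts.

```python
def collect_page_images(doc, page):
    """Return one record per image referenced by `page`.

    `doc` and `page` are expected to be a PyMuPDF Document and one of its pages.
    """
    records = []
    # full=True yields the long item form, which get_image_bbox can consume.
    for item in page.get_images(full=True):
        xref = item[0]  # the first entry of each item is the image xref
        info = doc.extract_image(xref)  # dict with keys such as 'ext' and 'image'
        records.append({
            "xref": xref,
            "bbox": page.get_image_bbox(item),
            "ext": info["ext"],
            "data": info["image"],
        })
    return records


# Usage sketch (assumes PyMuPDF is installed and 'input.pdf' exists):
#   import fitz
#   pdf_doc = fitz.open('input.pdf')
#   for rec in collect_page_images(pdf_doc, pdf_doc[0]):
#       with open(f"img_{rec['xref']}.{rec['ext']}", 'wb') as f:
#           f.write(rec['data'])
```

Keeping the xref in each record makes it easy to skip images that appear on several pages.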

# Image to PDF

img_doc = fitz.Document('debug.png')
pdf_bytes = img_doc.convert_to_pdf()  # returns the PDF as a bytes object, not a Document
with open('debug.pdf', 'wb') as writer:
    writer.write(pdf_bytes)

Image to image:

pix = fitz.Pixmap("input.xxx")  # any supported input format
pix.save("output.yyy")  # any supported output format

# Pending

  1. How to get the image drawing order? partial solution
Last Updated: 2/1/2024, 4:22:58 PM