Wednesday, March 4, 2020

Optimizing PDF File Size

Patrik Weiskircher:

However, [incremental saving] also causes the file size to grow and grow and never get smaller. This is especially noticeable if you work with a lot of images. And even if you remove an image, it still is included in the PDF; you only instruct your PDF viewer to not show it again.


Another nice feature of PDFs is that objects like fonts and images can be shared across pages — a feature that was specifically made in an effort to save on the size of files. This means you can have an image logo on each page and it is only included in the PDF file once.


So right before we start saving the document, we go through the entire PDF file and collect a list of all the reachable object numbers. Then, when saving the PDF, but before we write out an indirect object, we compare its object number with the list we collected, and if the object isn’t included, we simply don’t write it out.

I can’t tell from this whether they coalesce duplicate objects or only garbage collect unreferenced ones.
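To make the quoted approach concrete, here is a minimal sketch of a reachability pass over a toy object table. It is not PSPDFKit's code: the `Ref` class, the dict-based object model, and the function names are all assumptions made for illustration. It only drops unreachable objects; it does not attempt to coalesce duplicates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for a PDF indirect reference such as `12 0 R`."""
    num: int


def iter_refs(value):
    """Yield every Ref nested inside a dict, list, or bare value."""
    if isinstance(value, Ref):
        yield value
    elif isinstance(value, dict):
        for v in value.values():
            yield from iter_refs(v)
    elif isinstance(value, (list, tuple)):
        for v in value:
            yield from iter_refs(v)


def reachable_object_numbers(objects, trailer):
    """Collect the numbers of all objects reachable from the trailer."""
    seen = set()
    stack = [r.num for r in iter_refs(trailer)]
    while stack:
        num = stack.pop()
        # Skip objects already visited (cycles such as /Parent) and
        # references to objects that do not exist in the table.
        if num in seen or num not in objects:
            continue
        seen.add(num)
        stack.extend(r.num for r in iter_refs(objects[num]))
    return seen


def objects_to_write(objects, trailer):
    """Keep only the objects whose number is in the reachable set."""
    live = reachable_object_numbers(objects, trailer)
    return {num: obj for num, obj in objects.items() if num in live}


# Toy document: two pages share one logo image (object 5); object 6 is an
# image that was removed from its page but is still stored in the file.
objects = {
    1: {"Type": "Catalog", "Pages": Ref(2)},
    2: {"Type": "Pages", "Kids": [Ref(3), Ref(4)]},
    3: {"Type": "Page", "Parent": Ref(2), "Resources": {"XObject": {"Logo": Ref(5)}}},
    4: {"Type": "Page", "Parent": Ref(2), "Resources": {"XObject": {"Logo": Ref(5)}}},
    5: {"Type": "XObject", "Subtype": "Image"},
    6: {"Type": "XObject", "Subtype": "Image"},
}
trailer = {"Root": Ref(1)}

kept = objects_to_write(objects, trailer)
print(sorted(kept))  # [1, 2, 3, 4, 5]
```

The shared logo survives because both pages still point at it, while the orphaned image is skipped at write time. Objects 5 and 6 have identical contents here, yet the sketch never compares them, which is the distinction between garbage collecting and coalescing.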

1 Comment

My understanding is that they just garbage collect. They bring up multiple references to a single object because that sharing is why image data is typically not removed when a PDF is edited: removing it would require extra care to make sure the image is not referenced elsewhere, and that would slow down editing.
