Python @programming.dev SandbagTiara2816 @lemmy.dbzer0.com 10 mo. ago

Modules for extracting data from PDF?

I’m not a software developer, but I like to use Python to help speed up some of my office work. One of my regular tasks is to print a stack of ~40 sheets of paper, highlight key information for each entry (about 3 entries per page), and fill out a spreadsheet with that information that then gets loaded into our software.

This is time-consuming, and I’d like to write a program that can scan the OCR-ed PDFs and pull the relevant information into a CSV.

I’m confident I could handle it from there, but I know that PDFs are tricky files to work with. Are there any Python modules that might be a good fit for the approach I’m hoping to take here? Thanks!


5 comments

pypdf, which was recently updated to version 3, is worth a look. It sometimes takes a bit of wrangling for more specific use cases: I've used it alongside reportlab when I needed to add text and other elements with a bit more flexibility.
- From what I understand, PyPDF3 and PyPDF4 are separate projects from pypdf, which as of last year is the modern continuation of PyPDF2.
  
  
  That's correct, AFAIK. I believe the maintainers of PyPDF2 merged it back into the original pypdf for version 3.
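
For the workflow in the original question (OCR-ed PDF in, CSV out), here is a minimal sketch of how pypdf could fit together with the standard `re` and `csv` modules. It assumes the PDFs already carry a text layer from OCR and that pypdf is installed (`pip install pypdf`). The `Name:` / `ID:` / `Amount:` layout, the field names, and the helper functions are all hypothetical placeholders, since the thread does not say what the entries look like; the pattern would need adjusting to the real documents.

```python
import csv
import io
import re

# Hypothetical entry layout; replace with a pattern that matches your PDFs.
ENTRY_PATTERN = re.compile(
    r"Name:\s*(?P<name>.+?)\s+ID:\s*(?P<id>\d+)\s+Amount:\s*(?P<amount>[\d.,]+)"
)
FIELDNAMES = ["name", "id", "amount"]


def pdf_to_text(path):
    """Return the text layer of every page in the PDF, joined by newlines."""
    from pypdf import PdfReader  # third-party package, imported only when needed

    reader = PdfReader(path)
    # extract_text() can come back empty for pages with no text layer.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def parse_entries(text):
    """Find every entry in the text and return one dict per entry."""
    return [match.groupdict() for match in ENTRY_PATTERN.finditer(text)]


def write_csv(entries, out_file):
    """Write the parsed entries to an open file-like object as CSV."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(entries)


# Demo on made-up text, so the parsing step can be tried without a PDF.
sample = (
    "Name: Jane Doe\nID: 1042\nAmount: 1,250.00\n"
    "Name: John Roe ID: 7 Amount: 99.50"
)
entries = parse_entries(sample)
buffer = io.StringIO()
write_csv(entries, buffer)
print(buffer.getvalue())
```

With a real file, the same steps would be `parse_entries(pdf_to_text("scan.pdf"))` followed by `write_csv` on a file opened with `open("out.csv", "w", newline="")`, where both file names are placeholders. Extracted text often has odd spacing or line order, so printing the raw output of `pdf_to_text` first and building the pattern from that is probably the safest way to start.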