Troubleshooting PDF OCR using Python on Mac

I wrote a script to extract some text from a PDF (image-based text, so pdftotext wouldn’t work).

Using pdf2image’s convert_from_path, I simply could not get any data out of the PDF. I tried multiple PDFs while testing, and convert_from_path just kept returning an empty result.

It turned out that my Homebrew install of xpdf was interfering with my Homebrew install of poppler.

Uninstalling xpdf (brew uninstall xpdf) and reinstalling poppler (brew install poppler) seemed to fix things. My suspicion is that both packages ship their own version of pdfinfo, which pdf2image uses, and that the xpdf one was being picked up. That’s just a hunch; I don’t know enough about what’s going on under the hood. So if pdf2image isn’t working correctly for you and you’re on a Mac, make sure you’ve got poppler installed and that xpdf’s pdfinfo isn’t being used.
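
If you want to check for the same conflict before uninstalling anything, here is a rough sketch that lists every pdfinfo on your PATH in lookup order. The helper name find_executables is my own invention and not part of pdf2image. I'm also assuming the clash shows up as more than one pdfinfo on PATH, which matches my hunch above but isn't something I've confirmed.

```python
import os
from pathlib import Path


def find_executables(name, search_path=None):
    """Return every executable called `name` on the search path, in lookup order.

    Hypothetical helper for illustration; it only inspects the filesystem.
    """
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    matches = []
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue
        candidate = Path(directory) / name
        # Keep regular files (or symlinks to them) that are executable.
        if candidate.is_file() and os.access(candidate, os.X_OK):
            path = str(candidate)
            if path not in matches:  # skip directories repeated in PATH
                matches.append(path)
    return matches


if __name__ == "__main__":
    hits = find_executables("pdfinfo")
    if not hits:
        print("No pdfinfo found on PATH; is poppler installed?")
    for path in hits:
        # Homebrew links binaries into a shared bin directory, so the
        # resolved path shows which package each pdfinfo belongs to.
        print(f"{path} -> {Path(path).resolve()}")
    if len(hits) > 1:
        print("More than one pdfinfo found; the first one listed wins.")
```

The first path printed is the one a plain pdfinfo call would run. If it resolves to something under xpdf rather than poppler, that would fit what I ran into. As an alternative to uninstalling xpdf, convert_from_path also accepts a poppler_path argument that points it at a specific directory of poppler binaries, though I haven't tested that route myself.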
