Searchable PDF on the command line (OCR of PDF)

betso · November 13, 2019, 10:16pm

Since this is my first posting here, let me start with first things first.
Thank you for the great work creating an awesome distro!

Now to my question:
Does anyone have a suggestion about creating searchable PDFs out of scanned documents?
I could not find information on how to run, for example, pypdfocr or pdfsandwich. I tried to find a way to install tesseract, but I can only get it as part of a flatpak (which is not an option for command-line use).
I just need to be able to make a bunch of PDFs searchable. Does anyone have an idea?

Thank you!

Businux · November 14, 2019, 1:37am

doct0rHu · November 14, 2019, 1:41am

Hi, you’re mixing two things: one is searching PDF files for a given query, and the other is generating a text layer by OCR.

For searching, you may try ripgrep-all.

For OCR, tesseract is a solid choice and you can compile it from source. Instructions are available here. I think most dependencies are available via swupd, but I’m not sure about Leptonica; you might need to compile it before tesseract.

Businux · November 14, 2019, 1:50am

betso · November 14, 2019, 12:30pm

Thank you, @doct0rHu!
This is the information that I needed: I need to compile tesseract from source.

I understand the two parts of creating a searchable PDF and this is the reason I wrote that I need “creating searchable PDFs out of scanned documents”. This implies OCRing the PDF and creating the text layer.
I just wasn’t sure whether I was missing a bundle (one that is not a flatpak) that would contain that.

Again, thank you!

P.S. Is @Businux a bot?

Businux · November 14, 2019, 3:11pm

LOL, no

Give Recoll a try, you might like it.

betso · November 14, 2019, 3:58pm

Haha, thank you, @Businux!

I tried Recoll a couple of years ago and it is indeed a good tool. I just need an option to OCR 500+ PDF files, so that I can pdfgrep them later. That’s why I am looking for a command-line solution, so I can write a script to do the job.
I’ll take a look at DocFetcher though; didn’t know it.

Thanks a lot anyway!
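
A batch job like the one described above could be sketched roughly as follows. This is only a minimal sketch, not something posted in the thread: it assumes that pdftoppm (from poppler-utils) and tesseract are both on the PATH, and the function names (rasterize_cmd, ocr_cmd, output_base, ocr_pdf, ocr_tree) and the directory names in the usage note are made up for illustration. It rasterizes each PDF to page images, because tesseract reads images rather than PDFs, and then asks tesseract for PDF output, which carries the recognized text as a searchable layer.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def rasterize_cmd(pdf, image_prefix, dpi=300):
    """pdftoppm writes one PNG per page, named <image_prefix>-<page>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(image_prefix)]


def ocr_cmd(page_list, out_base, lang="eng"):
    """tesseract accepts a text file listing images; 'pdf' selects PDF output."""
    return ["tesseract", str(page_list), str(out_base), "-l", lang, "pdf"]


def output_base(pdf, src_dir, dst_dir):
    """Mirror the source tree under dst_dir; tesseract appends '.pdf' itself."""
    rel = Path(pdf).relative_to(src_dir)
    return Path(dst_dir) / rel.with_suffix("")


def page_number(image_path):
    """Extract the page number from a name such as 'page-012.png'."""
    return int(image_path.stem.rsplit("-", 1)[1])


def ocr_pdf(pdf, out_base, lang="eng", dpi=300):
    """OCR a single scanned PDF into <out_base>.pdf."""
    out_base.parent.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(rasterize_cmd(pdf, prefix, dpi), check=True)
        # Sort numerically so the page order does not depend on zero-padding.
        pages = sorted(Path(tmp).glob("page-*.png"), key=page_number)
        page_list = Path(tmp) / "pages.txt"
        page_list.write_text("".join(f"{p}\n" for p in pages))
        subprocess.run(ocr_cmd(page_list, out_base, lang), check=True)


def ocr_tree(src_dir, dst_dir, lang="eng"):
    """OCR every PDF below src_dir, skipping outputs that already exist."""
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    for pdf in sorted(Path(src_dir).rglob("*.pdf")):
        base = output_base(pdf, src_dir, dst_dir)
        if base.with_name(base.name + ".pdf").exists():
            continue  # already processed in an earlier run
        ocr_pdf(pdf, base, lang)
```

Calling ocr_tree("scans", "searchable") would then leave one OCRed PDF per input under searchable/, which pdfgrep can search afterwards. Keep the output directory outside the input directory, otherwise a second run would pick up the generated files as new input.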

doct0rHu · November 14, 2019, 4:48pm

I guess ripgrep-all should be fast enough, given that ripgrep is the fastest among grep-like tools.

Businux · November 14, 2019, 9:22pm

Recoll very recently gained a GNOME Shell search provider.

Perfect for Clear Linux integration.

Unlike Nepomuk and similar software, it doesn’t run as a daemon, just on demand, saving resources.

btwarden · November 15, 2019, 9:21pm

I’ve added a tesseract bundle, although we don’t bundle the trained language models, so you’ll have to download those and set the TESSDATA_PREFIX variable to point to the directory containing them.
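
For a script that calls tesseract, the variable can also be set per process instead of globally. The following is a minimal sketch under a few assumptions: the helper name tesseract_env and the directory in the usage note are hypothetical, and the model files are assumed to follow tesseract's usual <lang>.traineddata naming and to sit directly in the directory that TESSDATA_PREFIX points to, as described above.

```python
import os
from pathlib import Path


def tesseract_env(tessdata_dir, lang="eng", base_env=None):
    """Return a copy of the environment with TESSDATA_PREFIX set.

    Fails early if a model file for the requested language(s) is missing,
    which is easier to diagnose than a tesseract error later on.
    """
    tessdata = Path(tessdata_dir).expanduser().resolve()
    # tesseract combines languages with '+', e.g. "eng+deu".
    for code in lang.split("+"):
        model = tessdata / f"{code}.traineddata"
        if not model.is_file():
            raise FileNotFoundError(f"missing language model: {model}")
    env = dict(os.environ if base_env is None else base_env)
    env["TESSDATA_PREFIX"] = str(tessdata)
    return env
```

The returned dictionary can be passed as the env argument of subprocess.run when invoking tesseract, for example env=tesseract_env("~/tessdata", "eng").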

betso · November 15, 2019, 9:32pm

Awesome!
Setting TESSDATA_PREFIX is easy.
Thank you so much @btwarden!
