Searchable PDF on the command line (OCR of PDF)

Since this is my first posting here, let me start with first things first.
Thank you for the great work creating an awesome distro!

Now to my question:
Does anyone have a suggestion about creating searchable PDFs out of scanned documents?
I was not able to find information on how to run e.g. pypdfocr or pdfsandwich. I tried to find a way to install tesseract, but I can only get it as part of a flatpak (which is not an option for command-line use).
I just need to be able to make a bunch of PDFs searchable. Does anyone have an idea?

Thank you!

1 Like

Hi, you’re mixing two things: one is searching for a given query in PDF files, and the other is generating a text layer by OCR.

For searching, you may try ripgrep-all.

For OCR, tesseract is a solid choice and you can compile it from source. Instructions are available here. I think most dependencies are available via swupd, but I’m not sure about Leptonica, so you might need to compile that before tesseract.

1 Like

Thank you, @doct0rHu!
This is the information that I needed: I need to compile tesseract from source.

I understand the two parts of creating a searchable PDF and this is the reason I wrote that I need “creating searchable PDFs out of scanned documents”. This implies OCRing the PDF and creating the text layer.
I just wasn’t sure whether I was missing a bundle (one that is not a flatpak) that would contain it.

Again, thank you!

P.S. Is @Businux a bot?

LOL, no :smile:

Give Recoll a try, you might like it.

2 Likes

Haha, thank you, @Businux! :slight_smile:

I tried Recoll a couple of years ago and it is indeed a good tool. I just need a way to OCR 500+ PDF files so that I can pdfgrep them later. That’s why I am looking for a command-line solution, so I can write a script to do the job.
I’ll take a look at DocFetcher though; I didn’t know about it.

Thanks a lot anyway!
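
For reference, the batch job described above could be sketched roughly like this in Python. It is only a sketch under a few assumptions: tesseract does not read PDFs directly, so each file is first rendered to PNG pages with pdftoppm from poppler (an assumption, not a tool mentioned in this thread), and tesseract is then given a text file listing those pages together with its `pdf` output config. The function names, the 300 dpi default, and the directory names are all hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path


def rasterize_cmd(pdf: Path, prefix: Path, dpi: int = 300) -> list[str]:
    """pdftoppm command that renders each page to <prefix>-<n>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_cmd(page_list: Path, out_base: Path, lang: str = "eng") -> list[str]:
    """tesseract command that writes <out_base>.pdf with a text layer.

    page_list is a text file naming one page image per line.
    """
    return ["tesseract", str(page_list), str(out_base), "-l", lang, "pdf"]


def page_number(image: Path) -> int:
    """Extract the page number from a name such as page-07.png."""
    return int(image.stem.rsplit("-", 1)[1])


def make_searchable(pdf: Path, out_dir: Path, lang: str = "eng") -> Path:
    """OCR one scanned PDF and return the path of the searchable copy."""
    out_dir.mkdir(parents=True, exist_ok=True)
    out_base = out_dir / pdf.stem
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(rasterize_cmd(pdf, prefix), check=True)
        # Sort numerically so the pages keep their original order.
        pages = sorted(Path(tmp).glob("page-*.png"), key=page_number)
        page_list = Path(tmp) / "pages.txt"
        page_list.write_text("".join(f"{p}\n" for p in pages))
        subprocess.run(ocr_cmd(page_list, out_base, lang), check=True)
    # tesseract appends ".pdf" to the output base name itself.
    return out_dir / (pdf.stem + ".pdf")


def ocr_directory(src: Path, dst: Path, lang: str = "eng") -> list[Path]:
    """OCR every *.pdf in src, skipping files that already exist in dst."""
    results = []
    for pdf in sorted(src.glob("*.pdf")):
        target = dst / (pdf.stem + ".pdf")
        if not target.exists():
            make_searchable(pdf, dst, lang)
        results.append(target)
    return results
```

Calling `ocr_directory(Path("scans"), Path("searchable"))` would leave one OCRed copy per input in `searchable`, which pdfgrep could then search. Because existing outputs are skipped, an interrupted run over 500+ files can simply be restarted.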

1 Like

I guess ripgrep-all should be fast enough, given that ripgrep is among the fastest grep-like tools.

1 Like

Recoll has very recently gained a GNOME Shell search provider.

Perfect for Clear Linux integration.

Unlike Nepomuk and similar software, it doesn’t run as a daemon, just on demand, saving resources.

2 Likes

I’ve added a tesseract bundle, although we don’t bundle the trained language models, so you’ll have to download those and set the TESSDATA_PREFIX variable to point to the directory containing them.
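
For anyone scripting this, here is a minimal Python sketch of the setup described above. It assumes the language models have already been downloaded as `<lang>.traineddata` files into one directory (the helper name `tesseract_env` and any directory you pass in are hypothetical), and it follows the description above in pointing TESSDATA_PREFIX at the directory that contains those files; some tesseract versions may expect a different level, so treat that as an assumption to verify.

```python
import os
from pathlib import Path


def tesseract_env(tessdata_dir: Path, lang: str = "eng") -> dict[str, str]:
    """Return a copy of the environment with TESSDATA_PREFIX set.

    Raises FileNotFoundError if <lang>.traineddata is not in tessdata_dir,
    so a missing model is reported before tesseract is ever started.
    """
    model = tessdata_dir / f"{lang}.traineddata"
    if not model.is_file():
        raise FileNotFoundError(f"missing language model: {model}")
    env = dict(os.environ)
    env["TESSDATA_PREFIX"] = str(tessdata_dir)
    return env
```

The returned mapping can be passed as `env=` to `subprocess.run` when invoking tesseract, which keeps the setting local to the script instead of changing the shell profile.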

1 Like

Awesome! :slight_smile:
Setting TESSDATA_PREFIX is easy.
Thank you so much @btwarden!

2 Likes