Since this is my first posting here, let me start with first things first.
Thank you for the great work creating an awesome distro!
Now to my question:
Does anyone have a suggestion about creating searchable PDFs out of scanned documents?
I could not find any information on how to run tools such as pypdfocr or pdfsandwich. I also looked for a way to install tesseract, but I can only get it as part of a flatpak, which is not an option for command-line use.
I just need to be able to make a bunch of PDFs searchable. Does anyone have an idea?
For OCR, tesseract is a solid choice, and you can compile it from source. Instructions are available here. I think most dependencies are available via swupd, but I’m not sure about Leptonica; you might need to compile that yourself before building tesseract.
Thank you, @doct0rHu!
This is the information that I needed: I need to compile tesseract from source.
I understand the two parts of creating a searchable PDF and this is the reason I wrote that I need “creating searchable PDFs out of scanned documents”. This implies OCRing the PDF and creating the text layer.
I just wasn’t sure whether I was missing a bundle (other than a flatpak) that already contains it.
I tried Recoll a couple of years ago and it is indeed a good tool. I just need a way to OCR 500+ PDF files so that I can pdfgrep them later. That’s why I am looking for a command-line solution, so I can write a script to do the job.
I’ll take a look at DocFetcher though; I didn’t know about it.
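For the batch job described above, here is a rough sketch of what such a script might look like. It is only an illustration, not a tested recipe for this distro. It assumes that both `tesseract` and `pdftoppm` (from poppler-utils) are on the PATH, that 300 dpi is a sensible scan resolution, and that the language is `eng`. The function names are made up for this example. tesseract does not read PDFs directly, so each PDF is first rasterized to PNG pages, and tesseract is then given a text file listing those images so that it writes one multi-page searchable PDF.

```python
import subprocess
import tempfile
from pathlib import Path


def pdftoppm_cmd(pdf, prefix, dpi=300):
    # Rasterize every page to PNG files named <prefix>-<page number>.png
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def tesseract_cmd(image_list, out_base, lang="eng"):
    # The trailing "pdf" selects PDF output; tesseract appends ".pdf" itself
    return ["tesseract", str(image_list), str(out_base), "-l", lang, "pdf"]


def page_number(image):
    # "page-012.png" -> 12, so pages sort numerically whatever the padding
    return int(Path(image).stem.rsplit("-", 1)[1])


def ocr_pdf(pdf, out_dir, lang="eng"):
    """OCR one scanned PDF and return the path of the searchable copy."""
    pdf, out_dir = Path(pdf), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(pdftoppm_cmd(pdf, prefix), check=True)
        pages = sorted(Path(tmp).glob("page-*.png"), key=page_number)
        # tesseract accepts a text file with one image path per line
        image_list = Path(tmp) / "pages.txt"
        image_list.write_text("".join(f"{p}\n" for p in pages))
        out_base = out_dir / pdf.stem
        subprocess.run(tesseract_cmd(image_list, out_base, lang), check=True)
    return out_dir / (pdf.stem + ".pdf")


def ocr_directory(src_dir, out_dir, lang="eng"):
    """OCR every *.pdf in src_dir; originals are left untouched."""
    return [ocr_pdf(pdf, out_dir, lang) for pdf in sorted(Path(src_dir).glob("*.pdf"))]
```

Calling something like `ocr_directory("scans", "searchable")` (both directory names are placeholders) would then leave the OCRed copies in `searchable/`, ready for pdfgrep. Writing to a separate output directory keeps the original scans intact in case the OCR result is not good enough.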
I’ve added a tesseract bundle, although we don’t bundle the trained language models, so you’ll have to download those and set the TESSDATA_PREFIX variable to point to the directory containing them.
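To make the TESSDATA_PREFIX step a bit more concrete, here is a small sketch for a script that calls tesseract. It assumes the models are the usual `<lang>.traineddata` files (for example `eng.traineddata`) sitting directly in the directory you point the variable at. The helper names and any directory path you pass in are hypothetical.

```python
import os
from pathlib import Path


def missing_models(tessdata_dir, lang="eng"):
    """Return the codes in lang (e.g. "eng+deu") that have no
    <code>.traineddata file in tessdata_dir."""
    tessdata_dir = Path(tessdata_dir)
    return [code for code in lang.split("+")
            if not (tessdata_dir / f"{code}.traineddata").is_file()]


def tesseract_env(tessdata_dir):
    """Copy of the current environment with TESSDATA_PREFIX set."""
    env = dict(os.environ)
    env["TESSDATA_PREFIX"] = str(tessdata_dir)
    return env
```

The returned dictionary can be passed as `env=` to `subprocess.run` when invoking tesseract, and checking `missing_models` first gives a clearer error than a failed OCR run halfway through a large batch.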