For background, you may read my previous related post here.
I am going to introduce the Tesseract OCR engine (again). This time I am using Ubuntu 16.04, where the command to install it is:
sudo apt install tesseract-ocr
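Before going further, it may help to confirm that the tools used in this post are actually on your PATH. The snippet below is only a minimal sketch: missing_tools is a hypothetical helper name of my own, not something shipped with tesseract. It prints the name of every command it cannot find, so empty output means everything is in place.

```shell
# Hypothetical helper: print each given command that is NOT found on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        # command -v succeeds (and is silenced) when the tool is found
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check for the two programs this post relies on.
missing_tools tesseract convert
```

If convert shows up as missing, note that it belongs to ImageMagick, which on Ubuntu is packaged as imagemagick.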
If you have a PDF and want to convert it to an image for further processing, there are various methods. One of them is ImageMagick's convert:
convert input.pdf output.png
But this produces a relatively low-resolution image, which may lead to poor OCR output.
So instead, we use:
convert -density 300 -quality 100 input.pdf output.png
Here, -density 300 renders the PDF at 300 DPI instead of the default 72 DPI, and -quality 100 asks convert to use the highest quality setting for the output.
Note that if input.pdf is a multi-page PDF, convert creates one image per page, named output-0.png, output-1.png and so on.
Finally, run tesseract on the image:
tesseract output.png text_file -l eng
This creates text_file.txt in the current directory (tesseract adds the .txt extension itself). You may play with the various options of convert and tesseract based on your needs.
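For a multi-page PDF you would have to repeat the tesseract command for every generated image. The two steps could be combined roughly as in the sketch below. The function name ocr_pdf, the work directory argument and the page*.png naming are my own assumptions for illustration; only the convert and tesseract invocations are the ones shown above.

```shell
# Hypothetical helper: OCR every page of a PDF.
# Usage: ocr_pdf <pdf> <work_dir> [language, default eng]
ocr_pdf() {
    local pdf="$1" work_dir="$2" lang="${3:-eng}"
    mkdir -p "$work_dir" || return 1

    # Rasterize at 300 DPI. A multi-page PDF yields page-0.png, page-1.png, ...
    # while a single-page PDF yields just page.png.
    convert -density 300 -quality 100 "$pdf" "$work_dir/page.png" || return 1

    local img
    for img in "$work_dir"/page*.png; do
        [ -e "$img" ] || continue   # the glob matched nothing
        # tesseract appends .txt to the output base name on its own
        tesseract "$img" "${img%.png}" -l "$lang" || return 1
    done
}
```

Calling it as ocr_pdf input.pdf ocr_out should leave one .txt file next to each page image inside ocr_out. Each page gets its own text file, so you may still want to concatenate them afterwards.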