Info

We can combine poppler and tesseract to perform OCR (Optical Character Recognition) on scanned PDF files on Termux (an Android app).

Although post-processing is often still needed, it is better than re-typing everything from scratch.

Installation and Usage

On Termux:

apt install tesseract ocrad  poppler
  1. Extract PDF pages to images

Once poppler is installed, we will have a few pdfto... commands. Here we will use pdftoppm.

See pdftoppm -h for more help details.

With

pdftoppm -jpeg -f 100 -l 102 scanned.pdf img

we will get

img-100.jpg
img-101.jpg
img-102.jpg

The options -f and -l stand for the first and last page numbers of the range to be extracted.

Without these parameters it will convert all pages.

pdftoppm -jpeg scanned.pdf img
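
The page-range call above can be wrapped in a small helper that checks the range before running pdftoppm. This is only a sketch: the function name extract_pages and its argument order are my own invention, and it uses nothing beyond the -jpeg, -f and -l options shown above.

```shell
# extract_pages PDF FIRST LAST PREFIX
# Hypothetical helper (not part of poppler): converts pages FIRST..LAST of
# PDF to JPEG files named PREFIX-<page>.jpg.
extract_pages() {
    pdf=$1
    first=$2
    last=$3
    prefix=$4

    # Both page numbers must consist of digits only.
    for n in "$first" "$last"; do
        case "$n" in
            ''|*[!0-9]*)
                echo "page numbers must be whole numbers: '$n'" >&2
                return 1
                ;;
        esac
    done

    # An inverted range is most likely a typo, so refuse it.
    if [ "$first" -gt "$last" ]; then
        echo "first page ($first) is after last page ($last)" >&2
        return 1
    fi

    pdftoppm -jpeg -f "$first" -l "$last" "$pdf" "$prefix"
}
```

For example, extract_pages scanned.pdf 100 102 img should produce the same three files as the command above.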

  2. Download tesseract trained data

Download the pre-trained data files for your languages. The file names follow the pattern ABC.traineddata, for example jpn.traineddata.

Once they are downloaded, use the cp command to copy them to the tessdata directory: /data/data/com.termux/files/usr/share/tessdata

For example, if the files were downloaded to the phone's Download folder, the command will be:

cp /storage/emulated/0/Download/jpn.traineddata /data/data/com.termux/files/usr/share/tessdata/
  • To copy many files at once:

cp /storage/emulated/0/Download/*.traineddata /data/data/com.termux/files/usr/share/tessdata/

  • Now you can check available language models:

tesseract --list-langs
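
The copy step can also be written as a small function that reports which language models it installed. This is a sketch under my own assumptions: install_traineddata is a made-up name, and the default destination is simply the Termux tessdata path given above.

```shell
# install_traineddata SRC_DIR [DEST_DIR]
# Hypothetical helper: copies every *.traineddata file from SRC_DIR into
# DEST_DIR and prints the language name of each file it copied.
# DEST_DIR defaults to the Termux tessdata directory mentioned above.
install_traineddata() {
    src_dir=$1
    dest_dir=${2:-/data/data/com.termux/files/usr/share/tessdata}

    for f in "$src_dir"/*.traineddata; do
        # If nothing matched, the pattern is left unexpanded; skip it.
        [ -e "$f" ] || continue
        cp "$f" "$dest_dir"/ || return 1
        name=${f##*/}                 # strip the directory part
        echo "${name%.traineddata}"   # strip the suffix, leaving the language name
    done
}

# Example:
# install_traineddata /storage/emulated/0/Download
```

The printed names are the ones to pass to the -l option later.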

  3. OCR

Run tesseract -h for its concise help.

The basic syntax is: tesseract imagename outputbase -l eng

The option -l eng tells Tesseract that the language is English.

For example, with tesseract img-100.jpg text_page100 -l eng

we will get text_page100.txt.

We can also combine language models; see the section for advanced users below.

For advanced users

Tip: create a new folder for the images when the file has many pages, then use a bash script to process them all.

mkdir -p img
imgDPI=300
pdftoppm -jpeg -r $imgDPI scanned.pdf  img/p

The option -r 300 sets a higher resolution of 300 dpi (the default is 150 dpi).

A bash script to do all:

mkdir -p img
mkdir -p text/img
imgDPI=300

pdftoppm -jpeg -r $imgDPI scanned.pdf  img/p

for i in img/*.jpg; do 
    echo processing $i
    tesseract "$i" "text/${i%%.*}" --dpi $imgDPI -l eng
done
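
The output name in the loop relies on shell parameter expansion, which is easy to misread. The lines below are a small illustration with example paths of my own choosing (p-001.jpg and scans.v2 are hypothetical); they are not part of the script itself.

```shell
i="img/p-001.jpg"

# %%.* deletes the longest suffix that starts with a dot,
# i.e. everything from the FIRST dot onwards.
echo "text/${i%%.*}"     # prints: text/img/p-001

# %.* deletes the shortest such suffix, i.e. only the last extension.
j="scans.v2/p-001.jpg"   # hypothetical path with a dot in the directory name
echo "${j%%.*}"          # prints: scans
echo "${j%.*}"           # prints: scans.v2/p-001
```

Since tesseract appends .txt to the output base, the text for img/p-001.jpg ends up in text/img/p-001.txt, and the script creates text/img beforehand so that this path exists. If a directory or file name may contain extra dots, ${i%.*} is the safer form.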

To join all txt files into one file

Simply use this Neko 🐈 cat (meow) command to join:

cat text/img/*.txt > all_in_one.txt
# or
# cat text/**/*.txt > all_in_one.txt
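
When the joined file needs proofreading, it can help to know where each page starts. The following is a sketch of one way to do that; the function name join_pages and the marker format are my own choices.

```shell
# join_pages TEXT_DIR OUTPUT
# Hypothetical helper: concatenates TEXT_DIR/*.txt into OUTPUT and writes a
# marker line with the source file name before each page.
join_pages() {
    text_dir=$1
    output=$2

    : > "$output"                    # start with an empty output file
    for f in "$text_dir"/*.txt; do
        [ -e "$f" ] || continue      # no match: the pattern stays unexpanded
        {
            echo "===== ${f##*/} ====="
            cat "$f"
        } >> "$output"
    done
}

# Example:
# join_pages text/img all_in_one.txt
```

Keep OUTPUT outside TEXT_DIR, otherwise the loop would pick up the output file as well. The pages are joined in the shell's alphabetical order; pdftoppm pads page numbers with zeros (as in p-0500.jpg further down), so this should match the page order.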

More advanced

Read the extra help tesseract --help-extra and play with parameters to see which ones are more suitable for a particular document.

Here is the one for my case:


mkdir -p img
mkdir -p text/img
imgDPI=300

pdftoppm -jpeg -r $imgDPI scanned.pdf  img/p

for i in img/*.jpg; do 
    echo processing $i
    tesseract "$i" "text/${i%%.*}" --psm 4 --loglevel ERROR --dpi $imgDPI -l eng
done
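
The --psm option selects the page segmentation mode. According to tesseract --help-extra, 3 is the fully automatic default, 4 assumes a single column of text of variable sizes, and 6 assumes a single uniform block of text. To see which one suits a document, one page can be run through several modes. The helper below is a sketch; the name try_psm and the output file naming are my own.

```shell
# try_psm IMAGE OUT_DIR MODE...
# Hypothetical helper: OCRs one image once per page segmentation mode and
# writes OUT_DIR/psm<MODE>.txt for each, so the results can be compared.
try_psm() {
    image=$1
    out_dir=$2
    shift 2                      # the remaining arguments are the modes

    mkdir -p "$out_dir" || return 1
    for mode in "$@"; do
        echo "=== psm $mode"
        tesseract "$image" "$out_dir/psm$mode" --psm "$mode" -l eng
    done
}

# Example (the image path is only an illustration):
# try_psm img/p-001.jpg psm_test 3 4 6
```

The resulting text files can then be compared side by side before committing to one mode for the whole document.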

To combine languages

If the document contains multiple languages, you can combine the trained data. For example:

-l IAST+eng is for Sanskrit IAST and English. It will be useful in some cases.

I tried both -l IAST+eng and -l eng+IAST; there seemed to be no difference in the output.

mkdir -p img
mkdir -p text/img
imgDPI=300

pdftoppm -jpeg -r $imgDPI scanned.pdf  img/p

for i in img/*.jpg; do 
    echo processing $i
    tesseract "$i" "text/${i%%.*}" --psm 4 --loglevel ERROR --dpi $imgDPI -l IAST+eng
done
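
Before starting a long run with a combined -l value, it may be worth checking that every part of it is installed. The sketch below does this by reading the language list from standard input; langs_available is a made-up name, and I am assuming that tesseract --list-langs prints the list to standard output, one name per line, as in the listing in the Misc section.

```shell
# langs_available SPEC
# Hypothetical helper: reads a language list (one name per line) on standard
# input and succeeds only if every part of SPEC, split on "+", is in it.
# Missing names are reported on standard error.
langs_available() {
    spec=$1
    installed=$(cat)          # read the whole list from standard input
    missing=0

    old_ifs=$IFS
    IFS=+
    set -- $spec              # split SPEC on "+" into $1, $2, ...
    IFS=$old_ifs

    for lang in "$@"; do
        # -x: whole-line match, -F: literal string (names contain dots)
        if ! printf '%s\n' "$installed" | grep -qxF -- "$lang"; then
            echo "missing language: $lang" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example:
# tesseract --list-langs | langs_available IAST+eng && echo "all installed"
```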

More on Sanskrit (IAST) and pāḷi (IAST)

Note: the tests below were done with tesseract on Termux. On a standard Linux computer the results may be different (not tested yet).

I also tried using this for pāḷi (IAST) on Termux:

  • It needs to be combined with eng, since it failed to recognise some English words otherwise.

  • The trained data IASTuned_0.101000_1896935_14312000.traineddata seems to have better performance for my particular document.

  • At 150 dpi, all failed.

  • Using a higher resolution of 300 dpi will yield better results.

The original phrase : Alaṅkataṁ devapuraṁva rammaṁ
(from The Great Chronicle Of Buddhas - page 454)

Data | Img 150 dpi | Img 200 dpi | Img 300 dpi | 100% accuracy (150 / 200 / 300 dpi)
eng+IAST | Alaikataṁ devapuranva rammaṁ | Alaṅkataṁ devapuraṁva rammani | Alaṅkataṁ devapuraṁva rammaṁ | X / X / V
eng+IASTuned_0.088000_1951804_15423100 | Alaṅkataṁ devapuraṁva ramma | Alaṅkataṁ devapuramva rammaṁ | Alaṅkataṁ devapuraṁva rammaṁ | X / X / V
eng+IASTuned_0.101000_1896935_14312000 | Alaṅkataṁ devapuraṁva ramma | Alaṅkataṁ devapuraṁva rammaṁ | Alaṅkataṁ devapuraṁva rammaṁ | X / V / V
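
The V and X marks in the table can be reproduced mechanically by searching each output file for the original phrase. This is a minimal sketch; the function names has_phrase and mark are my own.

```shell
# has_phrase FILE PHRASE
# Hypothetical helper: succeeds if FILE contains PHRASE as a literal string.
has_phrase() {
    grep -qF -- "$2" "$1"
}

# mark FILE PHRASE
# Prints V if the phrase was recognised exactly, X otherwise.
mark() {
    if has_phrase "$1" "$2"; then
        echo V
    else
        echo X
    fi
}

# Example (file name as produced by the commands in this section):
# mark 300p-500_psm.txt 'Alaṅkataṁ devapuraṁva rammaṁ'
```

A literal comparison treats a precomposed character and the same letter built from a base letter plus combining marks as different, so the phrase should be typed in the same form that tesseract writes.
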
  • Commands used for the 150 vs 200 vs 300 dpi comparison:

tesseract 300/p-0500.jpg 300p-500_psm --psm 4 --loglevel ALL --dpi 300 -l eng+IAST

tesseract 300/p-0500.jpg 300p-500_psm01 --psm 4 --loglevel ALL --dpi 300 -l eng+IASTuned_0.101000_1896935_14312000

tesseract 300/p-0500.jpg 300p-500_psm08 --psm 4 --loglevel ALL --dpi 300 -l eng+IASTuned_0.088000_1951804_15423100

tesseract 200/p-0500.jpg 200p-500_psm --psm 4 --loglevel ALL --dpi 200 -l eng+IAST

tesseract 200/p-0500.jpg 200p-500_psm01 --psm 4 --loglevel ALL --dpi 200 -l eng+IASTuned_0.101000_1896935_14312000

tesseract 200/p-0500.jpg 200p-500_psm08 --psm 4 --loglevel ALL --dpi 200 -l eng+IASTuned_0.088000_1951804_15423100

tesseract 150/p-0500.jpg 150p-500_psm --psm 4 --loglevel ALL --dpi 150 -l eng+IAST

tesseract 150/p-0500.jpg 150p-500_psm01 --psm 4 --loglevel ALL --dpi 150 -l eng+IASTuned_0.101000_1896935_14312000

tesseract 150/p-0500.jpg 150p-500_psm08 --psm 4 --loglevel ALL --dpi 150 -l eng+IASTuned_0.088000_1951804_15423100
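
The nine commands above differ only in resolution and model, so they can be generated with two nested loops. This is a sketch: compare_runs is a hypothetical name, the 150/, 200/ and 300/ directories follow the commands above, and the output naming is my own simplification.

```shell
# compare_runs PAGE MODEL...
# Hypothetical helper: for each resolution (150, 200, 300) and each MODEL,
# OCRs <dpi>/PAGE.jpg with -l eng+MODEL and writes <dpi>_<MODEL>.txt.
compare_runs() {
    page=$1
    shift                        # the remaining arguments are model names

    for dpi in 150 200 300; do
        for model in "$@"; do
            tesseract "$dpi/$page.jpg" "${dpi}_$model" \
                --psm 4 --loglevel ALL --dpi "$dpi" -l "eng+$model"
        done
    done
}

# Example:
# compare_runs p-0500 IAST IASTuned_0.101000_1896935_14312000
```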

Finally, this is the script I used for my document.

mkdir -p img
mkdir -p text/img

imgDPI=300
FILE='scanned.pdf'

pdftoppm -jpeg -r $imgDPI "$FILE" img/p


# "${i%%.*}" below is to remove file ext
for i in img/*.jpg; do 
    echo "=== processing $i"
    tesseract "$i" "text/${i%%.*}" --psm 4 --loglevel ALL --dpi $imgDPI -l eng+IASTuned_0.101000_1896935_14312000
done;

echo "done"
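
On a phone a long run can easily be interrupted. The variant below is a sketch of how the loop could skip pages that already have a non-empty text file, so that it can simply be started again. The function name ocr_missing and its argument order are my own; the tesseract options are the ones used above.

```shell
# ocr_missing IMG_DIR OUT_DIR LANGS DPI
# Hypothetical helper: OCRs every IMG_DIR/*.jpg whose OUT_DIR/<name>.txt is
# missing or empty, so an interrupted run can be restarted.
ocr_missing() {
    img_dir=$1
    out_dir=$2
    langs=$3
    dpi=$4

    mkdir -p "$out_dir" || return 1
    for i in "$img_dir"/*.jpg; do
        [ -e "$i" ] || continue      # no match: the pattern stays unexpanded
        name=${i##*/}                # file name without the directory
        base=$out_dir/${name%.*}     # output base without the extension
        if [ -s "$base.txt" ]; then
            echo "=== skipping $i"
            continue
        fi
        echo "=== processing $i"
        tesseract "$i" "$base" --psm 4 --dpi "$dpi" -l "$langs"
    done
}

# Example:
# ocr_missing img text/img eng+IASTuned_0.101000_1896935_14312000 300
```

Pages that produce an empty text file are processed again on every run, which seems an acceptable price for also catching files left empty by an interrupted run.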

Misc

  • If you have thousands of pages to process, it is better to do so on a computer or a cloud machine.

  • tesseract --version

tesseract --version

tesseract 5.1.0
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.3) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.5.0
 Found NEON
 Found libcurl/7.83.1 OpenSSL/3.0.3 zlib/1.2.12 libssh2/1.10.0 nghttp2/1.47.0

  • tesseract --list-langs
tesseract --list-langs 
List of available languages in "/data/data/com.termux/files/usr/share/tessdata/" (6):
IAST
IASTuned_0.088000_1951804_15423100
IASTuned_0.101000_1896935_14312000
eng
osd
vie

Pre-processing images which are yellow in color before doing OCR

  • If the scanned images are yellow etc., you can try the textcleaner script:

http://www.fmwconcepts.com/imagemagick/textcleaner/index.php

textcleaner -g -e stretch -f 25 -o 5 -s 1 p-223.jpg 223.jpg
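
To clean every page before OCR, the same command can be run in a loop. This is a sketch that assumes the textcleaner script has been downloaded, made executable and put on the PATH, and that ImageMagick (which the script depends on) is installed. The function name clean_images is my own, and the options are the ones from the command above, which may need tuning for other scans.

```shell
# clean_images SRC_DIR DEST_DIR
# Hypothetical helper: runs textcleaner on every SRC_DIR/*.jpg and writes
# the result under the same file name in DEST_DIR.
clean_images() {
    src_dir=$1
    dest_dir=$2

    mkdir -p "$dest_dir" || return 1
    for f in "$src_dir"/*.jpg; do
        [ -e "$f" ] || continue      # no match: the pattern stays unexpanded
        echo "cleaning $f"
        textcleaner -g -e stretch -f 25 -o 5 -s 1 "$f" "$dest_dir/${f##*/}"
    done
}

# Example:
# clean_images img img_clean
```

The OCR loop can then read from the cleaned directory instead of img.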