Info
We can combine poppler and tesseract to perform OCR (Optical Character Recognition) on scanned PDF files on Termux (an Android app).
Although post-processing is often needed, it is still better than retyping everything from scratch.
Installation and Usage
On Termux:
apt install tesseract ocrad poppler
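Before going further, it can help to confirm that the commands are really on the PATH. The sketch below is only an illustration: the helper name check_tools is made up for this example, and it relies on nothing more than the shell built-in command -v.

```shell
# check_tools: print each named command that is NOT found on PATH.
# Returns 0 when everything is present, 1 otherwise.
# (check_tools is a hypothetical helper name, not part of Termux.)
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For instance, `check_tools tesseract pdftoppm || echo "install the packages first"` prints a hint when one of them is absent.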
- Extract PDF pages to images
Once poppler is installed, we will have a few pdfto... commands. Here we will use pdftoppm.
See pdftoppm -h for more details.
With
pdftoppm -jpeg -f 100 -l 102 scanned.pdf img
we will get
img-100.jpg
img-101.jpg
img-102.jpg
The options -f and -l set the first and last page numbers of the range to extract.
Without them, all pages are converted:
pdftoppm -jpeg scanned.pdf img
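When scripting around the generated files it is handy to predict their names. As far as I understand, pdftoppm zero-pads the page number to the number of digits in the document's total page count (which would explain names such as p-0500.jpg for a document with more than 999 pages). The helper below is a sketch under that assumption; page_image_name is a made-up name, so check its result against your own output files.

```shell
# page_image_name PREFIX PAGE TOTAL_PAGES
# Prints the JPEG name we would expect for one page, ASSUMING the page
# number is zero-padded to the digit count of TOTAL_PAGES.
# (hypothetical helper; the padding rule is an assumption to verify)
page_image_name() {
    prefix=$1
    page=$2
    total=$3
    width=${#total}          # number of digits in the total page count
    printf "%s-%0${width}d.jpg\n" "$prefix" "$page"
}
```

With these assumptions, `page_image_name img 100 250` gives img-100.jpg and `page_image_name img/p 500 1200` gives img/p-0500.jpg.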
- Download tesseract trained data
Download the pre-trained data files for your languages. The filenames follow the pattern ABC.traineddata, where ABC is the language name. For example: jpn.traineddata
Once they are downloaded, use the cp command to copy them to the tessdata directory: /data/data/com.termux/files/usr/share/tessdata
For example, if the files were downloaded to the phone's Download folder, the command is:
cp /storage/emulated/0/Download/jpn.traineddata /data/data/com.termux/files/usr/share/tessdata/
- To copy many files at once:
cp /storage/emulated/0/Download/*.traineddata /data/data/com.termux/files/usr/share/tessdata/
- Now you can check available language models:
tesseract --list-langs
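In a script, a quick file check can tell us whether a model was copied into place before we start a long OCR run. This is a minimal sketch: has_lang is a hypothetical helper that only looks at file names in the directory it is given and does not ask tesseract itself.

```shell
# has_lang TESSDATA_DIR LANG
# Succeeds when TESSDATA_DIR/LANG.traineddata exists as a regular file.
# (has_lang is a made-up helper name for this example)
has_lang() {
    [ -f "$1/$2.traineddata" ]
}
```

For example, `has_lang /data/data/com.termux/files/usr/share/tessdata jpn || echo "jpn.traineddata is not installed yet"`.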
- OCR
Run tesseract -h
for its concise help.
The basic syntax is:
tesseract imagename outputbase -l eng
The option -l eng tells Tesseract that the language is English.
For example, with
tesseract img-100.jpg text_page100 -l eng
we will get text_page100.txt.
We can also combine language models; see the section for advanced users below.
For advanced users
Tip: create a new folder for the images when the file has many pages. Then use a bash script to process them all.
mkdir -p img
imgDPI=300
pdftoppm -jpeg -r $imgDPI scanned.pdf img/p
The -r 300 option sets a higher resolution of 300 dpi (the default is 150 dpi).
A bash script that does everything:
mkdir -p img
mkdir -p text/img
imgDPI=300
pdftoppm -jpeg -r $imgDPI scanned.pdf img/p
for i in img/*.jpg; do
echo processing $i
tesseract "$i" "text/${i%%.*}" --dpi $imgDPI -l eng
done
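The loop relies on the parameter expansion ${i%%.*} to drop the file extension. The sketch below compares it with the closely related ${i%.*}; the two function names are made up for the illustration.

```shell
# strip_from_first_dot: remove the LONGEST suffix matching ".*",
# i.e. everything from the first dot onwards (what ${i%%.*} does).
strip_from_first_dot() {
    printf '%s\n' "${1%%.*}"
}

# strip_last_ext: remove the SHORTEST suffix matching ".*",
# i.e. only the final extension (what ${i%.*} does).
strip_last_ext() {
    printf '%s\n' "${1%.*}"
}
```

For img/p-001.jpg both print img/p-001, so the script works as written. They differ only when an earlier dot appears in the path: for scan.v2/p-001.jpg the first prints scan and the second prints scan.v2/p-001.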
To join all txt files into one file
Simply use the Neko 🐈 cat (meow) command to join them:
cat text/img/*.txt > all_in_one.txt
# or
# cat text/**/*.txt > all_in_one.txt
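During post-processing it is sometimes useful to know which page a piece of text came from. The sketch below joins the files like cat does but prints a marker line before each one; join_pages and the marker format are made up for this example.

```shell
# join_pages DIR
# Print every .txt file in DIR, in glob (alphabetical) order, each one
# preceded by a marker line naming the file it came from.
# (hypothetical helper; the "=== name ===" marker is an arbitrary choice)
join_pages() {
    for f in "$1"/*.txt; do
        [ -f "$f" ] || continue      # skip the literal pattern if nothing matched
        printf '=== %s ===\n' "${f##*/}"
        cat "$f"
    done
}
```

Usage would be `join_pages text/img > all_in_one.txt`. Because the page numbers in the file names are zero-padded, alphabetical order is also page order.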
More advanced
Read the extended help with tesseract --help-extra and experiment with the parameters to see which ones suit a particular document.
Here is what I used in my case:
mkdir -p img
mkdir -p text/img
imgDPI=300
pdftoppm -jpeg -r $imgDPI scanned.pdf img/p
for i in img/*.jpg; do
echo processing $i
tesseract "$i" "text/${i%%.*}" --psm 4 --loglevel ERROR --dpi $imgDPI -l eng
done
To combine languages
If the document contains multiple languages, you can combine the trained data. For example, -l IAST+eng is for Sanskrit IAST and English. It will be useful in some cases.
I tried both -l IAST+eng and -l eng+IAST; there seemed to be no difference in the output.
mkdir -p img
mkdir -p text/img
imgDPI=300
pdftoppm -jpeg -r $imgDPI scanned.pdf img/p
for i in img/*.jpg; do
echo processing $i
tesseract "$i" "text/${i%%.*}" --psm 4 --loglevel ERROR --dpi $imgDPI -l IAST+eng
done
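If the list of languages changes from document to document, the value for -l can be built from separate names instead of being typed by hand. A minimal sketch, where join_langs is a hypothetical helper name:

```shell
# join_langs LANG...
# Join the given language names with "+" to form the value for -l.
# Runs in a subshell so the changed IFS does not leak into the caller.
join_langs() {
    ( IFS=+; printf '%s\n' "$*" )
}
```

For example, `tesseract "$i" out -l "$(join_langs eng IAST)"` passes -l eng+IAST.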
More on Sanskrit (IAST) and pāḷi (IAST)
- Sanskrit (IAST) trained data: https://github.com/Shreeshrii/tesstrain-Sanskrit-IAST
Note: the tests below were done with tesseract on Termux. On a standard Linux computer the results may differ (not tested yet).
I also tried using this for pāḷi (IAST) on Termux:
- It needs to be combined with eng, since it otherwise failed to recognise some English words.
- The trained data IASTuned_0.101000_1896935_14312000.traineddata seems to perform better for my particular document.
- At 150 dpi, all models failed.
- A higher resolution of 300 dpi yields better results.
The original phrase:
Alaṅkataṁ devapuraṁva rammaṁ
(from The Great Chronicle Of Buddhas - page 454)
Data | Img 150 dpi | Img 200 dpi | Img 300 dpi | 100% accurate? (150, 200, 300 dpi) |
---|---|---|---|---|
eng+IAST | Alaikataṁ devapuranva rammaṁ | Alaṅkataṁ devapuraṁva rammani | Alaṅkataṁ devapuraṁva rammaṁ | X X V |
eng+IASTuned_0.088000_1951804_15423100 | Alaṅkataṁ devapuraṁva ramma | Alaṅkataṁ devapuramva rammaṁ | Alaṅkataṁ devapuraṁva rammaṁ | X X V |
eng+IASTuned_0.101000_1896935_14312000 | Alaṅkataṁ devapuraṁva ramma | Alaṅkataṁ devapuraṁva rammaṁ | Alaṅkataṁ devapuraṁva rammaṁ | X V V |
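
Checking each output by eye gets tedious, so the comparison in the table can be partly automated. The sketch below tests whether a known phrase appears verbatim in an OCR output file; contains_phrase is a made-up helper name, and it only covers the exact-match case.

```shell
# contains_phrase FILE PHRASE
# Succeeds when PHRASE occurs verbatim in FILE.
# -F treats PHRASE as a fixed string rather than a regular expression.
# (hypothetical helper for this example)
contains_phrase() {
    grep -qF -- "$2" "$1"
}
```

For example, `contains_phrase 300p-500_psm01.txt 'Alaṅkataṁ devapuraṁva rammaṁ' && echo V || echo X`.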
- Commands used for the 150 vs 200 vs 300 dpi comparison:
tesseract 300/p-0500.jpg 300p-500_psm --psm 4 --loglevel ALL --dpi 300 -l eng+IAST
tesseract 300/p-0500.jpg 300p-500_psm01 --psm 4 --loglevel ALL --dpi 300 -l eng+IASTuned_0.101000_1896935_14312000
tesseract 300/p-0500.jpg 300p-500_psm08 --psm 4 --loglevel ALL --dpi 300 -l eng+IASTuned_0.088000_1951804_15423100
tesseract 200/p-0500.jpg 200p-500_psm --psm 4 --loglevel ALL --dpi 200 -l eng+IAST
tesseract 200/p-0500.jpg 200p-500_psm01 --psm 4 --loglevel ALL --dpi 200 -l eng+IASTuned_0.101000_1896935_14312000
tesseract 200/p-0500.jpg 200p-500_psm08 --psm 4 --loglevel ALL --dpi 200 -l eng+IASTuned_0.088000_1951804_15423100
tesseract 150/p-0500.jpg 150p-500_psm --psm 4 --loglevel ALL --dpi 150 -l eng+IAST
tesseract 150/p-0500.jpg 150p-500_psm01 --psm 4 --loglevel ALL --dpi 150 -l eng+IASTuned_0.101000_1896935_14312000
tesseract 150/p-0500.jpg 150p-500_psm08 --psm 4 --loglevel ALL --dpi 150 -l eng+IASTuned_0.088000_1951804_15423100
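The nine commands differ only in the dpi and the model, so they can also be generated with two nested loops. The sketch below prints the commands instead of running them, so the list can be reviewed first. The function name print_dpi_commands and the output base names are made up for this example; the 150/, 200/ and 300/ folders are assumed to hold the images as in the commands above.

```shell
# print_dpi_commands PAGE_IMAGE
# Print (not run) one tesseract command per dpi/model pair.
# Output base names of the form <dpi>_<model> are an arbitrary choice.
print_dpi_commands() {
    page=$1
    for dpi in 300 200 150; do
        for model in IAST \
                     IASTuned_0.101000_1896935_14312000 \
                     IASTuned_0.088000_1951804_15423100; do
            echo "tesseract $dpi/$page ${dpi}_${model} --psm 4 --loglevel ALL --dpi $dpi -l eng+$model"
        done
    done
}
```

Once the printed list looks right, `print_dpi_commands p-0500.jpg | sh` would run it.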
Finally, this is the script I used for my document:
mkdir -p img
mkdir -p text/img
imgDPI=300
FILE='scanned.pdf'
pdftoppm -jpeg -r $imgDPI "$FILE" img/p
# "${i%%.*}" below is to remove file ext
for i in img/*.jpg; do
echo "=== processing $i"
tesseract "$i" "text/${i%%.*}" --psm 4 --loglevel ALL --dpi $imgDPI -l eng+IASTuned_0.101000_1896935_14312000
done;
echo "done"
Misc
- If you have thousands of pages to extract, it is better to do it on a desktop computer or a cloud machine.
- tesseract --version
tesseract --version
tesseract 5.1.0
leptonica-1.82.0
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.3) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.5.0
Found NEON
Found libcurl/7.83.1 OpenSSL/3.0.3 zlib/1.2.12 libssh2/1.10.0 nghttp2/1.47.0
- tesseract --list-langs
tesseract --list-langs
List of available languages in "/data/data/com.termux/files/usr/share/tessdata/" (6):
IAST
IASTuned_0.088000_1951804_15423100
IASTuned_0.101000_1896935_14312000
eng
osd
vie
Pre-processing images which are yellow in color before doing OCR
- If the scanned images are yellowed or similarly discoloured, you can try the textcleaner script:
http://www.fmwconcepts.com/imagemagick/textcleaner/index.php
textcleaner -g -e stretch -f 25 -o 5 -s 1 p-223.jpg 223.jpg
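To clean a whole folder of page images with the same options, the command can be wrapped in a loop. This is only a sketch: clean_all and the TEXTCLEANER variable are made up for this example, and the options are simply the ones from the command above, which may need tuning for other scans.

```shell
# clean_all SRC_DIR DST_DIR
# Run textcleaner, with the options shown above, on every .jpg in SRC_DIR
# and write the results under the same file names in DST_DIR.
# TEXTCLEANER (hypothetical variable) can name the script if it is not
# on PATH under the name "textcleaner".
clean_all() {
    cleaner=${TEXTCLEANER:-textcleaner}
    mkdir -p "$2" || return 1
    for f in "$1"/*.jpg; do
        [ -f "$f" ] || continue      # nothing matched the pattern
        "$cleaner" -g -e stretch -f 25 -o 5 -s 1 "$f" "$2/${f##*/}" || return 1
    done
}
```

For example, `clean_all img img_clean` would leave the originals untouched and put the cleaned pages in img_clean, which can then be fed to the OCR loop.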