
Parse Text from PDF with OCR using Apache Tika

Apache Tika is an open-source toolkit for extracting text and metadata from documents. It can also run OCR on PDF and image files by handing the images to Tesseract. This example uses a Java Maven project to work with Apache Tika.

First of all, add the following dependencies for Apache Tika to your pom.xml:

        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers</artifactId>
            <version>1.18</version>
        </dependency>
        <dependency>
            <groupId>com.levigo.jbig2</groupId>
            <artifactId>levigo-jbig2-imageio</artifactId>
            <version>2.0</version>
        </dependency>  
        <dependency>
            <groupId>com.github.jai-imageio</groupId>
            <artifactId>jai-imageio-core</artifactId>
            <version>1.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.xerial</groupId>
            <artifactId>sqlite-jdbc</artifactId>
            <version>3.23.1</version>
        </dependency>
        <dependency>
            <groupId>com.github.jai-imageio</groupId>
            <artifactId>jai-imageio-jpeg2000</artifactId>
            <version>1.3.0</version>
        </dependency>
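
The dependencies above only cover the Java side. The OCR step calls an external Tesseract program, so it has to be installed on the machine as well. Below is a minimal sketch for checking that before parsing. The class name `TesseractCheck` and the method `isTesseractAvailable` are my own names for illustration, not part of Tika, and the sketch assumes Java 9 or newer (for `Redirect.DISCARD`).

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical helper (not part of Tika): checks whether a Tesseract
// executable can be started on this machine.
final class TesseractCheck {

    // Runs "<executable> --version" and reports whether it exited normally.
    static boolean isTesseractAvailable(String executable) {
        try {
            Process process = new ProcessBuilder(executable, "--version")
                    .redirectErrorStream(true)                        // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)  // ignore the output (Java 9+)
                    .start();
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                process.destroyForcibly();                            // did not finish in time
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            return false;                                             // not found or not runnable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();                       // keep the interrupt flag
            return false;
        }
    }

    public static void main(String[] args) {
        // Pass the full path to the executable if it is not on the PATH.
        String executable = args.length > 0 ? args[0] : "tesseract";
        System.out.println("Tesseract available: " + isTesseractAvailable(executable));
    }
}
```

If the check returns false, Tika will most likely still extract the normal text layer of the PDF, but scanned pages and inline images will not be recognized.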

Then import these classes into your class:


import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.exception.TikaException;
import org.apache.tika.language.detect.LanguageDetector;
import org.apache.tika.language.detect.LanguageResult;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

Below is example code that parses a PDF file to text:

InputStream pdf = null;
try {
    pdf = Files.newInputStream(Paths.get("c:/myfolder/myfile.pdf"));
} catch (IOException e) {
    // The input file could not be opened
    e.printStackTrace();
}
ByteArrayOutputStream out = new ByteArrayOutputStream();

TikaConfig config = TikaConfig.getDefaultConfig();
// TikaConfig fromFile = new TikaConfig("/path/to/file");
BodyContentHandler handler = new BodyContentHandler(out);
Parser parser = new AutoDetectParser(config);
Metadata meta = new Metadata();
ParseContext parsecontext = new ParseContext();

PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);
List<LanguageDetector> listDetector = LanguageDetector.getLanguageDetectors();
String language = "eng";
if (listDetector.size() > 0) {
    LanguageDetector detector = listDetector.get(0);
    LanguageResult languageResult = detector.detect("กขค"); // sample text in the expected language
    language = languageResult.getLanguage();
}
System.out.println("Language Result = " + language);
TesseractOCRConfig tesserConfig = new TesseractOCRConfig();
tesserConfig.setLanguage(language);
tesserConfig.setTesseractPath("C:/PROGRA~2/Tesseract-OCR");

parsecontext.set(Parser.class, parser);
parsecontext.set(PDFParserConfig.class, pdfConfig);
parsecontext.set(TesseractOCRConfig.class, tesserConfig);

try {
    parser.parse(pdf, handler, meta, parsecontext);
} catch (IOException e) {
    // Reading the input stream failed
    e.printStackTrace();
} catch (SAXException e) {
    // The content handler could not process the parser output
    e.printStackTrace();
} catch (TikaException e) {
    // Tika could not parse the document
    e.printStackTrace();
}
System.out.println(new String(out.toByteArray(), Charset.defaultCharset()));
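
A note on the last line of the example: the text goes through a byte stream and is decoded again with the machine's default character set. As far as I can tell from its Javadoc, `BodyContentHandler(OutputStream)` also encodes with the default encoding, so the two sides match, but characters that the default charset cannot represent (for example Thai text on a Western Windows code page) may be lost on the way. The sketch below uses only the JDK to show the effect. `CharsetRoundTrip` and `roundTrip` are hypothetical names for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo (JDK only): encode text to bytes and decode it again
// with the same charset, to see whether the text survives.
final class CharsetRoundTrip {

    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset); // unmappable characters are replaced
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Three Thai letters, written as Unicode escapes so the source
        // file encoding does not matter.
        String thai = "\u0E01\u0E02\u0E04";

        System.out.println("UTF-8 keeps the text: "
                + roundTrip(thai, StandardCharsets.UTF_8).equals(thai));
        System.out.println("ISO-8859-1 keeps the text: "
                + roundTrip(thai, StandardCharsets.ISO_8859_1).equals(thai));
        System.out.println("Default charset on this machine: " + Charset.defaultCharset());
    }
}
```

If the default charset on your machine is not UTF-8, one option is to give `BodyContentHandler` a `Writer` (for example a `StringWriter`) instead of an `OutputStream`, so the text never passes through bytes.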

Explanation of the code

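  • Line 3
    Set up your input file with Files.newInputStream. The file can be a PDF or an image such as a JPG, depending on your content.
  • Line 10 - 18
    Set up the configuration for the parser (TikaConfig, BodyContentHandler, AutoDetectParser, Metadata, ParseContext and PDFParserConfig).
  • Line 19 - 26
    Try to detect the language with a language detector. If no detector is available, the default value of the variable language at line 20 is used.
  • Line 27 - 33
    Set up Tesseract OCR (language and install path) and register the parser and both configurations in the parse context.
  • Line 36
    Begin parsing text from the input file opened at line 3.
  • Line 47
    Print the output as a string using the machine's default character set.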
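
One thing to watch in the language detection step (lines 19 - 28 of the example): as far as I know, Tika's language detectors report short ISO 639-1 style codes such as "th" or "en", while Tesseract names its language data with three-letter codes such as "tha" or "eng". Passing the detected value straight to setLanguage may therefore not match an installed language pack. Below is a minimal sketch of a lookup that could sit between the two. The class `TesseractLanguages`, the method `toTesseract` and the small table are my own assumptions for illustration, not part of Tika or Tesseract, and `Map.of` needs Java 9 or newer.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not part of Tika): maps a detected two-letter
// language code to the name Tesseract uses for its language data.
final class TesseractLanguages {

    // Deliberately small table; extend it with the languages you need.
    private static final Map<String, String> ISO_TO_TESSERACT = Map.of(
            "en", "eng",
            "th", "tha",
            "de", "deu",
            "fr", "fra",
            "ja", "jpn");

    // Returns the Tesseract name for the detected code, or the fallback
    // when the code is null or not in the table.
    static String toTesseract(String detected, String fallback) {
        if (detected == null) {
            return fallback;
        }
        return ISO_TO_TESSERACT.getOrDefault(detected.toLowerCase(Locale.ROOT), fallback);
    }

    public static void main(String[] args) {
        System.out.println(toTesseract("th", "eng"));
        System.out.println(toTesseract("xx", "eng"));
    }
}
```

In the example it could be used as tesserConfig.setLanguage(TesseractLanguages.toTesseract(language, "eng")), so an unknown or missing result falls back to English.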







