
Parse Text from PDF with OCR using Apache Tika

Apache Tika is an open-source toolkit that can extract text, including text recognized by OCR, from PDF and image files. This example uses a Java Maven project to work with Apache Tika.

First, add the following dependencies for Apache Tika to your pom.xml:

        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers</artifactId>
            <version>1.18</version>
        </dependency>
        <dependency>
            <groupId>com.levigo.jbig2</groupId>
            <artifactId>levigo-jbig2-imageio</artifactId>
            <version>2.0</version>
        </dependency>  
        <dependency>
            <groupId>com.github.jai-imageio</groupId>
            <artifactId>jai-imageio-core</artifactId>
            <version>1.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.xerial</groupId>
            <artifactId>sqlite-jdbc</artifactId>
            <version>3.23.1</version>
        </dependency>
        <dependency>
            <groupId>com.github.jai-imageio</groupId>
            <artifactId>jai-imageio-jpeg2000</artifactId>
            <version>1.3.0</version>
        </dependency>

Then import these classes into your class:


import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.exception.TikaException;
import org.apache.tika.language.detect.LanguageDetector;
import org.apache.tika.language.detect.LanguageResult;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

Below is example code that parses a PDF file to text:

InputStream pdf = null;
try {
 pdf = Files.newInputStream(Paths.get("c:/myfolder/myfile.pdf"));
} catch (IOException e) {
 // The input file could not be opened; pdf stays null.
 e.printStackTrace();
}
ByteArrayOutputStream out = new ByteArrayOutputStream();
 
TikaConfig config = TikaConfig.getDefaultConfig();
// TikaConfig fromFile = new TikaConfig("/path/to/file");
BodyContentHandler handler = new BodyContentHandler(out);
Parser parser = new AutoDetectParser(config);
Metadata meta = new Metadata();
ParseContext parsecontext = new ParseContext();

PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);
List<LanguageDetector> listDetector = LanguageDetector.getLanguageDetectors();
String language = "eng";
if (listDetector.size() > 0) {
 LanguageDetector detector = listDetector.get(0);
 LanguageResult languageResult = detector.detect("กขค");
 language = languageResult.getLanguage();
}
System.out.println("Language Result = "+language);
TesseractOCRConfig tesserConfig = new TesseractOCRConfig();
tesserConfig.setLanguage(language);
tesserConfig.setTesseractPath("C:/PROGRA~2/Tesseract-OCR");
   
parsecontext.set(Parser.class, parser);
parsecontext.set(PDFParserConfig.class, pdfConfig);
parsecontext.set(TesseractOCRConfig.class, tesserConfig);  
  
try {
 parser.parse(pdf, handler, meta, parsecontext);
} catch (IOException e) {
 // Reading the input stream failed.
 e.printStackTrace();
} catch (SAXException e) {
 // The content handler could not process the extracted content.
 e.printStackTrace();
} catch (TikaException e) {
 // Tika could not parse the document.
 e.printStackTrace();
}
System.out.println(new String(out.toByteArray(), Charset.defaultCharset()));  
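
The last statement above decodes the captured bytes with the machine's default character set. That only gives the right text if the bytes were written with the same character set. The sketch below uses only the JDK (no Tika) to show a round trip through a ByteArrayOutputStream with an explicit charset on both sides. The class name CharsetRoundTrip and its methods are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetRoundTrip {

    // Encode text into an in-memory byte buffer with the given charset,
    // similar to what a handler wrapping an OutputStream does.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decode the captured bytes with the given charset.
    static String read(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Three Thai letters, written as Unicode escapes.
        String thai = "\u0E01\u0E02\u0E04";
        byte[] bytes = write(thai, StandardCharsets.UTF_8);

        // Same charset on both sides: the text survives.
        System.out.println(read(bytes, StandardCharsets.UTF_8).equals(thai));
        // Different charset when decoding: the text is garbled.
        System.out.println(read(bytes, StandardCharsets.ISO_8859_1).equals(thai));
    }
}
```

With Tika, one way to get the same control is to pass a Writer with an explicit charset to BodyContentHandler instead of a raw OutputStream. Treat this as a suggestion to try, not as the only correct setup.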

Explanation of the code

  • Line 3
    Opens the input file. It can be a PDF or an image such as a JPG, depending on your content.
  • Line 10 - 18
    Sets up the configuration for the parser.
  • Line 19 - 26
    Tries to detect the language with a language detector. If no detector is available, the default value of the variable language, declared at line 20, is used.
  • Line 27 - 33
    Sets up Tesseract OCR.
  • Line 36
    Parses the text from the input file opened at line 3.
  • Line 47
    Prints the output as a string using the machine's default character set.
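
One thing to check in the language step: Tesseract language packs are named with three-letter codes such as eng or tha, while a language detector may return a two-letter code such as th. The sketch below is a minimal JDK-only helper for that conversion. It assumes the detector returns an ISO 639-1 code or a tag like zh-CN; the class name TesseractLanguage and the method toThreeLetter are made up for this illustration.

```java
import java.util.Locale;
import java.util.MissingResourceException;

final class TesseractLanguage {

    // Convert a detected language code (for example "th" or "zh-CN")
    // to the three-letter ISO 639-2 code known to the JDK (for example "tha").
    // Returns the fallback when the input is missing or unknown.
    static String toThreeLetter(String detected, String fallback) {
        if (detected == null || detected.isEmpty()) {
            return fallback;
        }
        // forLanguageTag keeps only well-formed parts of the tag.
        Locale locale = Locale.forLanguageTag(detected);
        try {
            String iso3 = locale.getISO3Language();
            return iso3.isEmpty() ? fallback : iso3;
        } catch (MissingResourceException e) {
            // The JDK has no three-letter code for this language.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(toThreeLetter("th", "eng"));
        System.out.println(toThreeLetter("en", "eng"));
        System.out.println(toThreeLetter(null, "eng"));
    }
}
```

The result could then be passed to tesserConfig.setLanguage(...) instead of the raw detector output. Some Tesseract packs use other names (for example chi_sim), so treat the converted code as a candidate and confirm that a matching traineddata file is installed.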







