Parse Text from PDF with OCR using Apache Tika

Apache Tika is an open-source toolkit for extracting text and metadata from many file types. Combined with Tesseract, it can run OCR on PDF and image files. This example uses a Java Maven project to work with Apache Tika.

First, add the following dependencies to your pom.xml:

        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers</artifactId>
            <version>1.18</version>
        </dependency>
        <dependency>
            <groupId>com.levigo.jbig2</groupId>
            <artifactId>levigo-jbig2-imageio</artifactId>
            <version>2.0</version>
        </dependency>  
        <dependency>
            <groupId>com.github.jai-imageio</groupId>
            <artifactId>jai-imageio-core</artifactId>
            <version>1.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.xerial</groupId>
            <artifactId>sqlite-jdbc</artifactId>
            <version>3.23.1</version>
        </dependency>
        <dependency>
            <groupId>com.github.jai-imageio</groupId>
            <artifactId>jai-imageio-jpeg2000</artifactId>
            <version>1.3.0</version>
        </dependency>

Then import these classes into your class:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.exception.TikaException;
import org.apache.tika.language.detect.LanguageDetector;
import org.apache.tika.language.detect.LanguageResult;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

Below is example code that parses a PDF file to text:

InputStream pdf = null;
try {
 pdf = Files.newInputStream(Paths.get("c:/myfolder/myfile.pdf"));
} catch (IOException e) {
 // Without the input file nothing can be parsed, so stop here
 throw new java.io.UncheckedIOException(e);
}
ByteArrayOutputStream out = new ByteArrayOutputStream();
 
TikaConfig config = TikaConfig.getDefaultConfig();
// TikaConfig fromFile = new TikaConfig("/path/to/file");
BodyContentHandler handler = new BodyContentHandler(out);
Parser parser = new AutoDetectParser(config);
Metadata meta = new Metadata();
ParseContext parsecontext = new ParseContext();

PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);
List<LanguageDetector> listDetector = LanguageDetector.getLanguageDetectors();
String language = "eng";
if (listDetector.size() > 0) {
 LanguageDetector detector = listDetector.get(0);
 LanguageResult languageResult = detector.detect("กขค"); // sample Thai text; pass text from your own document
 language = languageResult.getLanguage();
}
System.out.println("Language Result = "+language);
TesseractOCRConfig tesserConfig = new TesseractOCRConfig();
tesserConfig.setLanguage(language);
tesserConfig.setTesseractPath("C:/PROGRA~2/Tesseract-OCR");
   
parsecontext.set(Parser.class, parser);
parsecontext.set(PDFParserConfig.class, pdfConfig);
parsecontext.set(TesseractOCRConfig.class, tesserConfig);  
  
try (InputStream in = pdf) { // closes the input stream when parsing ends
 parser.parse(in, handler, meta, parsecontext);
} catch (IOException e) {
 // Reading or closing the input stream failed
 e.printStackTrace();
} catch (SAXException e) {
 // The content handler could not process the parsed output
 e.printStackTrace();
} catch (TikaException e) {
 // Tika could not parse the document
 e.printStackTrace();
}
System.out.println(new String(out.toByteArray(), Charset.defaultCharset()));

Explanation of the code

Line 3
Opens the input file. It can be a PDF or an image such as a JPG, depending on your content.
Lines 10 - 18
Set up the configuration for the parser.
Lines 19 - 26
Try to detect the language with a language detector. If no detector is available, the default value of the language variable from line 20 is used.
Lines 27 - 33
Set up Tesseract OCR.
Line 36
Parses the text from the input file opened on line 3.
Line 47
Prints the output as a string, decoded with the machine's default character set.
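
Two things in the steps above are easy to get wrong: the Tesseract folder may not exist on your machine, and a language detector may return a two-letter code such as "th" while Tesseract names its language data with three-letter codes such as "tha". Below is a minimal sketch of two pre-flight checks, using only the Java standard library. The class OcrPreflight, its method names, and the small mapping table are hypothetical and are not part of Apache Tika; extend the table for the languages you need.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Apache Tika.
final class OcrPreflight {

    // Partial, illustrative mapping from two-letter ISO 639-1 codes
    // to the three-letter language names used by Tesseract.
    private static final Map<String, String> ISO_TO_TESSERACT = new HashMap<>();
    static {
        ISO_TO_TESSERACT.put("en", "eng");
        ISO_TO_TESSERACT.put("th", "tha");
        ISO_TO_TESSERACT.put("de", "deu");
        ISO_TO_TESSERACT.put("fr", "fra");
        ISO_TO_TESSERACT.put("ja", "jpn");
    }

    // Returns a Tesseract language name for the detected code,
    // or the fallback when the code is missing or not in the table.
    static String toTesseractLanguage(String detected, String fallback) {
        if (detected == null) {
            return fallback;
        }
        String code = detected.trim().toLowerCase(Locale.ROOT);
        if (code.isEmpty()) {
            return fallback;
        }
        if (ISO_TO_TESSERACT.containsValue(code)) {
            return code; // already a name from the table, e.g. "eng"
        }
        return ISO_TO_TESSERACT.getOrDefault(code, fallback);
    }

    // Returns true when the folder contains a tesseract executable
    // ("tesseract.exe" on Windows, "tesseract" elsewhere).
    static boolean hasTesseract(String installDir) {
        Path dir = Paths.get(installDir);
        return Files.isRegularFile(dir.resolve("tesseract.exe"))
                || Files.isRegularFile(dir.resolve("tesseract"));
    }

    public static void main(String[] args) {
        // Folder taken from the example above; pass your own as the first argument.
        String dir = args.length > 0 ? args[0] : "C:/PROGRA~2/Tesseract-OCR";
        System.out.println("Tesseract found: " + hasTesseract(dir));
        System.out.println("OCR language: " + toTesseractLanguage("th", "eng"));
    }
}
```

In the example code you could pass the result of toTesseractLanguage(languageResult.getLanguage(), "eng") to tesserConfig.setLanguage, and skip OCR or print a warning when hasTesseract returns false.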

DEV Fixxer
