Apache Tika is open-source software that can extract text from PDF and image files using OCR (through Tesseract). This example uses a Java Maven project to work with Apache Tika.
First, add the following dependencies for Apache Tika to your pom.xml:
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.18</version>
</dependency>
<dependency>
  <groupId>com.levigo.jbig2</groupId>
  <artifactId>levigo-jbig2-imageio</artifactId>
  <version>2.0</version>
</dependency>
<dependency>
  <groupId>com.github.jai-imageio</groupId>
  <artifactId>jai-imageio-core</artifactId>
  <version>1.4.0</version>
</dependency>
<dependency>
  <groupId>org.xerial</groupId>
  <artifactId>sqlite-jdbc</artifactId>
  <version>3.23.1</version>
</dependency>
<dependency>
  <groupId>com.github.jai-imageio</groupId>
  <artifactId>jai-imageio-jpeg2000</artifactId>
  <version>1.3.0</version>
</dependency>
Then import these classes into your class:
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.exception.TikaException;
import org.apache.tika.language.detect.LanguageDetector;
import org.apache.tika.language.detect.LanguageResult;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
Below is example code that parses a PDF file to text:
 1  InputStream pdf = null;
 2  try {
 3      pdf = Files.newInputStream(Paths.get("c:/myfolder/myfile.pdf"));
 4  } catch (IOException e) {
 5      // TODO Auto-generated catch block
 6      e.printStackTrace();
 7  }
 8
 9  ByteArrayOutputStream out = new ByteArrayOutputStream();
10  TikaConfig config = TikaConfig.getDefaultConfig();
11  // TikaConfig fromFile = new TikaConfig("/path/to/file");
12  BodyContentHandler handler = new BodyContentHandler(out);
13  Parser parser = new AutoDetectParser(config);
14  Metadata meta = new Metadata();
15  ParseContext parsecontext = new ParseContext();
16
17  PDFParserConfig pdfConfig = new PDFParserConfig();
18  pdfConfig.setExtractInlineImages(true);
19  List<LanguageDetector> listDetector = LanguageDetector.getLanguageDetectors();
20  String language = "eng";
21  if (listDetector.size() > 0) {
22      LanguageDetector detector = listDetector.get(0);
23      LanguageResult languageResult = detector.detect("กขค");
24      language = languageResult.getLanguage();
25  }
26  System.out.println("Language Result = "+language);
27  TesseractOCRConfig tesserConfig = new TesseractOCRConfig();
28  tesserConfig.setLanguage(language);
29  tesserConfig.setTesseractPath("C:/PROGRA~2/Tesseract-OCR");
30
31  parsecontext.set(Parser.class, parser);
32  parsecontext.set(PDFParserConfig.class, pdfConfig);
33  parsecontext.set(TesseractOCRConfig.class, tesserConfig);
34
35  try {
36      parser.parse(pdf, handler, meta, parsecontext);
37  } catch (IOException e) {
38      // TODO Auto-generated catch block
39      e.printStackTrace();
40  } catch (SAXException e) {
41      // TODO Auto-generated catch block
42      e.printStackTrace();
43  } catch (TikaException e) {
44      // TODO Auto-generated catch block
45      e.printStackTrace();
46  }
47  System.out.println(new String(out.toByteArray(), Charset.defaultCharset()));
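A side note on the parsing example: when the file cannot be opened, it prints the stack trace and carries on, so the parser later receives a null stream. Here is a minimal sketch of one way to guard against that, using only the JDK. The class SafeInput, the interface StreamReader and the method withInput are hypothetical names made up for illustration and are not part of Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not part of Tika): opens a file, hands the stream to a
// callback, and always closes the stream afterwards.
final class SafeInput {

    // Stand-in for the work done on the stream, e.g. a call to parser.parse(...).
    interface StreamReader<T> {
        T read(InputStream in) throws IOException;
    }

    static <T> T withInput(Path file, StreamReader<T> reader) {
        // try-with-resources closes the stream even if the callback throws
        try (InputStream in = Files.newInputStream(file)) {
            return reader.read(in);
        } catch (IOException e) {
            // Fail loudly instead of continuing with a null stream
            throw new UncheckedIOException("Cannot process " + file, e);
        }
    }
}
```

If you use something like this around parser.parse, the callback would still have to handle or wrap SAXException and TikaException itself, because this sketch only deals with IOException.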
Explanation of the code (line numbers refer to the parsing example)
- Line 3
Set up the input file. It can be a PDF or a JPG file, depending on your content.
- Line 10 - 18
Set up the configuration for the parser.
- Line 19 - 26
Try to detect the language with a language detector. If no detector is available, the default value of the variable language from line 20 is used.
- Line 27 - 33
Set up Tesseract OCR.
- Line 36
Parse the text from the input file opened at line 3.
- Line 47
Print the output as a string using the machine's default character set.
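One thing to check in the language step: Tika's language detectors usually report short ISO 639-1 codes such as "th" or "en", while Tesseract names its language data with three-letter codes such as "tha" or "eng". Passing the detected value straight to setLanguage may therefore ask Tesseract for a language it does not know. Below is a minimal sketch of a mapping step, assuming that mismatch applies to your detector. The class TesseractLanguage, the method toTesseract and the small table are hypothetical, and the table covers only a few languages.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not part of Tika or Tesseract).
final class TesseractLanguage {

    // Illustrative subset only; extend it for the language data you have installed.
    private static final Map<String, String> ISO_TO_TESSERACT = new HashMap<>();
    static {
        ISO_TO_TESSERACT.put("en", "eng");
        ISO_TO_TESSERACT.put("th", "tha");
        ISO_TO_TESSERACT.put("de", "deu");
        ISO_TO_TESSERACT.put("fr", "fra");
        ISO_TO_TESSERACT.put("ja", "jpn");
    }

    // Returns a Tesseract language code for the detected code,
    // or the fallback when the detected code is missing or not in the table.
    static String toTesseract(String detected, String fallback) {
        if (detected == null) {
            return fallback;
        }
        String key = detected.trim().toLowerCase(Locale.ROOT);
        if (ISO_TO_TESSERACT.containsValue(key)) {
            return key; // already one of the known Tesseract codes
        }
        return ISO_TO_TESSERACT.getOrDefault(key, fallback);
    }
}
```

With this in place, the setLanguage call in the example could become tesserConfig.setLanguage(TesseractLanguage.toTesseract(language, "eng")), so an unknown result falls back to English instead of being passed through.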