[JAVA] OCR - tesseract

by 바이너리10 2021. 6. 26.

Purpose of Tesseract

  • Optical character recognition (OCR)
  • Extracts text from images (photos)

Advantages of Tesseract

  • Free (open source, Apache License 2.0)

 

Disadvantages of Tesseract

  • Recognition accuracy for Korean (Hangul) is lower than expected
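Low accuracy on Korean text can sometimes be improved by cleaning up the image before OCR. The sketch below is not from the original post: it assumes that converting to grayscale and upscaling helps for the images at hand, which is a common suggestion for Tesseract but should be checked against real samples. The class name `OcrPreprocess` and the method `toScaledGrayscale` are hypothetical, and the code uses only the JDK's `java.awt` classes.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper (not part of Tess4J): prepares an image for OCR.
public class OcrPreprocess {

	// Returns a grayscale copy of the source, enlarged by an integer factor.
	public static BufferedImage toScaledGrayscale(BufferedImage source, int factor) {
		if (factor < 1) {
			throw new IllegalArgumentException("factor must be >= 1");
		}
		int width = source.getWidth() * factor;
		int height = source.getHeight() * factor;

		// TYPE_BYTE_GRAY makes the drawn pixels end up as 8-bit grayscale.
		BufferedImage target = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY);
		Graphics2D g = target.createGraphics();
		try {
			// Bicubic interpolation keeps glyph edges smoother when enlarging.
			g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
					RenderingHints.VALUE_INTERPOLATION_BICUBIC);
			g.drawImage(source, 0, 0, width, height, null);
		} finally {
			g.dispose();
		}
		return target;
	}
}
```

The returned image could be passed to Tess4J's `doOCR(BufferedImage)` overload instead of a `File`. A factor of 2 is only a starting point; whether it helps depends on the resolution of the input.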

Implementation

<!-- https://mvnrepository.com/artifact/net.sourceforge.tess4j/tess4j -->
<dependency>
	<groupId>net.sourceforge.tess4j</groupId>
	<artifactId>tess4j</artifactId>
	<version>4.5.4</version>
</dependency>

package spider.binaries.app.util;

import java.io.File;

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class TesseractTest {

	public static void main(String[] args) {
		// "PATH" is a placeholder: pass the name of an image file in the download directory.
		String result = ocrImage("PATH");
		System.out.println(result);
	}
	
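	// Builds a Tesseract instance that points at the local tessdata directory.
	private static Tesseract getTesseract() {
		Tesseract instance = new Tesseract();
		// Directory that holds the *.traineddata language files.
		instance.setDatapath("/Users/devs/Downloads/tessdata");
		// Korean only; use "kor+eng" to recognize Korean and English together.
		instance.setLanguage("kor");
		return instance;
	}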

	// Runs OCR on a file in the download directory and returns the recognized text.
	private static String ocrImage(String fileName) {
		Tesseract tesseract = getTesseract();
		String result = null;

		File file = new File("/Users/devs/Downloads/tessdata/download/" + fileName);

		if (file.exists() && file.canRead()) {
			try {
				result = tesseract.doOCR(file);
			} catch (TesseractException e) {
				// On failure, the error message is returned in place of the OCR text.
				result = e.getMessage();
			}
		} else {
			result = "not exist";
		}
		return result;
	}
}
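
The code above depends on `setDatapath` pointing at a directory that contains a `<language>.traineddata` file for each language passed to `setLanguage` (for example `kor.traineddata` for `"kor"`). As a rough sketch, a check like the following could run before OCR to report missing files. `TessdataCheck` and `missingLanguages` are hypothetical names, not part of Tess4J, and the check only looks at file names, not at whether the files are valid.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Tess4J): verifies that language data files exist.
public class TessdataCheck {

	// Returns the language codes from a "kor+eng" style string whose
	// <code>.traineddata file is not found in the given directory.
	public static List<String> missingLanguages(Path tessdataDir, String languages) {
		List<String> missing = new ArrayList<>();
		for (String lang : languages.split("\\+")) {
			String code = lang.trim();
			if (code.isEmpty()) {
				continue;
			}
			if (!Files.isRegularFile(tessdataDir.resolve(code + ".traineddata"))) {
				missing.add(code);
			}
		}
		return missing;
	}

	public static void main(String[] args) {
		// Assumed defaults for illustration; pass the real directory and languages as arguments.
		Path dir = Paths.get(args.length > 0 ? args[0] : "tessdata");
		String languages = args.length > 1 ? args[1] : "kor+eng";
		System.out.println("Missing language files: " + missingLanguages(dir, languages));
	}
}
```

With only `kor.traineddata` present in the directory, `missingLanguages(dir, "kor+eng")` returns a list containing `eng`, which is a clearer signal than an error surfacing later from `doOCR`.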

 

Installation notes

 

 

Sources: https://github.com/tesseract-ocr/tesseract

         https://tesseract-ocr.github.io/

         https://en.wikipedia.org/wiki/Tesseract_(software)

 

