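Purpose of Tesseract
Optical character recognition (OCR): extracting text from images (photos)
Advantages of Tesseract
- Free
Disadvantages of Tesseract
- Korean recognition accuracy is lower than expected
Implementation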
<!-- https://mvnrepository.com/artifact/net.sourceforge.tess4j/tess4j -->
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.5.4</version>
</dependency>
package spider.binaries.app.util;

import java.io.File;

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class TesseractTest {

    public static void main(String[] args) {
        // "PATH" is a placeholder for an image file name inside the download directory used below.
        String result = ocrImage("PATH");
        System.out.println(result);
    }

    // Builds a Tesseract instance that loads its language data from the local tessdata directory.
    private static Tesseract getTesseract() {
        Tesseract instance = new Tesseract();
        instance.setDatapath("/Users/devs/Downloads/tessdata");
        instance.setLanguage("kor"); // use "kor+eng" to recognize Korean and English together
        return instance;
    }

    // Returns the recognized text, the exception message if OCR fails,
    // or "not exist" if the file is missing or unreadable.
    private static String ocrImage(String fileName) {
        Tesseract tesseract = getTesseract();
        String result = null;
        File file = new File("/Users/devs/Downloads/tessdata/download/" + fileName);
        if (file.exists() && file.canRead()) {
            try {
                result = tesseract.doOCR(file);
            } catch (TesseractException e) {
                result = e.getMessage();
            }
        } else {
            result = "not exist";
        }
        return result;
    }
}
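The code above depends on the tessdata directory containing a trained-data file for each language passed to setLanguage (for example kor.traineddata for "kor"). As a rough sketch, the check below reports which of those files are missing before OCR is attempted. The class name TessdataCheck and the method missingLanguages are hypothetical helpers made up for this illustration and are not part of tess4j; the directory path is the one used in the example above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of tess4j): verifies that the language files
// Tesseract will look for are present in the tessdata directory.
public class TessdataCheck {

    // Splits a language string such as "kor+eng" and returns the codes whose
    // <code>.traineddata file is not found directly under tessdataDir.
    static List<String> missingLanguages(Path tessdataDir, String languages) {
        List<String> missing = new ArrayList<>();
        for (String lang : languages.split("\\+")) {
            Path model = tessdataDir.resolve(lang + ".traineddata");
            if (!Files.isRegularFile(model)) {
                missing.add(lang);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Same directory as in the example above; adjust for your machine.
        Path dir = Paths.get("/Users/devs/Downloads/tessdata");
        List<String> missing = missingLanguages(dir, "kor+eng");
        if (missing.isEmpty()) {
            System.out.println("All language files found in " + dir);
        } else {
            System.out.println("Missing traineddata for: " + missing);
        }
    }
}
```

Running a check like this first makes a missing language file easier to spot than an error raised later from inside doOCR.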
Installation notes
Sources: https://github.com/tesseract-ocr/tesseract
https://tesseract-ocr.github.io/
https://en.wikipedia.org/wiki/Tesseract_(software)