[Java] PDF - pdfbox

by 바이너리10 2021. 7. 8.

Purpose of pdfbox

  • Creating PDF documents
  • Extracting and editing the content of PDF documents
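
As a rough illustration of the first purpose, the sketch below creates a one-page PDF in memory and reads the text back out. It assumes the PDFBox 2.0.x API (for example the `PDType1Font.HELVETICA` constant, which later major versions replaced). The class name `PdfCreateSketch`, the method names, and the text position are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical example class, written against the PDFBox 2.0.x API.
public class PdfCreateSketch {

    // Creates a single-page PDF containing one line of text and returns its bytes.
    public static byte[] createPdf(String message) throws IOException {
        try (PDDocument doc = new PDDocument();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            PDPage page = new PDPage(); // the no-arg constructor uses US Letter size
            doc.addPage(page);

            // The content stream must be closed before the document is saved.
            try (PDPageContentStream content = new PDPageContentStream(doc, page)) {
                content.beginText();
                content.setFont(PDType1Font.HELVETICA, 12);
                content.newLineAtOffset(50, 700); // arbitrary position, in points from the bottom-left
                content.showText(message);
                content.endText();
            }

            doc.save(out);
            return out.toByteArray();
        }
    }

    // Loads a PDF from bytes and returns all of its text.
    public static String extractText(byte[] pdf) throws IOException {
        try (PDDocument doc = PDDocument.load(pdf)) {
            return new PDFTextStripper().getText(doc);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] pdf = createPdf("Hello pdfbox");
        System.out.println(extractText(pdf));
    }
}
```

The built-in Type 1 fonts only cover Latin characters, so text such as Korean would need an embedded TrueType font instead (for instance one loaded through `PDType0Font.load`).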

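Advantages of pdfbox

  • Free and open source (Apache License 2.0)
  • Can extract or edit the text inside a PDF document
  • Can split a PDF file into several files or merge several into one
  • Can convert a PDF to an image (PNG or JPEG)
  • Can digitally sign a PDF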
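
The split and merge item above can be sketched roughly as follows. This is a minimal example assuming the PDFBox 2.0.x classes `Splitter` and `PDFMergerUtility`. The class name `PdfSplitMergeSketch` and its helper methods are hypothetical, and the blank three-page input exists only so the example runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;
import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

// Hypothetical example class, written against the PDFBox 2.0.x API.
public class PdfSplitMergeSketch {

    // Builds an in-memory PDF with the given number of blank pages (demo input only).
    public static byte[] blankPdf(int pages) throws IOException {
        try (PDDocument doc = new PDDocument()) {
            for (int i = 0; i < pages; i++) {
                doc.addPage(new PDPage());
            }
            return toBytes(doc);
        }
    }

    // Splits a PDF into one single-page PDF per page (the Splitter default).
    public static List<byte[]> splitPages(byte[] pdf) throws IOException {
        List<byte[]> parts = new ArrayList<>();
        // The source stays open until every part has been saved.
        try (PDDocument source = PDDocument.load(pdf)) {
            for (PDDocument single : new Splitter().split(source)) {
                try (PDDocument part = single) {
                    parts.add(toBytes(part));
                }
            }
        }
        return parts;
    }

    // Merges the given PDFs, in order, into a single PDF.
    public static byte[] merge(List<byte[]> parts) throws IOException {
        PDFMergerUtility merger = new PDFMergerUtility();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] part : parts) {
            merger.addSource(new ByteArrayInputStream(part));
        }
        merger.setDestinationStream(out);
        merger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
        return out.toByteArray();
    }

    public static int pageCount(byte[] pdf) throws IOException {
        try (PDDocument doc = PDDocument.load(pdf)) {
            return doc.getNumberOfPages();
        }
    }

    private static byte[] toBytes(PDDocument doc) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        doc.save(out);
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        List<byte[]> parts = splitPages(blankPdf(3));
        System.out.println("parts after split : " + parts.size());
        System.out.println("pages after merge : " + pageCount(merge(parts)));
    }
}
```

For real files, `PDDocument.load(File)` and `PDFMergerUtility.addSource(File)` could replace the byte arrays. `Splitter.setSplitAtPage(n)` changes how many pages go into each part.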

Disadvantages of pdfbox

  • None that I have found so far

Implementation

<!-- https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox -->
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.24</version>
</dependency>

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.pdfbox.io.RandomAccessBufferedFileInputStream;
import org.apache.pdfbox.io.RandomAccessRead;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Extracts the text of a PDF in two different ways using the PDFBox 2.0.x API.
public class PdfReaderUtil {

	public static void main(String[] args) {
		// The path of the PDF to read is taken from the first command-line argument.
		if (args.length < 1) {
			System.out.println("usage: java PdfReaderUtil <path-to-pdf>");
			return;
		}
		String filePath = args[0];

		System.out.println("start!");

		try {
			String text1 = readPdfBox1(filePath);
			String text2 = readPdfBox2(filePath);
			System.out.println("readPdfBox1:" + text1);
			System.out.println("readPdfBox2:" + text2);
		} catch (Exception e) {
			System.out.println("exception : " + e);
		}

		System.out.println("end!");
	}

	// Simple route: PDDocument.load parses the file, PDFTextStripper pulls out the text.
	public static String readPdfBox1(String filePath) throws Exception {
		// try-with-resources closes the document even when extraction fails.
		try (PDDocument pDDocument = PDDocument.load(new File(filePath))) {
			return new PDFTextStripper().getText(pDDocument);
		}
	}

	// Lower-level route: run PDFParser by hand and set the separators explicitly.
	public static String readPdfBox2(String filePath) throws Exception {
		try (InputStream inputStream = new FileInputStream(new File(filePath))) {
			RandomAccessRead source = new RandomAccessBufferedFileInputStream(inputStream);
			PDFParser pDFParser = new PDFParser(source);
			pDFParser.parse();

			// Closing the PDDocument also closes its underlying COSDocument.
			try (PDDocument pDDocument = pDFParser.getPDDocument()) {
				PDFTextStripper pDFTextStripper = new PDFTextStripper();
				pDFTextStripper.setLineSeparator("\n");
				pDFTextStripper.setWordSeparator(" ");

				System.out.println("Total : " + pDDocument.getNumberOfPages());

				return pDFTextStripper.getText(pDDocument);
			}
		}
	}
}
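
The class above only covers text extraction. Converting a PDF to an image, which is listed among the advantages, could look roughly like the sketch below. It assumes the PDFBox 2.0.x `PDFRenderer` API and writes PNGs with the JDK's own `ImageIO`, so the single `pdfbox` dependency is enough. The class name `PdfRenderSketch`, the 150 DPI setting, and the `page-N.png` output names are choices made for this example.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import javax.imageio.ImageIO;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;

// Hypothetical example class, written against the PDFBox 2.0.x API.
public class PdfRenderSketch {

    // Renders every page of the PDF to PNG bytes at the given resolution.
    public static List<byte[]> renderToPng(byte[] pdf, float dpi) throws IOException {
        List<byte[]> images = new ArrayList<>();
        try (PDDocument doc = PDDocument.load(pdf)) {
            PDFRenderer renderer = new PDFRenderer(doc);
            for (int i = 0; i < doc.getNumberOfPages(); i++) {
                // Page indexes are zero-based.
                BufferedImage image = renderer.renderImageWithDPI(i, dpi, ImageType.RGB);
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                ImageIO.write(image, "png", out);
                images.add(out.toByteArray());
            }
        }
        return images;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.out.println("usage: java PdfRenderSketch <path-to-pdf>");
            return;
        }
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        List<byte[]> images = renderToPng(pdf, 150f); // 150 DPI is an arbitrary choice

        // Writes page-1.png, page-2.png, ... into the current directory.
        for (int i = 0; i < images.size(); i++) {
            Files.write(Paths.get("page-" + (i + 1) + ".png"), images.get(i));
        }
        System.out.println("rendered pages : " + images.size());
    }
}
```

A higher DPI gives sharper images at the cost of memory and time. For JPEG output, passing `"jpg"` to `ImageIO.write` should work with an `ImageType.RGB` image.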