Introduction to pdfminer.six
pdfminer.six is a Python library for extracting text, images, and other information from PDF documents. It offers a high-level API for common tasks as well as lower-level parser, interpreter, and layout classes for finer control, which makes it well suited to automating PDF processing tasks.
Getting Started with pdfminer.six
To get started with pdfminer.six, you need to first install it using pip:
pip install pdfminer.six
Extracting Text with pdfminer.six
To extract text from a PDF, you can use the following code snippet:
from pdfminer.high_level import extract_text


def extract_text_from_pdf(pdf_path):
    text = extract_text(pdf_path)
    return text


pdf_path = 'sample.pdf'
extracted_text = extract_text_from_pdf(pdf_path)
print(extracted_text)
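When the text needs to be handled page by page, one option is to split the returned string on the form feed character. This is a minimal sketch that assumes the string returned by extract_text carries a form feed after each page, which appears to be how the library's text converter marks the end of a page; verify this against your installed version. split_pages is a hypothetical helper name, not part of pdfminer.six, and the sample string stands in for real extracted text.

```python
def split_pages(text):
    """Split extracted text into a list of per-page strings.

    Assumes each page is terminated by a form feed character.
    """
    pages = text.split('\f')
    # A trailing form feed leaves an empty last item; drop it.
    if pages and pages[-1] == '':
        pages.pop()
    return pages


# Stand-in for the string returned by extract_text('sample.pdf')
sample = 'First page\n\fSecond page\n\f'
for number, page in enumerate(split_pages(sample), start=1):
    print(f'Page {number}: {page.strip()}')
```

Splitting after extraction keeps the single call to extract_text and avoids parsing the file once per page.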
Extracting Images from PDF
pdfminer.six also lets you locate the images embedded in a PDF file. Images usually sit inside figure containers rather than directly on the page, so the example below walks the layout tree recursively instead of checking only the top-level elements:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTContainer, LTImage
from pdfminer.pdfpage import PDFPage


def find_images(element):
    # Collect LTImage objects from this element and everything nested in it.
    if isinstance(element, LTImage):
        return [element]
    if isinstance(element, LTContainer):
        return [img for child in element for img in find_images(child)]
    return []


def extract_images_from_pdf(pdf_path):
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    images = []
    with open(pdf_path, 'rb') as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
            layout = device.get_result()
            images.extend(find_images(layout))
    return images


pdf_path = 'sample.pdf'
extracted_images = extract_images_from_pdf(pdf_path)
for img in extracted_images:
    print(f'Image: {img.name}')
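Each LTImage object carries more than a name. The following is a minimal sketch for summarising one, assuming the attributes name, bbox (position on the page), and srcsize (pixel dimensions of the embedded image) are present on the element, as they appear to be in current pdfminer.six releases. describe_image is a hypothetical helper, and the SimpleNamespace object only stands in for a real LTImage so the snippet runs without a PDF.

```python
from types import SimpleNamespace


def describe_image(img):
    """Return a plain dict summarising an image layout element."""
    x0, y0, x1, y1 = img.bbox  # bounding box on the page, in PDF points
    return {
        'name': img.name,
        # (width, height) of the source image in pixels, if available
        'pixel_size': getattr(img, 'srcsize', None),
        'page_box': (x0, y0, x1, y1),
        'width_pt': x1 - x0,
        'height_pt': y1 - y0,
    }


# Stand-in with the same attribute names as an LTImage (assumed)
fake = SimpleNamespace(name='Im0', bbox=(72.0, 500.0, 272.0, 650.0), srcsize=(800, 600))
print(describe_image(fake))
```

In real use, the elements returned by extract_images_from_pdf would be passed to describe_image in place of the stand-in.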
Extracting Metadata from PDF
Here is how you can extract metadata from a PDF file:
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser


def extract_metadata(pdf_path):
    with open(pdf_path, 'rb') as fp:
        parser = PDFParser(fp)
        document = PDFDocument(parser)
        metadata = document.info[0] if document.info else {}
    return metadata


pdf_path = 'sample.pdf'
metadata = extract_metadata(pdf_path)
for key, value in metadata.items():
    print(f'{key}: {value}')
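The values in the info dictionary are often raw byte strings rather than str, so printing them directly shows b'...' literals. Below is a minimal sketch of decoding them, assuming each value is either UTF-16 text with a leading byte order mark or single-byte text. Latin-1 is used here only as an approximation of PDFDocEncoding, and decode_value and decode_metadata are hypothetical helpers, not pdfminer.six functions.

```python
import codecs


def decode_value(value):
    """Turn a raw metadata value into a readable string."""
    if isinstance(value, bytes):
        if value.startswith(codecs.BOM_UTF16_BE):
            # UTF-16 big-endian text, marked by a leading byte order mark
            body = value[len(codecs.BOM_UTF16_BE):]
            return body.decode('utf-16-be', errors='replace')
        # Assumption: treat everything else as Latin-1 (close to PDFDocEncoding)
        return value.decode('latin-1')
    # Non-bytes values (for example indirect references) are shown via str()
    return str(value)


def decode_metadata(metadata):
    """Decode every value of a metadata dictionary."""
    return {key: decode_value(value) for key, value in metadata.items()}


# Stand-in for the dictionary returned by extract_metadata('sample.pdf')
raw = {'Title': b'\xfe\xff\x00H\x00i', 'Producer': b'Example Writer 1.0'}
for key, value in decode_metadata(raw).items():
    print(f'{key}: {value}')
```

Decoding in a separate step leaves extract_metadata unchanged, so callers that need the raw bytes can still get them.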
PDFMiner Example Application
Let’s create a small application that uses the above APIs to extract text, images, and metadata from a PDF file.
from pdfminer.high_level import extract_text
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTContainer, LTImage
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser


def extract_text_from_pdf(pdf_path):
    return extract_text(pdf_path)


def extract_metadata(pdf_path):
    with open(pdf_path, 'rb') as fp:
        parser = PDFParser(fp)
        document = PDFDocument(parser)
        return document.info[0] if document.info else {}


def find_images(element):
    # Images are usually nested inside figure containers, so search recursively.
    if isinstance(element, LTImage):
        return [element]
    if isinstance(element, LTContainer):
        return [img for child in element for img in find_images(child)]
    return []


def extract_images_from_pdf(pdf_path):
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    images = []
    with open(pdf_path, 'rb') as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
            layout = device.get_result()
            images.extend(find_images(layout))
    return images


def main(pdf_path):
    text = extract_text_from_pdf(pdf_path)
    metadata = extract_metadata(pdf_path)
    images = extract_images_from_pdf(pdf_path)

    print("Text extracted:")
    print(text)

    print("\nMetadata extracted:")
    for key, value in metadata.items():
        print(f"{key}: {value}")

    print("\nImages extracted:")
    for img in images:
        print(f"Image: {img.name}")


if __name__ == "__main__":
    pdf_path = 'sample.pdf'
    main(pdf_path)