What is PyPDF2?
PyPDF2 is a pure-Python library for reading and manipulating PDF files. It can extract text, merge and split documents, read and modify metadata, and encrypt files with a password, which covers most routine PDF tasks a developer runs into.
Getting Started with PyPDF2
To begin using PyPDF2, you can install it via pip:
pip install PyPDF2
Useful APIs and Examples
1. Extracting Text from a PDF
Extracting text from a PDF is one of the most common tasks. Here’s how you can do it:
from PyPDF2 import PdfReader
reader = PdfReader("example.pdf")
for page in reader.pages:
    print(page.extract_text())
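Scanned or image-only pages often yield no text at all, so it can help to label each page and skip the empty ones. The following is a minimal sketch of that idea; the helper names label_pages and extract_labeled_text are illustrative and not part of PyPDF2.

```python
def label_pages(page_texts):
    """Join per-page text, prefixing each non-empty page with its 1-based number."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        text = (text or "").strip()  # tolerate None or whitespace-only pages
        if text:
            parts.append(f"--- page {number} ---\n{text}")
    return "\n\n".join(parts)


def extract_labeled_text(pdf_path):
    """Hypothetical wrapper: read a PDF and return its labeled text."""
    # Imported here so label_pages can be reused without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    return label_pages(page.extract_text() for page in reader.pages)
```

Calling extract_labeled_text("example.pdf") would return one string in which every page that produced text is introduced by its page number.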
2. Merging Multiple PDFs
With PyPDF2, you can merge multiple PDF files into a single document:
from PyPDF2 import PdfMerger
merger = PdfMerger()
merger.append("file1.pdf")
merger.append("file2.pdf")
merger.write("merged_document.pdf")
merger.close()
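When the inputs live in one folder, the merge order usually matters. Plain alphabetical sorting puts file10.pdf before file2.pdf, so the sketch below sorts by a natural key first. The names natural_key and merge_folder are assumptions made for illustration.

```python
import re
from pathlib import Path


def natural_key(name):
    """Sort key that compares digit runs as numbers: file2 sorts before file10."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def merge_folder(folder, output_path):
    """Hypothetical helper: merge every PDF in a folder in natural name order."""
    # Imported here so natural_key can be reused without PyPDF2 installed.
    from PyPDF2 import PdfMerger

    paths = sorted(Path(folder).glob("*.pdf"), key=lambda p: natural_key(p.name))
    merger = PdfMerger()
    try:
        for path in paths:
            merger.append(str(path))
        # Write outside the input folder, or a rerun will merge the old output too.
        merger.write(str(output_path))
    finally:
        merger.close()
```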
3. Splitting a PDF into Individual Pages
If you need to split a PDF into separate pages, you can do so easily:
from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader("example.pdf")
for page_number, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{page_number + 1}.pdf", "wb") as output_file:
        writer.write(output_file)
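The same approach extends to splitting a document into groups of several pages instead of single pages. Here is a rough sketch, assuming a fixed chunk size; chunk_ranges, split_in_chunks, and the chunk_N.pdf naming scheme are hypothetical choices.

```python
def chunk_ranges(total_pages, chunk_size):
    """Return ranges of 0-based page indices, each at most chunk_size long."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [range(start, min(start + chunk_size, total_pages))
            for start in range(0, total_pages, chunk_size)]


def split_in_chunks(pdf_path, chunk_size):
    """Hypothetical helper: write chunk_1.pdf, chunk_2.pdf, ... to the current folder."""
    # Imported here so chunk_ranges can be reused without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(pdf_path)
    for index, pages in enumerate(chunk_ranges(len(reader.pages), chunk_size), start=1):
        writer = PdfWriter()
        for page_index in pages:
            writer.add_page(reader.pages[page_index])
        with open(f"chunk_{index}.pdf", "wb") as output_file:
            writer.write(output_file)
```

The last chunk is simply shorter when the page count is not a multiple of the chunk size.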
4. Adding a Password to a PDF
Secure your document by encrypting it with a password:
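from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader("example.pdf")
writer = PdfWriter()
# Copy every page into the writer before encrypting
for page in reader.pages:
    writer.add_page(page)
writer.encrypt("securepassword")
with open("encrypted_document.pdf", "wb") as output_file:
    writer.write(output_file)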
5. Extracting PDF Metadata
You can also extract metadata such as the title, author, and subject of a PDF:
from PyPDF2 import PdfReader
reader = PdfReader("example.pdf")
print(reader.metadata)
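The metadata object also exposes individual fields as attributes such as title, author, and subject, and reader.metadata can be None when a file carries no document information. A small sketch that collects only the populated fields follows; summarize_metadata and read_metadata are illustrative names, not PyPDF2 functions.

```python
def summarize_metadata(metadata, fields=("title", "author", "subject")):
    """Return the populated metadata fields as a plain dict."""
    if metadata is None:  # some PDFs carry no document information
        return {}
    summary = {}
    for name in fields:
        value = getattr(metadata, name, None)
        if value:  # skip missing or empty entries
            summary[name] = value
    return summary


def read_metadata(pdf_path):
    """Hypothetical wrapper: summarize the metadata of a PDF on disk."""
    # Imported here so summarize_metadata can be reused without PyPDF2 installed.
    from PyPDF2 import PdfReader

    return summarize_metadata(PdfReader(pdf_path).metadata)
```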
6. Adding a Watermark to a PDF
To watermark a document, overlay the first page of a separate watermark PDF onto every page of the target file:
from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader("document.pdf")
watermark = PdfReader("watermark.pdf")
watermark_page = watermark.pages[0]
writer = PdfWriter()
for page in reader.pages:
    page.merge_page(watermark_page)
    writer.add_page(page)
with open("watermarked_document.pdf", "wb") as output_file:
    writer.write(output_file)
A Real Application Example: Automating Invoice Processing
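Imagine you run a business and need to process PDF invoices to extract specific information such as vendor names, amounts, and dates. The example below assumes each page holds one invoice and that the extracted text contains the labels "Vendor:", "Amount:", and "Date:", each followed by its value on the same line: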
from PyPDF2 import PdfReader
def extract_invoice_data(pdf_path):
    reader = PdfReader(pdf_path)
    data = []
    for page in reader.pages:
        text = page.extract_text() or ""
        # Reset the fields for every page so a missing label yields None
        # instead of a NameError or a value left over from the previous page.
        vendor = amount = date = None
        # Simple parsing logic: take the rest of the line after each label
        if "Vendor:" in text:
            vendor = text.split("Vendor:")[1].split("\n")[0].strip()
        if "Amount:" in text:
            amount = text.split("Amount:")[1].split("\n")[0].strip()
        if "Date:" in text:
            date = text.split("Date:")[1].split("\n")[0].strip()
        data.append({"vendor": vendor, "amount": amount, "date": date})
    return data

invoices = extract_invoice_data("invoices.pdf")
for invoice in invoices:
    print(invoice)
This example shows how text extraction can feed simple business automation. Because the extracted text depends on how each PDF was produced, expect to adapt the parsing to your own invoice layout.
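String splitting becomes fragile once labels appear mid-line or with irregular spacing. One alternative is a regular expression that only accepts a label at the start of a line. The sketch below works on already-extracted text and assumes the same three labels as the example; FIELD_PATTERN and parse_invoice_fields are hypothetical names.

```python
import re

# A label at the start of a line, then the rest of that line as the value.
FIELD_PATTERN = re.compile(
    r"^[ \t]*(Vendor|Amount|Date):[ \t]*(.+?)[ \t]*$", re.MULTILINE
)


def parse_invoice_fields(text):
    """Return vendor, amount, and date found in the text; missing fields stay None."""
    fields = {"vendor": None, "amount": None, "date": None}
    for label, value in FIELD_PATTERN.findall(text or ""):
        key = label.lower()
        if fields[key] is None:  # keep the first occurrence of each label
            fields[key] = value
    return fields
```

A function like this could be called on the result of page.extract_text() for each page, in place of the split-based parsing.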
Conclusion
PyPDF2 covers the everyday PDF tasks a Python developer is likely to meet, from merging and splitting files to extracting text for workflows such as invoice processing. Try the examples above on your own documents and adapt them to streamline your workflows.