How to Turn PDF Documents into Data Tables with Python

Extracting tabular data from PDFs doesn’t have to be a nightmare. Here’s a beginner-friendly guide with real code examples.


PDFs weren’t made for data analysis — but with the right Python tricks, they don’t stand a chance.

PDFs are everywhere. From invoices and research papers to bank statements and annual reports, businesses and developers constantly encounter them.

The problem? PDFs are built for presentation — not for data extraction.

If you’ve ever tried pulling tables out of a PDF and feeding them into a CSV or a database, you know it’s not as simple as copying and pasting. But with Python, we can bridge that gap efficiently.

In this guide, I’ll walk you through how to extract data tables from PDF documents using Python — turning frustrating static content into usable structured data.


Tools We’ll Use

To get started, you’ll need a few libraries:

  • pdfplumber: reads PDF files and extracts their text and tables.
  • pandas: holds and manipulates the extracted tabular data.
  • tabula-py (optional): a fallback for text-based tables that pdfplumber fails to detect. It requires Java and does not read scanned images.

Let’s install them:

pip install pdfplumber pandas 
pip install tabula-py # Optional: If you want to use tabula-py

Step 1: Read and Explore Your PDF

Let’s say you have a PDF file named invoice.pdf, and you're trying to extract a billing table from it.

Start by opening the PDF and inspecting its contents:

import pdfplumber 
 
with pdfplumber.open("invoice.pdf") as pdf: 
    first_page = pdf.pages[0] 
    print(first_page.extract_text())

This gives you a sense of how the data is laid out and whether a table exists in a machine-readable form.
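One way to make that check explicit is a small helper that looks at what extract_text() returned. The sketch below is only an illustration and not part of pdfplumber: has_text_layer is a hypothetical name, and its min_chars cutoff of 20 is an arbitrary assumption.

```python
def has_text_layer(text, min_chars=20):
    """Return True if extracted page text looks machine-readable.

    `text` is whatever page.extract_text() returned. Scanned pages
    typically yield an empty result because they contain only an image.
    `min_chars` is an arbitrary, hypothetical cutoff.
    """
    if not text:
        return False
    # Count only non-whitespace characters so blank lines do not count.
    visible = sum(1 for ch in text if not ch.isspace())
    return visible >= min_chars


# Plain strings standing in for extract_text() results
print(has_text_layer("Item Quantity Price Total\nWidget A 2 $10 $20"))  # True
print(has_text_layer(""))  # False
```

If this returns False for a page, the table is probably part of an image, and the page would need OCR before any of the steps below can work.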

Step 2: Extracting Tables Using pdfplumber

Once you’ve confirmed that there’s a table, extract it like this:

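import pandas as pd
import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    page = pdf.pages[0]
    # Returns a list of rows (each a list of cell strings), or None if no table is found
    table = page.extract_table()

    # First row becomes the column headers; the remaining rows are the data
    df = pd.DataFrame(table[1:], columns=table[0])
    print(df)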

What’s happening here?

  • extract_table() returns a list of lists, representing rows.
  • The first row is treated as headers.
  • We convert it into a DataFrame for clean and structured access.
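In practice the raw list of lists often needs light tidying before it becomes a DataFrame: pdfplumber can represent empty cells as None, and wrapped cell text can contain line breaks. Here is a minimal sketch of that tidy-up, where clean_cell and normalize_table are hypothetical helper names and the sample rows are written by hand:

```python
def clean_cell(cell):
    """Turn None into '' and collapse internal whitespace and newlines."""
    if cell is None:
        return ""
    return " ".join(str(cell).split())


def normalize_table(table):
    """Clean every cell and drop rows that end up completely empty."""
    cleaned = [[clean_cell(cell) for cell in row] for row in table]
    return [row for row in cleaned if any(row)]


# A hand-written table in the shape extract_table() returns
raw = [
    ["Item", "Quantity", "Price", "Total"],
    ["Widget\nA", "2", "$10", "$20"],
    [None, None, None, None],
    ["Widget B", " 1 ", "$15", None],
]
table = normalize_table(raw)
print(table[1])  # ['Widget A', '2', '$10', '$20']
```

The normalized result can be passed to pd.DataFrame(table[1:], columns=table[0]) in the same way as the raw table.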

Sample Output

| Item      | Quantity | Price | Total | 
|-----------|----------|-------|-------| 
| Widget A  | 2        | $10   | $20   | 
| Widget B  | 1        | $15   | $15   |

After conversion:

print(df)
       Item Quantity Price Total
0  Widget A        2   $10   $20
1  Widget B        1   $15   $15

Now you can export this to CSV:

df.to_csv("invoice_data.csv", index=False)

What If Tables Aren’t Detected?

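Sometimes extract_table() returns None, typically when the table has no ruling lines for pdfplumber to follow, or when the page is a scanned image with no text layer.

For text-based tables that pdfplumber misses, tabula-py is worth trying. Scanned pages need OCR first, which neither library performs.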
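Before switching tools, it may be worth retrying pdfplumber with its text-based strategies, which infer row and column boundaries from how words line up instead of from drawn lines. The sketch below assumes the table_settings keys "vertical_strategy" and "horizontal_strategy" with the value "text"; extract_table_with_fallback is a hypothetical helper name.

```python
# Settings that ask pdfplumber to infer cell boundaries from word
# alignment rather than from ruling lines.
TEXT_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}


def extract_table_with_fallback(page):
    """Try the default detection first, then the text-based one.

    `page` is a pdfplumber page (anything with an extract_table method).
    Returns a list of rows, or None if both attempts find nothing.
    """
    table = page.extract_table()
    if table:
        return table
    return page.extract_table(table_settings=TEXT_SETTINGS)
```

Inside the with block from Step 2, this would be called as extract_table_with_fallback(pdf.pages[0]). Text-based detection can produce spurious columns on pages that mix tables with ordinary paragraphs, so the result deserves a quick visual check.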

Step 3: Extracting Complex Tables with tabula-py

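tabula-py is a Python wrapper for tabula-java, the engine behind the popular Tabula tool, which uses layout heuristics to detect tables in text-based PDFs. It does not perform OCR, so scanned documents must go through an OCR tool before Tabula can read them.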

Requirements:
You need Java installed on your machine.

import tabula 
 
# Extract all tables from a PDF 
tables = tabula.read_pdf("invoice.pdf", pages='all', multiple_tables=True) 
 
for i, table in enumerate(tables): 
    print(f"Table {i}:\n", table)
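If the default detection misses a table, tabula-py exposes two extraction modes through read_pdf: lattice=True for tables drawn with cell borders, and stream=True for tables separated only by whitespace. Here is a hedged sketch of choosing between them; tabula_options and read_tables are hypothetical helper names, and whether a given PDF has ruling lines is something you would decide by looking at it.

```python
def tabula_options(has_ruling_lines):
    """Build keyword arguments for tabula.read_pdf.

    lattice=True targets tables drawn with cell borders;
    stream=True targets tables separated only by whitespace.
    """
    options = {"pages": "all", "multiple_tables": True}
    if has_ruling_lines:
        options["lattice"] = True
    else:
        options["stream"] = True
    return options


def read_tables(path, has_ruling_lines):
    """Read every table in the PDF at `path` as a list of DataFrames."""
    import tabula  # imported lazily so tabula_options works without Java

    return tabula.read_pdf(path, **tabula_options(has_ruling_lines))
```

For a borderless invoice, read_tables("invoice.pdf", has_ruling_lines=False) would run the stream mode across all pages.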

Bonus Tip: Batch Process PDFs

If you’re dealing with hundreds of PDFs (say monthly invoices), automate it:

import os

import pandas as pd
import pdfplumber

directory = "invoices"
all_data = []

for filename in sorted(os.listdir(directory)):
    if filename.lower().endswith(".pdf"):
        with pdfplumber.open(os.path.join(directory, filename)) as pdf:
            page = pdf.pages[0]
            table = page.extract_table()
            if table:
                df = pd.DataFrame(table[1:], columns=table[0])
                df["source"] = filename  # record which file each row came from
                all_data.append(df)

# pd.concat raises ValueError on an empty list, so check before combining
if all_data:
    final_df = pd.concat(all_data, ignore_index=True)
    final_df.to_csv("all_invoices.csv", index=False)
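The loop above reads only the first page of each file. If tables can appear on any page, a variation along these lines collects them all. It relies on pdfplumber's extract_tables() (plural), which returns a list of tables for a page; tables_from_pdf is a hypothetical helper name.

```python
def tables_from_pdf(pdf):
    """Yield (page_number, table) for every table on every page.

    `pdf` is an open pdfplumber PDF (anything exposing .pages whose
    items have an extract_tables() method). Page numbers start at 1.
    """
    for page_number, page in enumerate(pdf.pages, start=1):
        for table in page.extract_tables():
            if table:  # skip empty detections
                yield page_number, table
```

Inside the batch loop, iterating with for page_number, table in tables_from_pdf(pdf) could replace the two lines that read pdf.pages[0], and page_number could be stored next to the source column.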

Cleaning Up Extracted Data

Most PDF tables won’t be perfect. You may need to:

  • Remove empty rows/columns
  • Standardize column names
  • Strip unwanted characters like $ or %
  • Convert string numbers into proper data types
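
For example, to strip the dollar sign from the Price column and convert it to a float:

df["Price"] = df["Price"].str.replace("$", "", regex=False).astype(float)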
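The other clean-up steps in the list can be sketched with two small pure-Python helpers. Both names, standardize_name and parse_number, are hypothetical, and the set of characters stripped ($, %, commas, whitespace) is an assumption about what your tables contain.

```python
import re


def standardize_name(name):
    """Lower-case a column name and replace runs of non-alphanumerics with '_'."""
    return re.sub(r"[^0-9a-z]+", "_", str(name).strip().lower()).strip("_")


def parse_number(value):
    """Convert strings like '$1,234.50' or '15%' to float; None if not numeric."""
    if value is None:
        return None
    stripped = re.sub(r"[$%,\s]", "", str(value))
    try:
        return float(stripped)
    except ValueError:
        return None


print(standardize_name(" Unit Price ($) "))  # unit_price
print(parse_number("$1,234.50"))  # 1234.5
print(parse_number("n/a"))  # None
```

With a DataFrame, these could be applied as df.columns = [standardize_name(c) for c in df.columns] and df["price"] = df["price"].map(parse_number). Unlike a bare astype(float), this leaves unparseable cells as missing values instead of raising an error.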

Final Thoughts

PDFs aren’t naturally made for data science or automation workflows. But with tools like pdfplumber and tabula-py, Python can tame even the messiest tables.

Whether you’re building an invoicing dashboard, parsing research data, or extracting analytics from reports, this workflow will save you hours of manual work.

Let’s Connect

Have questions or dealing with a tricky PDF? Drop a comment — I love helping developers extract insights from the unexpected.

If you found this useful, share it with your team or follow for more Python productivity tips!
