How to Extract Data from PDF to Excel Using Python

Cody Schneider · 8 min read

Trying to copy and paste tables from a PDF into Excel is a recipe for frustration. What you get is often a garbled mess of text that takes longer to clean up than it would to retype manually. If this is a regular part of your workflow, there’s a much better way. This guide will walk you through exactly how to use Python to automatically pull data from PDFs and save it into a perfectly formatted Excel file.


Why Use Python for This Task?

While there are online converters, they often have limitations on file size, privacy concerns, or just don't handle complex layouts well. Python, on the other hand, gives you complete control and repeatability. Once you write a script, you can reuse it for dozens or thousands of similar files without any extra effort.

Here’s why it’s a great choice:

  • It's free: Python and its data-processing libraries are open-source, so you can build powerful tools without paying for expensive software licenses.
  • It's powerful: Python can handle PDFs with dozens of pages, multiple tables, and weird formatting that would break simpler tools.
  • It’s automatable: It allows you to build a reliable solution that you can run again and again, saving you countless hours of manual work.

Setting Up Your Environment

Before we can start extracting data, we need to install a few essential Python libraries. Don't worry if this sounds intimidating; it's just a couple of simple commands. We'll be using three main libraries:

  • Tabula-py: An excellent library specifically designed to find and extract tables from PDFs.
  • Pandas: The most popular library in Python for data manipulation and analysis. We'll use it to organize the extracted data and save it to an Excel file.
  • Openpyxl: A library that Pandas uses behind the scenes to write to .xlsx files.

You can install all three with one command. Open your terminal (or Command Prompt on Windows) and type the following:

pip install tabula-py pandas openpyxl

One crucial note: the tabula-py library is a Python wrapper for a Java library called Tabula. This means you must have Java installed on your computer for it to work. If you don't have it, you can grab the latest version from the official Java Downloads page - a quick download and install is all it takes.
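Since a missing Java install is the most common stumbling block with tabula-py, a quick sanity check can save you debugging time later. Here is a minimal sketch (the java_available helper is our own, not part of tabula-py) that checks whether a java executable is on your PATH:

```python
import shutil

def java_available() -> bool:
    # tabula-py shells out to Java, so a 'java' executable must be on the PATH
    return shutil.which("java") is not None

print("Java found." if java_available() else "Install Java before running tabula-py.")
```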

Extracting a Simple Table from a PDF

Let's start with a common scenario: you have a one-page PDF with a single, well-defined table. Our goal is to pull that table into an Excel spreadsheet.

Imagine you have a PDF file named financial_summary.pdf in the same folder as your Python script.

Here is the full Python script to do it:

import tabula
import pandas as pd

# Define the path to your PDF file
file_path = "financial_summary.pdf"

# Use tabula to read the table(s) from the PDF
# This returns a list of pandas DataFrames
tables = tabula.read_pdf(file_path, pages='1')

# For this simple case, we know there's only one table, so we grab the first item
if tables:
    df = tables[0]

    # Define the name for your output Excel file
    output_excel_path = "financial_output.xlsx"

    # Use pandas to save the DataFrame to an Excel file
    # The index=False argument prevents pandas from writing row indices to the file
    df.to_excel(output_excel_path, index=False)

    print(f"Table successfully extracted and saved to {output_excel_path}")
else:
    print("No tables found in the PDF.")

Breaking Down the Code

  • import tabula and import pandas as pd: We start by importing the libraries we need. pd is the standard, community-accepted alias for pandas.
  • file_path = "financial_summary.pdf": We store the name of our PDF in a variable for easy access.
  • tables = tabula.read_pdf(file_path, pages='1'): This is where the magic happens. We tell tabula to read our PDF file. The read_pdf function always returns a list of tables because some pages might have more than one. We use pages='1' to tell it to only look at the first page.
  • df = tables[0]: Since tables is a list, we access the first (and in this case, only) table with tables[0]. This table is now a pandas DataFrame - a structure similar to an Excel sheet, with rows and columns. This is stored in the df variable.
  • df.to_excel(...): The to_excel method is a built-in pandas function that does exactly what it says: it writes the DataFrame to an Excel file. We include index=False because, by default, pandas adds a numbered index column; this argument prevents that.
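To make the DataFrame idea concrete, here is a toy example (the data is made up; it assumes only that pandas is installed) showing the rows-and-columns structure that tables[0] would hold:

```python
import pandas as pd

# A toy DataFrame standing in for one extracted from a PDF
df = pd.DataFrame({"Quarter": ["Q1", "Q2", "Q3"], "Revenue": [1200, 1500, 1800]})

print(df.shape)          # (3, 2): three rows, two columns
print(list(df.columns))  # ['Quarter', 'Revenue']
```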

Run this script, and you'll find a new Excel file named financial_output.xlsx in your folder containing the exact data from the PDF table.

Handling More Complex PDFs

Real-world PDFs are rarely that simple. Tables might span multiple pages, or a single page might contain several tables you need to extract. Here’s how to handle these common challenges.

Extracting Tables from Multiple Pages

What if your report has a table on every page? Instead of running the script page by page, you can instruct tabula-py to check all pages at once. Simply change the pages argument to 'all'.

# Read all tables from all pages of the document
all_tables = tabula.read_pdf(file_path, pages='all')

print(f"Found {len(all_tables)} tables in the entire document.")

You can also specify a range of pages, like pages='2-5'.
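When a single table spans every page, you usually want one combined sheet rather than one DataFrame per page. A small sketch using pandas' concat, with made-up toy DataFrames standing in for what tabula would return:

```python
import pandas as pd

# Toy stand-ins for the list of DataFrames tabula.read_pdf(..., pages='all') returns
page1 = pd.DataFrame({"Item": ["A", "B"], "Qty": [1, 2]})
page2 = pd.DataFrame({"Item": ["C"], "Qty": [3]})

# Stack the per-page tables into one DataFrame, renumbering the rows
combined = pd.concat([page1, page2], ignore_index=True)

print(len(combined))           # 3
print(list(combined["Item"]))  # ['A', 'B', 'C']
```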

PDFs with Multiple Tables on One Page

If a single page contains multiple distinct tables, tabula.read_pdf will return a list with several DataFrames in it. A great way to handle this is to save each table to a different sheet within the same Excel file.

This script improves on our original by looping through all the found tables and writing them to sheets named 'Table 1', 'Table 2', and so on.

import tabula
import pandas as pd

file_path = "report_with_multiple_tables.pdf"
output_excel_path = "multi_table_output.xlsx"

# This gives us a list of all tables found
tables_on_page = tabula.read_pdf(file_path, pages='1', multiple_tables=True)

if tables_on_page:
    # Use ExcelWriter to save multiple DataFrames to one file with multiple sheets
    with pd.ExcelWriter(output_excel_path) as writer:
        for i, df in enumerate(tables_on_page):
            # We name each sheet sequentially
            df.to_excel(writer, sheet_name=f'Table_{i+1}', index=False)

    print(f"{len(tables_on_page)} tables saved to different sheets in {output_excel_path}")
else:
    print("No tables were found.")

Helping Tabula Find Tricky Tables

Sometimes, PDFs without clear border lines can confuse tabula's auto-detection. If it’s failing to pick up a table or grabbing extra unwanted text, you can tell it exactly where to look by specifying a coordinate-based area. The area parameter takes a list of four values: [top, left, bottom, right]. These coordinates are measured in points from the top-left corner of the page.

You can use a tool like Adobe Acrobat or a free online alternative to find these coordinates. It may take some trial and error, but it gives you fine-tuned control.

# Coordinates: [top, left, bottom, right] in points
# Tweak these values to select the exact area of your table
table_location = [150, 50, 750, 550]

specific_table = tabula.read_pdf(
    file_path,
    pages=1,
    area=table_location
)

Always Clean Your Data

Data extraction is often an imperfect process. You might end up with empty rows, incorrect headers, or columns of numbers that are mistakenly read as text. This is where pandas truly shines. Before saving your data to Excel, you can perform quick cleaning operations to make it perfect.

Here are a few common cleaning steps:


# Let's assume 'df' is a messy DataFrame we just extracted
if not df.empty:
    # 1. Drop rows that are completely empty
    df.dropna(how='all', inplace=True)

    # 2. Reset the column headers if the first row is actually the header
    new_header = df.iloc[0]   # Grab the first row
    df = df[1:]               # Drop the original header row
    df.columns = new_header   # Set the header row to the correct values

    # 3. Rename a specific column for clarity
    df.rename(columns={'Yr End Date': 'Year End'}, inplace=True)

    # 4. Convert a column to a numeric type if it was read as text
    # errors='coerce' turns any non-numeric values into NaN (Not a Number)
    if 'Revenue' in df.columns:
        df['Revenue'] = pd.to_numeric(df['Revenue'], errors='coerce')

    # Now the data is clean and ready to save
    df.to_excel("cleaned_output.xlsx", index=False)

A Quick Note on Scanned PDFs (Images)

It's important to understand there are two types of PDFs: "true" (text-based) and "scanned" (image-based). All the methods above work flawlessly on true PDFs where the text and tables are encoded in the file.

If your PDF is essentially a photograph of a document, Python can't "read" the text directly. In that case, you need to use Optical Character Recognition (OCR) to convert the image into text first. Libraries such as pytesseract can handle this, but it adds another layer of complexity. If tabula-py returns nothing from your file, there's a good chance you’re working with a scanned PDF.
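One quick way to tell the two types apart is to check how much text a page actually yields. This is a rough heuristic of our own (the looks_scanned helper and its 20-character threshold are assumptions, not part of any library); with a real file, the page text would come from a PDF text extractor such as pypdf's page.extract_text():

```python
def looks_scanned(page_text: str) -> bool:
    # A text-based PDF page yields real text; a scanned image yields little or none
    return len(page_text.strip()) < 20

print(looks_scanned(""))                           # True: no text layer, likely an image
print(looks_scanned("Revenue Q1 1,200 Q2 1,500"))  # False: real text is present
```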

Final Thoughts

Using Python, tabula-py, and pandas is a seriously powerful way to automate one of the most tedious data tasks around. Once set up, this process lets you pull information locked in PDFs directly into a structured, usable Excel format, saving you from a world of manual copy-pasting and cleanup.

Of course, PDFs are just one piece of the data puzzle. We know that getting clear insights often involves pulling data from a dozen different places, like Google Analytics, Shopify, Salesforce, and Facebook Ads. This is precisely why we created Graphed. It connects directly to all your marketing and sales data sources, allowing you to create real-time dashboards and reports just by asking questions in plain English - no coding skills required.
