How to Extract Data from PDF to Excel Using Python
Trying to copy and paste tables from a PDF into Excel is a recipe for frustration. What you get is often a garbled mess of text that takes longer to clean up than it would to retype manually. If this is a regular part of your workflow, there’s a much better way. This guide will walk you through exactly how to use Python to automatically pull data from PDFs and save it into a perfectly formatted Excel file.
Why Use Python for This Task?
While there are online converters, they often have limitations on file size, privacy concerns, or just don't handle complex layouts well. Python, on the other hand, gives you complete control and repeatability. Once you write a script, you can reuse it for dozens or thousands of similar files without any extra effort.
Here’s why it’s a great choice:
- It's free: Python and its data-processing libraries are open-source, so you can build powerful tools without paying for expensive software licenses.
- It's powerful: Python can handle PDFs with dozens of pages, multiple tables, and weird formatting that would break simpler tools.
- It’s automatable: It allows you to build a reliable solution that you can run again and again, saving you countless hours of manual work.
Setting Up Your Environment
Before we can start extracting data, we need to install a few essential Python libraries. Don't worry if this sounds intimidating; it's just a couple of simple commands. We’ll be using three main libraries:
- Tabula-py: An excellent library specifically designed to find and extract tables from PDFs.
- Pandas: The most popular library in Python for data manipulation and analysis. We'll use it to organize the extracted data and save it to an Excel file.
- Openpyxl: A library that Pandas uses behind the scenes to write to .xlsx files.
You can install all three with one command. Open your terminal (or Command Prompt on Windows) and type the following:
```shell
pip install tabula-py pandas openpyxl
```
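Once the install finishes, a quick sanity check confirms the packages are importable. This minimal sketch only covers the two libraries you can verify without Java; tabula-py is easiest to test separately once Java is set up:

```python
# Confirm pandas and openpyxl installed correctly by importing them
# and printing their versions
import pandas
import openpyxl

print(pandas.__version__)
print(openpyxl.__version__)
```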
One crucial note: The tabula-py library is a Python wrapper for a Java library called Tabula. This means you must have Java installed on your computer for it to work. If you don't have it, don't worry - you can grab the latest version from the official Java Downloads page. A quick download and install is all it takes.
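Not sure whether Java is already installed? One way to check from Python, using only the standard library, is to look for the `java` executable on your PATH:

```python
import shutil

# tabula-py shells out to the Java-based Tabula engine, so the 'java'
# executable must be discoverable on your PATH
java_path = shutil.which("java")
if java_path:
    print(f"Java found at {java_path}")
else:
    print("Java not found - install it before running tabula-py")
```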
Extracting a Simple Table from a PDF
Let's start with a common scenario: you have a one-page PDF with a single, well-defined table. Our goal is to pull that table into an Excel spreadsheet.
Imagine you have a PDF file named financial_summary.pdf in the same folder as your Python script.
Here is the full Python script to do it:
```python
import tabula
import pandas as pd

# Define the path to your PDF file
file_path = "financial_summary.pdf"

# Use tabula to read the table(s) from the PDF
# This returns a list of pandas DataFrames
tables = tabula.read_pdf(file_path, pages='1')

# For this simple case, we know there's only one table, so we grab the first item
if tables:
    df = tables[0]

    # Define the name for your output Excel file
    output_excel_path = "financial_output.xlsx"

    # Use pandas to save the DataFrame to an Excel file
    # The index=False argument prevents pandas from writing row indices to the file
    df.to_excel(output_excel_path, index=False)

    print(f"Table successfully extracted and saved to {output_excel_path}")
else:
    print("No tables found in the PDF.")
```
Breaking Down the Code
- `import tabula` and `import pandas as pd`: We start by importing the libraries we need. `pd` is the standard, community-accepted alias for pandas.
- `file_path = "financial_summary.pdf"`: We store the name of our PDF in a variable for easy access.
- `tables = tabula.read_pdf(file_path, pages='1')`: This is where the magic happens. We tell tabula to read our PDF file. The `read_pdf` function always returns a list of tables because some pages might have more than one. We use `pages='1'` to tell it to only look at the first page.
- `df = tables[0]`: Since `tables` is a list, we access the first (and in this case, only) table with `tables[0]`. This table is now a pandas DataFrame - a structure similar to an Excel sheet, with rows and columns. We store it in the `df` variable.
- `df.to_excel(...)`: The `to_excel` method is a built-in pandas function that does exactly what it says: it writes the DataFrame to an Excel file. We include `index=False` because, by default, pandas adds a numbered index column; this argument prevents that.
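If DataFrames are new to you, here's a tiny standalone illustration of the rows-and-columns structure tabula hands back, using made-up sample data (not taken from any real PDF):

```python
import pandas as pd

# Made-up sample data standing in for one extracted table
df = pd.DataFrame({
    "Quarter": ["Q1", "Q2", "Q3"],
    "Revenue": [1200, 1500, 1100],
})

print(df.shape)          # (3, 2) - three rows, two columns
print(list(df.columns))  # ['Quarter', 'Revenue']
```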
Run this script, and you'll find a new Excel file named financial_output.xlsx in your folder containing the exact data from the PDF table.
Handling More Complex PDFs
Real-world PDFs are rarely that simple. Tables might span multiple pages, or a single page might contain several tables you need to extract. Here’s how to handle these common challenges.
Extracting Tables from Multiple Pages
What if your report has a table on every page? Instead of running the script page by page, you can instruct tabula-py to check all pages at once. Simply change the pages argument to 'all'.
```python
# Read all tables from all pages of the document
all_tables = tabula.read_pdf(file_path, pages='all')

print(f"Found {len(all_tables)} tables in the entire document.")
```
You can also specify a range of pages, like `pages='2-5'`.
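If you build those page-range strings programmatically (say, while looping over many reports), a tiny helper keeps the formatting in one place. This is a hypothetical convenience function, not part of tabula-py itself:

```python
def pages_arg(start: int, end: int) -> str:
    """Build a tabula-style page-range string such as '2-5'.

    Hypothetical helper, not part of tabula-py itself.
    """
    return f"{start}-{end}" if end > start else str(start)

print(pages_arg(2, 5))  # 2-5
print(pages_arg(3, 3))  # 3
```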
PDFs with Multiple Tables on One Page
If a single page contains multiple distinct tables, tabula.read_pdf will return a list with several DataFrames in it. A great way to handle this is to save each table to a different sheet within the same Excel file.
This script improves on our original by looping through all the found tables and writing them to sheets named 'Table 1', 'Table 2', and so on.
```python
import tabula
import pandas as pd

file_path = "report_with_multiple_tables.pdf"
output_excel_path = "multi_table_output.xlsx"

# This gives us a list of all tables found
tables_on_page = tabula.read_pdf(file_path, pages='1', multiple_tables=True)

if tables_on_page:
    # Use ExcelWriter to save multiple DataFrames to one file with multiple sheets
    with pd.ExcelWriter(output_excel_path) as writer:
        for i, df in enumerate(tables_on_page):
            # We name each sheet sequentially
            df.to_excel(writer, sheet_name=f'Table_{i+1}', index=False)

    print(f"{len(tables_on_page)} tables saved to different sheets in {output_excel_path}")
else:
    print("No tables were found.")
```
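To sanity-check a multi-sheet workbook like this, you can read it straight back with pandas; passing `sheet_name=None` to `read_excel` loads every sheet into a dict. The snippet below is a self-contained sketch that writes two made-up tables and reads them back:

```python
import pandas as pd

# Two made-up tables standing in for what tabula would return
tables = [
    pd.DataFrame({"A": [1, 2]}),
    pd.DataFrame({"B": [3, 4]}),
]

path = "multi_table_check.xlsx"
with pd.ExcelWriter(path) as writer:
    for i, df in enumerate(tables):
        df.to_excel(writer, sheet_name=f"Table_{i+1}", index=False)

# sheet_name=None returns a dict of {sheet_name: DataFrame}
sheets = pd.read_excel(path, sheet_name=None)
print(sorted(sheets))  # ['Table_1', 'Table_2']
```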
Helping Tabula Find Tricky Tables
Sometimes, PDFs without clear border lines can confuse tabula's auto-detection. If it’s failing to pick up a table or grabbing extra unwanted text, you can tell it exactly where to look by specifying a coordinate-based area. The area parameter takes a list of four values: [top, left, bottom, right]. These coordinates are measured in points from the top-left corner of the page.
You can use a tool like Adobe Acrobat or a free online alternative to find these coordinates. It may take some trial and error, but it gives you fine-tuned control.
```python
# Coordinates: [top, left, bottom, right] in points
# Tweak these values to select the exact area of your table
table_location = [150, 50, 750, 550]

specific_table = tabula.read_pdf(
    file_path,
    pages=1,
    area=table_location
)
```
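If it's easier to measure your table with a ruler than in points, remember there are exactly 72 points per inch. A one-line converter (a hypothetical helper, not part of tabula-py) does the arithmetic:

```python
POINTS_PER_INCH = 72

def inches_to_points(inches: float) -> float:
    """Convert a measurement in inches to PDF points (72 points per inch)."""
    return inches * POINTS_PER_INCH

# e.g. a table whose top edge sits 2 inches from the top of the page
print(inches_to_points(2.0))  # 144.0
```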
Always Clean Your Data
Data extraction is often an imperfect process. You might end up with empty rows, incorrect headers, or columns of numbers that are mistakenly read as text. This is where pandas truly shines. Before saving your data to Excel, you can perform quick cleaning operations to make it perfect.
Here are a few common cleaning steps:
```python
# Let's assume 'df' is a messy DataFrame we just extracted
if not df.empty:
    # 1. Drop rows that are completely empty
    df.dropna(how='all', inplace=True)

    # 2. Reset the column headers if the first row is actually the header
    new_header = df.iloc[0]  # Grab the first row
    df = df[1:]              # Drop the original header row
    df.columns = new_header  # Set the header row to the correct values

    # 3. Rename a specific column for clarity
    df.rename(columns={'Yr End Date': 'Year End'}, inplace=True)

    # 4. Convert a column to a numeric type if it was read as text
    # errors='coerce' turns any non-numeric values into NaN (Not a Number)
    if 'Revenue' in df.columns:
        df['Revenue'] = pd.to_numeric(df['Revenue'], errors='coerce')

    # Now the data is clean and ready to save
    df.to_excel("cleaned_output.xlsx", index=False)
```
A Quick Note on Scanned PDFs (Images)
It's important to understand there are two types of PDFs: "true" (text-based) and "scanned" (image-based). All the methods above work flawlessly on true PDFs where the text and tables are encoded in the file.
If your PDF is essentially a photograph of a document, Python can't "read" the text directly. In that case, you need to use Optical Character Recognition (OCR) to convert the image into text first. Libraries such as pytesseract can handle this, but it adds another layer of complexity. If tabula-py returns nothing from your file, there's a good chance you’re working with a scanned PDF.
Final Thoughts
Using Python, tabula-py, and pandas is a seriously powerful way to automate one of the most tedious data tasks around. Once set up, this process lets you pull information locked in PDFs directly into a structured, usable Excel format, saving you from a world of manual copy-pasting and cleanup.
Of course, PDFs are just one piece of the data puzzle. We know that getting clear insights often involves pulling data from a dozen different places, like Google Analytics, Shopify, Salesforce, and Facebook Ads. This is precisely why we created Graphed. It connects directly to all your marketing and sales data sources, allowing you to create real-time dashboards and reports just by asking questions in plain English - no coding skills required.