How to Remove Duplicates in Power BI

Cody Schneider

Duplicate data in your reports can quietly wreck their credibility, producing skewed charts and inflated totals that lead to bad business decisions. Cleaning these duplicates is a fundamental step in building reliable Power BI dashboards. In this guide, we’ll walk through several straightforward methods to find and eliminate duplicate records using Power BI's built-in tools.

Why Duplicates Are a Serious Problem in Power BI

Before diving into the "how," it's worth understanding the "why." Duplicate rows aren't just a minor annoyance; they actively damage your data model and visualizations. Here’s a quick rundown of the problems they cause:

  • Inflated Metrics: This is the most obvious issue. If you have duplicate sales records, your total revenue will be wrong. Aggregations like SUM(), COUNT(), and AVERAGE() will all return incorrect values, making performance look better (or worse) than it actually is.

  • Relationship Issues: Power BI models rely on relationships between tables, often using a unique key column (like an Order ID or Customer ID). Duplicates in a key column on the "one" side of a one-to-many relationship will cause the relationship to fail or behave unexpectedly.

  • Slow Performance: More rows mean a larger data model. A bloated model with unnecessary data consumes more memory and makes your reports slower to load and interact with.

  • Lost Trust: When stakeholders spot inconsistencies in the data, they lose trust in the report and the analyst who built it. Accurate data is the foundation of any useful dashboard.

Getting Started: Entering the Power Query Editor

Most data cleaning tasks in Power BI happen in the Power Query Editor, not in the main report view. This is the "back-end" of your report where you shape, clean, and transform your data before it gets loaded into the data model.

To access it, go to the Home tab in the main Power BI Desktop window and click on Transform data. This will open the Power Query Editor, where we'll perform all the steps below.

Method 1: The Simple Solution – 'Remove Duplicates'

For many scenarios, removing duplicates is just a two-click process. This method works by identifying rows that are identical based on the columns you select.

Let's say you have a simple table of customer mailing list sign-ups, and you want to make sure each email address appears only once.

Step-by-Step Instructions:

  1. In the Power Query Editor, find and select the column that defines the duplicate. In our example, this is the 'Email Address' column. Just click the column header to select it.

  2. With the column selected, navigate to the Home tab in the Power Query ribbon.

  3. Click the Remove Rows dropdown and select Remove Duplicates.

That's it! Power BI will scan the 'Email Address' column and remove any rows containing an email that has already appeared in a previous row.
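Behind the scenes, this button adds a Table.Distinct step to your query. A minimal sketch of the generated M, assuming the previous step is named Source and the key column is 'Email Address':

    = Table.Distinct(Source, {"Email Address"})

If you omit the column list entirely, Table.Distinct compares entire rows, which matches selecting every column before clicking the button.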

Handling Duplicates Across Multiple Columns

Sometimes a duplicate is defined by a combination of columns. For instance, a duplicate order might have the same 'OrderID' and the same 'ProductID'. To handle this, simply select all the relevant columns before clicking Remove Duplicates.

Hold down the Ctrl key while clicking each column header (e.g., 'OrderID', 'ProductID'). Once all are selected, follow the same steps as above: Home > Remove Rows > Remove Duplicates.

Important Note: The standard 'Remove Duplicates' feature keeps the first occurrence of a value it finds and removes the subsequent ones. The order of your data matters. If you need to keep the last record (e.g., the most recent entry), you must first sort your data accordingly (e.g., sort by date descending) before using this feature.
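One subtlety worth knowing: because Power Query evaluates steps lazily, a sort step immediately before Table.Distinct is not always respected. A common workaround is to wrap the sorted table in Table.Buffer to pin the row order before deduplicating. A sketch, assuming a hypothetical 'OrderDate' column holds the recency you care about:

    = Table.Distinct(
          Table.Buffer(Table.Sort(Source, {{"OrderDate", Order.Descending}})),
          {"OrderID", "ProductID"}
      )

With the newest rows sorted to the top and buffered, the "keep the first occurrence" behavior effectively becomes "keep the most recent."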

Method 2: Investigating Your Data First with 'Keep Duplicates'

Sometimes you don't want to immediately delete duplicates. You may need to review them first to understand why they exist. Were they caused by a data entry error? A system glitch? The Keep Duplicates feature is perfect for this kind of detective work.

It works like the 'Remove Duplicates' feature, but in reverse: it removes all the unique rows and leaves you with only the rows whose values appear more than once in your selected column(s). In essence, you are isolating the problem rows for review.

Step-by-Step Instructions:

  1. Select the column(s) that you suspect contain duplicates.

  2. Navigate to the Home tab.

  3. Click the Keep Rows dropdown and select Keep Duplicates.

Power Query will then filter your table to show only the records that are problematic. This gives you a clean view of the duplicates you need to handle, which you can then export or analyze further before deciding on a final cleaning strategy.
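The step Power Query generates for 'Keep Duplicates' is roughly equivalent to counting rows per key and keeping only the keys that appear more than once. A hand-written sketch of the same idea, assuming a Source step and an 'Email Address' key (the exact generated code may differ by Power BI version):

    = let
          counts  = Table.Group(Source, {"Email Address"}, {{"RowCount", Table.RowCount, type number}}),
          dupKeys = Table.SelectRows(counts, each [RowCount] > 1)
      in
          Table.Join(Source, {"Email Address"},
                     Table.RemoveColumns(dupKeys, {"RowCount"}), {"Email Address"},
                     JoinKind.Inner)

The inner join keeps every original row whose email appears in the duplicate-key list, duplicates included, so you can inspect them side by side.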

Method 3: Advanced Duplicate Removal with Grouping

The simple 'Remove Duplicates' tool always keeps the first record it finds. But what if you need more control? What if you want to keep the record with the most recent date, the highest sale value, or the most complete information?

This is where 'Group By' comes in. This advanced technique allows you to define your own logic for which record to keep. Let's use an example: we have multiple product update records for the same 'ProductID', and we need to keep only the one with the most recent 'LastModifiedDate'.

Step-by-Step Instructions:

  1. Group Your Data: In the Power Query Editor, go to the Transform tab and click Group By.

  2. Configure the Grouping:

    • In the 'Group By' window, select the column that identifies your unique entities (e.g., 'ProductID').

    • Give your new column a name, such as "AllData".

    • For the Operation, choose All Rows.

When you click OK, your table will shrink. You'll have one row for each 'ProductID', and a new column ("AllData") that contains a nested table with all the original rows for that specific 'ProductID'.

  3. Add a Custom Column to Select the Best Record:

    • Now, go to the Add Column tab and click Custom Column.

    • Name your new column, perhaps "LatestRecord".

    • Enter the following formula in the editor. This formula tells Power Query to find the row with the maximum 'LastModifiedDate' within each nested table:

    Table.Max([AllData], "LastModifiedDate")

  4. Expand and Clean Up:

    • You should now have a column of [Record] objects. Click the expand icon (two opposing arrows) in the header of your "LatestRecord" column. Deselect "Use original column name as prefix" and click OK.

    • This will expand the record to bring back all your original columns, but now you’ll only have the single record you wanted to keep for each 'ProductID'.

    • Finally, remove the now-unnecessary "AllData" column to finish the process. Right-click its header and choose 'Remove'.

This grouping method is incredibly powerful because you can swap out the custom column formula to fit your needs. Instead of Table.Max(), you could use Table.Min() to get the oldest record, or, for more complex logic, first sort each nested table with Table.Sort() and then take the first row with Table.FirstN().
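Written out as M, the whole grouping pattern looks roughly like this. This is a sketch, assuming a Source step with 'ProductID' and 'LastModifiedDate' columns plus whatever other fields your table carries:

    = let
          grouped  = Table.Group(Source, {"ProductID"}, {{"AllData", each _, type table}}),
          latest   = Table.AddColumn(grouped, "LatestRecord", each Table.Max([AllData], "LastModifiedDate")),
          cleaned  = Table.RemoveColumns(latest, {"AllData"}),
          // Expand every field except ProductID, which is already a column
          expanded = Table.ExpandRecordColumn(cleaned, "LatestRecord", {"LastModifiedDate"})
      in
          expanded

Note that when expanding, you leave out the grouping key ('ProductID') since it already exists as a column; expanding it again would force Power Query to rename the duplicate.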

Method 4: Handling Duplicates with DAX

While Power Query is the preferred place to physically remove duplicates from your model, you can also handle them virtually on the front-end using DAX formulas. This doesn't clean your data model but instead performs calculations that ignore duplicates on the fly. This can be useful when you want to preserve the raw data but need a distinct count in a specific visual.

The main DAX functions for this are DISTINCT(), VALUES(), and DISTINCTCOUNT().

Using DISTINCTCOUNT for a Simple Count

The most common use case is getting a distinct count of something. Instead of dropping a 'CustomerID' field into a card visual and using the default Count aggregation (which would include duplicates), you can create an explicit measure:

Unique Customer Count = DISTINCTCOUNT('Sales'[CustomerID])

This creates a new measure that will always calculate the number of unique customer IDs, regardless of how many times each one appears in your data table.

Calculating Sums over Unique Values with SUMX

Another common scenario is needing to sum values across a duplicated list. Imagine you have a table where customer subscription revenue is listed every time a project happens, causing the annual revenue to be duplicated. To calculate the actual total revenue, you would need to sum the revenue for each unique customer only once.
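Putting that together as a measure, using the example's 'Subscriptions' table and column names (a sketch; adjust the names to match your model):

    Unique Revenue Total =
    SUMX(
        VALUES('Subscriptions'[CustomerID]),
        CALCULATE(MAX('Subscriptions'[AnnualRevenue]))
    )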

This formula works in a few steps:

  • VALUES('Subscriptions'[CustomerID]) creates a temporary, single-column table of unique Customer IDs.

  • SUMX() iterates through each unique Customer ID in that table.

  • For each customer, CALCULATE(MAX('Subscriptions'[AnnualRevenue])) finds their corresponding annual revenue. We use MAX here with the assumption that the duplicated revenue value is the same for each instance, so MAX, MIN, or AVERAGE would all return the correct number.

  • Finally, SUMX sums up the revenue for each unique customer, giving you an accurate total.

Remember, the DAX approach doesn't change your underlying data. It's best practice to clean data in Power Query for a leaner, more performant, and more intuitive model. But for quick, specific calculations, DAX is a fantastic tool.

Final Thoughts

Cleaning duplicate data is an unavoidable and essential step for anyone building reports in Power BI. Whether you choose the simple push-button 'Remove Duplicates' feature or use more advanced techniques with grouping in Power Query, taking the time to ensure your data is unique and accurate is what separates a confusing report from one that drives real, dependable insights.

The manual data preparation we covered, while powerful, highlights the friction that often exists between raw data and a finished report. To make this process feel less like a chore, we built Graphed. It connects to your data sources and automates the entire reporting process, from cleaning and preparation to analysis and visualization, all guided by simple natural language. Instead of stepping through multiple menus to remove duplicates, you can just build your report and let our AI analyst ensure the underlying data is accurate and ready for analysis.