Profiling data in PowerPivot: Detecting duplicates
October 15, 2012
Generally, when you are given an unexplored dataset, your first task should be to identify some of its properties so you can make a more informed decision about the accuracy of reporting you can achieve with it. Many Business Intelligence professionals have robust tools at their disposal that can be used to deeply explore a dataset; however, data analysts can use the inherent capabilities of PowerPivot to get the job done.
One of the most common data profiling assessments is identifying duplicate values. This is important because we want to decide which sets of data are good candidates for generating lookup tables in our model, and it is particularly relevant when querying transactional sources that do not have clearly defined dimensions.
For example, say you want to load a product table into your model. In this case you are actually querying the enterprise data warehouse, and the table you are using stores a Type II slowly changing dimension. It can therefore legitimately contain duplicate product names: it stores the product attributes as they have changed over time:
Notice the highlighted product (AWC Logo Cap) has three entries in that table. Over time, the product had different standard costs associated with it, and the data warehousing team wanted to keep all history associated with past records.
If we intend, however, to mesh this data with another dataset that somebody sent to us (one which did not come from the data warehouse and only has the product name on it, not the product key), we would want to know:
a) Are there duplicates in this table? We want to ensure this table can be used to generate a valid lookup table in the model, as PowerPivot requires unique values in the key column of a lookup table when establishing relationships.
b) If there are duplicate values, are they a data quality problem or do they exist for legitimate reasons?
In our example, we know there are no data quality issues, and we could easily generate a valid lookup out of this table simply by going to the Table Import Wizard and selecting the [EnglishProductName] field while specifying a ‘Group By’ aggregation (using the same techniques detailed in this blog entry), or by issuing a T-SQL DISTINCT clause in the underlying query.
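For intuition, here is what that DISTINCT (or single-column Group By) collapse does, sketched in plain Python. This is an illustrative analogue only, not the wizard or T-SQL itself, and the sample rows are invented for the example:

```python
# Sketch: collapse a table to its distinct names, as a T-SQL
# SELECT DISTINCT (or a single-column Group By) would do.
# Illustrative analogue only; sample data is hypothetical.
def distinct_values(rows, column):
    seen = []
    for row in rows:
        value = row[column]
        if value not in seen:
            seen.append(value)  # keep first-seen order
    return seen

rows = [
    {"EnglishProductName": "AWC Logo Cap"},
    {"EnglishProductName": "AWC Logo Cap"},
    {"EnglishProductName": "Sport-100 Helmet"},
]
print(distinct_values(rows, "EnglishProductName"))
# ['AWC Logo Cap', 'Sport-100 Helmet']
```

The deduplicated list is exactly what a valid lookup table needs: one row per key value.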
If the duplicates stem from data quality issues, we could instead ask the data source owner to fix them at the source, before the data even makes it into PowerPivot.
So now that we have determined that it is worthwhile identifying which records are duplicates, we need to know how to do this quickly using DAX.
On the dataset above, this can be done using the following DAX calculation on a new column:
=CALCULATE( COUNTROWS( DimProduct ),
ALLEXCEPT( DimProduct, DimProduct[EnglishProductName] ) )
In other words, we count all rows in the table after establishing a filter context in which only the column we are testing remains:
=CALCULATE( COUNTROWS( 'Table' ),
ALLEXCEPT( 'Table', 'Table'[Column Tested for Duplicates] ) )
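The same count-per-value logic can be sketched outside PowerPivot in plain Python: for each row, count how many rows share its value in the tested column. This is a hypothetical analogue of the CALCULATE/COUNTROWS/ALLEXCEPT pattern (the sample costs are invented), not DAX itself:

```python
# Sketch of the duplicate-count calculated column in plain Python:
# for each row, count the rows sharing its value in the tested column.
# Hypothetical analogue of CALCULATE/COUNTROWS/ALLEXCEPT; sample data invented.
from collections import Counter

def duplicate_counts(rows, key):
    counts = Counter(row[key] for row in rows)
    return [counts[row[key]] for row in rows]

products = [
    {"ProductKey": 1, "EnglishProductName": "AWC Logo Cap",     "StandardCost": 5.70},
    {"ProductKey": 2, "EnglishProductName": "AWC Logo Cap",     "StandardCost": 5.23},
    {"ProductKey": 3, "EnglishProductName": "AWC Logo Cap",     "StandardCost": 6.92},
    {"ProductKey": 4, "EnglishProductName": "Sport-100 Helmet", "StandardCost": 12.03},
]
print(duplicate_counts(products, "EnglishProductName"))
# [3, 3, 3, 1]
```

Any value greater than 1 in the result marks a duplicate, which is what the column filter in the next step exploits.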
It is then easy to filter that column to show only the products with more than one entry in the table. Notice how the column filter also quickly tells us there are products with 2 entries and others with 3:
I used DAX’s ALLEXCEPT function because we only wanted to test for duplicates over one product attribute ([EnglishProductName]). Had we been checking for duplicates on a one-column table (one containing only the column we are testing), we could have avoided it:
=CALCULATE( COUNTROWS( 'DimProduct' ) )
This is because in a one-column table there is no need to exclude other attributes that could lead to duplicates. When other columns exist, on the other hand, there is always the risk of attributes having different values for the same dimension member (products, in this case): the product standard cost changing over time, as mentioned above, is one example. As such, we must remove all those attributes from the filter context when we want to narrow our investigation to duplicate names only.
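The contrast between the two filter contexts can be sketched in Python (an illustrative analogue with invented cost values, not DAX): matching rows on every column treats each Type II version as unique, while matching on the name alone exposes the duplicates.

```python
# Sketch contrasting the two filter contexts (illustrative only, not DAX;
# cost values are hypothetical). Three Type II versions of one product:
rows = [
    ("AWC Logo Cap", 5.70),
    ("AWC Logo Cap", 5.23),
    ("AWC Logo Cap", 6.92),
]

# Full-row match (analogous to keeping every column in the filter
# context): each version row is unique, so each count is 1.
full_row = [sum(1 for r in rows if r == row) for row in rows]

# Name-only match (analogous to ALLEXCEPT keeping just the name):
# all three rows share the name, so each count is 3.
name_only = [sum(1 for r in rows if r[0] == row[0]) for row in rows]

print(full_row)   # [1, 1, 1]
print(name_only)  # [3, 3, 3]
```

This is precisely why ALLEXCEPT is needed: without removing the other attributes from the filter context, every row looks unique and no duplicates surface.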
Once you have identified the problematic records, it will be easier to communicate the issue to the data source owner and to identify a course of action more rapidly. Even if you decide to use this table as a lookup in your data model, you will now be aware of potential future issues for which you can proactively plan.