Automate lookup table maintenance when using multiple Excel sources in a PowerPivot model

When developing a PowerPivot model, one of the most important tasks is deciding which tables will serve as lookups.  The urgency is due to the fact that those tables define important elements such as the default granularity of aggregation as well as the direction of filter context propagation (the latter explained here).  Without valid lookups in your model, developing a PowerPivot report can be a bit frustrating, as you constantly hit walls that arise when breaking fundamental rules of its calculation engine.

I explained here a solution for creating lookup tables when the model uses data that exists in Excel worksheets.  The solution works well for satisfying one-time ad hoc reporting requests, or when building BISM Tabular prototypes.

However, what if you want to avoid having to manually maintain the values that make up those lookup tables?  By using a slight variation of the data merging method detailed by Debra Dalgleish here, we can achieve this automation. In this case, though, we won’t be merging files with identical structures but instead selecting common columns across datasets with the explicit objective of generating lookup tables.

Suppose we have two Excel file sources:

ORDERS1.XLSX

[image: ORDERS1.XLSX data]

and ORDERS2.XLSX

[image: ORDERS2.XLSX data]

Note that we have two files with different structures (column names and total number of columns).  If we wanted to relate both tables in a PowerPivot model (for example, using the common Order Date column on both sides), we would get the famous error:

[image: relationship error message]

As already mentioned, what we need is to create a valid lookup table which serves as an intermediary between these two base tables (base, as neither one can serve as a lookup). In this case, instead of using the ‘remove duplicates’ feature in Excel, we are going to leverage the ability to use SQL syntax when querying multiple Excel files.

Add a connection in the Excel window (not the PowerPivot window) by browsing to the Excel file source:

[image: adding the workbook connection]

In the PowerPivot window, open the connection to the workbook by going to ‘Existing connections’, ‘Workbook Connections’:

[image: Existing Connections dialog, Workbook Connections]

When opening that workbook connection, change the friendly name of the in-memory table to ‘Date Lookup’.

Open the ‘Table Properties’ menu on the PowerPivot window, switch from ‘Table Preview’ to ‘Query Editor’ and change the query to:

SELECT [Order Date]  FROM [Sheet1$]
UNION
SELECT [Order Date] FROM `C:\temp\Orders2.xlsx`.[Sheet1$]

With the result of this query, we can now designate a valid lookup table that is used to connect all other tables appropriately:

[image: model with the Date Lookup table related to both base tables]
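Note that the query uses UNION rather than UNION ALL: UNION removes duplicate rows, which is precisely what turns the combined list of order dates into a set of unique values suitable for a lookup table. The same pattern also extends to additional sources; as a sketch only, if a third data dump arrived as a hypothetical file Orders3.xlsx in the same folder, the query could grow to:

SELECT [Order Date]  FROM [Sheet1$]
UNION
SELECT [Order Date] FROM `C:\temp\Orders2.xlsx`.[Sheet1$]
UNION
SELECT [Order Date] FROM `C:\temp\Orders3.xlsx`.[Sheet1$]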

We could now generate DAX artifacts to consolidate values across the model.  For example, the following calculated column can compute an Order Count across all dates (in both original tables):

=CALCULATE(
    COUNTROWS( VALUES( Orders1[Order Number] ) )
    + COUNTROWS( VALUES( Orders2[Invoice Number] ) )
)

[image: calculated column result]
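The same consolidation can also be expressed as a measure rather than a calculated column, so the total responds to whatever slicing is done on the Date Lookup table in a pivot table. As a sketch only, assuming the two source tables carried hypothetical amount columns named [Sales Amount] and [Invoice Amount], a consolidated measure could look like:

=SUM( Orders1[Sales Amount] ) + SUM( Orders2[Invoice Amount] )

Because both base tables relate to Date Lookup, filtering the pivot by [Order Date] propagates to Orders1 and Orders2, and the measure aggregates the matching rows from each side.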

Note at this point that, unlike the technique described by Debra, the structure of the files here was not identical: one dataset listed order numbers whereas the other listed invoice numbers. Using a combination of custom SQL syntax and DAX, we are able to achieve the structure needed to consolidate both lists of values into a comprehensive list that serves as a perfect candidate for a lookup (dimensional) table in our model.

This method provides a way to better automate PowerPivot reports that get data from outside traditional corporate repositories. Many systems can be automated to generate Excel ‘data dumps’.  Traditionally, these have been a challenge to fit into PowerPivot models with clearly defined dimensional schemas.  This technique, however, provides a viable method of maintaining the values of lookup tables without the need for manual intervention: as new data is added to either file, the load query automatically captures all new values from both sides and, in doing so, regenerates the lookup table.


Profiling data in PowerPivot: Detecting duplicates

Generally – when you are given an unexplored dataset – your first task should be to identify some of its properties in order to make a more informed decision regarding the accuracy of reporting you can achieve with it.  Many Business Intelligence professionals have robust tools at their disposal that can be used to deeply explore a dataset; however, data analysts can use the inherent capabilities of PowerPivot to get the job done.

One of the most common data profiling assessments is identifying duplicate values.  This is important as we want to decide which sets of data are good candidates for generating lookup tables in our model, and it is particularly relevant when querying transactional sources that do not have clearly defined dimensions.

For example, say you want to load a product table into your model. In this case you are actually querying the enterprise data warehouse, and the table you are using stores data for a Type 2 slowly changing dimension.  As such, it can legitimately contain duplicate product names: it is storing the product attributes as they have changed over time:

[image: product table with duplicate product names]

Notice the highlighted product (AWC Logo Cap) has three entries in that table.  Over time, the product had different standard costs associated with it, and the data warehousing team wanted to keep all history associated with past records.

If we intend, however, to mesh this data with another dataset that somebody sent to us (which did not come from the data warehouse and only has the product name on it, not the product key), we would want to know:

a) Are there duplicates in this table? We want to ensure this table can be used to generate a valid lookup table in the model, and PowerPivot requires unique values in the key column of the lookup table when establishing relationships.

b) If there are duplicate values, are they a data quality problem or do they exist for legitimate reasons?

In our example, we know there are no data quality issues, and we could easily generate a valid lookup out of this table simply by going to the Table Import Wizard and selecting the [EnglishProductName] field while specifying a ‘Grouped by’ aggregation (using the same technique detailed on this blog entry), or by issuing a T-SQL DISTINCT clause in the underlying query.
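For the DISTINCT route, the underlying query could look something like the sketch below (the dbo schema name is an assumption about how the warehouse exposes the table):

SELECT DISTINCT [EnglishProductName]
FROM dbo.DimProduct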

If the duplicates are due to data quality issues, we could instead ask the data source owner to fix them directly, before the data even makes it into PowerPivot.

So now that we have determined that it is worthwhile identifying which records are duplicate, we need to know how to do this quickly using DAX.

For the dataset above, this can be done using the following DAX calculation in a new column:

=CALCULATE( COUNTROWS( DimProduct ),
    ALLEXCEPT( DimProduct, DimProduct[EnglishProductName] ) )

In other words, we want to count all rows in the table by establishing a filter context in which only the column we are testing remains:

=CALCULATE( COUNTROWS( 'Table' ),
    ALLEXCEPT( 'Table', 'Table'[Column Tested for Duplicates] ) )

[image: duplicate-count calculated column]

It is then easy to filter that column to detect only the products with more than one entry in the table. Notice how the column filter also quickly tells us there are products with 2 entries and others with 3 entries:

[image: column filter showing entries with counts of 2 and 3]
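If a simple flag is easier to filter on than the raw counts, the same expression can be wrapped in a comparison; for example, a column named something like [Is Duplicate] (the name is only for illustration) would return TRUE for every row whose product name appears more than once:

=CALCULATE( COUNTROWS( DimProduct ),
    ALLEXCEPT( DimProduct, DimProduct[EnglishProductName] ) ) > 1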

I used DAX's ALLEXCEPT function, as we only wanted to test for duplicates over one product attribute ([EnglishProductName]).  Had we been detecting duplicates on a one-column table (one with only the column we are testing), we could have avoided this:

=CALCULATE( COUNTROWS( 'DimProduct' ) )

[image: row counts on the one-column table]

This is because on a one-column table, there is no need to exclude other attributes that can lead to duplicates.  On the other hand, when other columns exist, there is always the risk of having attributes with different values for the same dimension member (products, in this case).  An example would be the one mentioned above in which the product standard cost has changed over time.  As such, we must remove all those attributes from the filter context when we want to narrow our investigation to only duplicate names.
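If all you need is a single number summarizing the problem rather than a per-row count, a quick measure sketch along these lines subtracts the number of distinct product names from the total row count, giving the number of ‘extra’ rows:

=COUNTROWS( DimProduct ) - COUNTROWS( DISTINCT( DimProduct[EnglishProductName] ) )

A result of zero would mean the [EnglishProductName] column is already unique and could safely serve as the key of a lookup table.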

Once you have identified the problematic records, it will be easier to communicate the issue to the data source owner and more rapidly identify a course of action.  Even if you decide to use this table as a lookup in your data model, you will now be aware of potential future issues that you can proactively plan for.