Creating lookup tables in PowerPivot using SQL queries

On the last post I showed how to model for a scenario in which two datasets must be merged.  A custom lookup table was created using the ‘remove duplicates’ functionality in Excel.  In this post I will walk through the same scenario, but this time using the SQL query wizard in PowerPivot to achieve the same result.  This post is intended for power users who will benefit from creating robust PowerPivot data models.

First of all, lets clarify why this is useful: in order to create a robust PowerPivot report, it is sometimes necessary to create lookup tables to allow PowerPivot relationship rules to work seamlessly (as noted on this post).   In some cases you may want to prototype this manually (as described on my prior post).  However, once you have a final model you want to re-use, you may want to consider using SQL queries instead of manually removing duplicates, in order to benefit from a fully dynamic PowerPivot model that automatically updates lookup tables upon refresh.

SQL query example

I have used the same extract table samples as in the last post.  In this scenario, we don’t want to generate a lookup ‘reference’ table manually, but instead relay on the underlying relational data source.

The modeling solution discussed on the last post is still valid as well: we must create two new lookup tables:  Products and Dates.

Lets start with Products.   Suppose I have already loaded both datasets from a SQL Server source. By clicking the ‘reuse existing connections’ option in the ribbon under the ‘design’ tab, we can re-connect to the source and use PowerPivot ‘Table Import Wizard’ to create the lookup table.   Instead of selecting from a list of tables, I can use the option ‘Write a query that will specify the data to import.’ Even though I am telling PowerPivot I will write a query, there will be no need to write code at all.

Select the ‘design’ button:

image

At this point the SQL wizard opens.  Here are the steps:

1.  Select the Product ID field from Table 1 and the Category from Table 2:

image

2. Click on the ‘Group and Aggregate’ button on the top right corner

image

3. Expand the relationships tab

image

4. Deselect “Auto Detect” on that same tab.  Click on the ‘add relationship’ button:

image

5. Click on the Left Table, specify Table 1.  Then click on the Right Table and specify Table 2.  Change the join type to ‘Full Outer’ (this is needed when creating lookups that span across multiple tables on the same underlying data source. Here is a good explanation of what joins do). 

Finally, select the Join Fields by double clicking on the field and selecting Product ID (on Table 1) equals to Product ID (on Table 2).  Here is how the window should look like:

image

What we have done is select all the Product IDs from Table 1 (as that table contained a comprehensive list of products), and all the Categories from Table 2 (as there were no category columns on the first table).  The ‘grouped by’ option is needed to generate a distinct list of product/category combinations (remove duplicates), and the join was necessary to grab data from both tables.

Once finished, clicking will show you the query PowerPivot has written for you.  At this point, you can rename it so it will have a more user friendly name in your data model:

image

At this point, I can now join the two initial datasets with this lookup table:

image

Again, it was necessary to do this in order to allow filtering of both initial datasets by a common category.  Without this lookup table, it would not be possible (or really difficult) to report on both sets of data in a consolidated view.

Now lets go for the Dates lookup table.  In this case there is an small extra challenge: neither table has the complete list of dates.  So after we configure the wizard using the same technique above, we get the following:

image

which results on this in-memory table:

image

With DAX, we can easily overcome this.  By creating a new calculated column, we can now have the complete list of unique days in only one place:

=IF([Table_1 Date] = BLANK(), [Table_2 Date], [Table_1 Date])

image

Further calculated columns can add more calendar levels for analysis, like month, quarter, year and so on.

image

The data model (in the PowerPivot design window) now should look like this (once the relationships have been created):

image

To be able to slice “Total Sales Amount” by Product or Date, we only need to do one extra step – create a DAX measure to add the Amounts from both extracts (Table_1 and Table_2, just was we did in the last blog post):

clip_image012_thumb1

With this, you can now slice the consolidated amount (from both extracts) by any relevant Product or Date category with the added advantage that PowerPivot will update the lookup tables once a data refresh has been executed.  In the end, not only the DAX calculations will be dynamic (based on query and filter context), but the data model will by dynamic based on the newly imported data.

Merging data in PowerPivot

Update 9/23/13:  Check an example using Power Query to deal with the same issue described on this article: https://javierguillen.wordpress.com/2013/09/23/target-well-formed-data-models-when-using-power-query/

One of the most common requests I have seen from data analysts working with PowerPivot is how to merge two or more datasets. For example, you are given the following two sales extracts and are asked then to generate a YTD report based on their composite data

Extract 1 (Table 1):

image

Extract 2 data (Table 2):

image

If we want to report on total amount for the year, we must combine these two files. How do we do this?

Somebody new to PowerPivot may be tempted to add a relationship between those two tables on the common key column (Product ID). However, when attempting this, PowerPivot will return an error:

image

PowerPivot complains that there are duplicate values, which indeed there are (on both tables, Product ID has multiple values that are the same). Here is one clue: if you examine the relationship window closely you will notice that one table is a ‘lookup’ whereas the other one is not:

image

For this lookup table, we want unique values, as there is no point on having a lookup table that would resolve in more than one value. Why is that? Imagine you have the following tables:

image

Notice how there was a sales amount for 1/1/2012 for the city of Bogota. If we wanted to report on the sales figures by Country, which country should we use from the look up table? (there are two different matches for the same city name so PowerPivot couldn’t logically resolve this relationship).

So the rule for PowerPivot lookup tables is that the field in which we join must have unique values (no duplicates allowed).

Let’s go back to our example. How are we going to merge the tables if we have no valid ‘lookup’? Here is where the world of ‘self-service BI’ comes to life: if we don’t have the table we need, we can create our own!  This effort is what is referred to as data modeling, as you are organizing the structure of the data as it resides inside PowerPivot.

I will walk through the example in two different scenarios, fixing the data right in the spreadsheet or relying on an underlying relational database (you can find the database  scenario on the following blog entry available here).

Worksheet only example

Ever wonder if there was a way to give Excel a list of values containing duplicates and have it remove those annoying ‘dups’ automatically? Well, such a button exists in the Excel ribbon indeed and is called – not mysteriously – ‘Remove duplicates’:

image

Let’s use it to create our lookup. As we want to join by Product ID, our lookup will be a Product table. Notice also we have Product Categories in one of the tables, and it is always a good practice to encapsulate all related attributes in one logical grouping, hence those category names should go into this new table as well.

Copy and paste the Product ID and Category from ‘Table 2’ somewhere into the worksheet (including headers). Then copy and paste the Product ID values from ‘Table 1’ below the last value. Now, highlight the values and click on the ‘remove duplicates’ option in the ribbon. Select only the ‘Product ID’ column:

clip_image006

You can now import the resulting table into the PowerPivot model:

image

For usability, I created a calculated column that shows the value “uncategorized” whenever the product didn’t belong to a category, and then I hid the imported ‘Category’ column:

clip_image008

Now we can join both initial tables to our custom lookup and PowerPivot will be happy. If you notice, we have gotten closer to the creation of a star schema. This is interesting to notice, as advanced data analysts will quite literally engage in testing data models as they produce reports — data models than I.T. can then qualify for upgrade to full blown SSAS BISM Tabular models (with the added advantage that they will be already ‘tested’ by subject matter experts).

Anyhow, here is how our new data model looks like in the PowerPivot design window:

clip_image010

We can now test our data model in a pivot table:

clip_image011

Notice we still have two ‘Amount’ fields. We must consolidate them into one. That is pretty simple to do, by creating a new DAX measure:

clip_image012

Now we can remove the other two measures and leave only this [Total Amount] measure. The resulting pivot table will have successfully merged the amounts from the two monthly extracts:

clip_image013

We can now execute the same logic and create a second lookup table that will allow us to slice by date or month. Following the same technique, we arrive at this data model:

clip_image015

I added a DAX calculated column to the new lookup which shows the month name for each date and hid the initial extract tables in order to ‘clean’ the data model and increase usability. We can now slice both tables easily and create an interesting report:

clip_image016

I added a DAX year to date measure:

YTD:=TOTALYTD( [Total Amount], ‘Date'[Date] )

plus a pivot table and a pivot chart. With this, it is now easy to see:

*  the best performer (category “C.A.” on Jan),

*  the worst (non-categorized products on Feb) and

*  category “C.B.” is growing at a pace that is outperforming the others.

Very powerful analysis is easy to achieve when understanding the rules under which the PowerPivot engine operates!

Enabling Tableau to read PowerPivot model updates

One of those interesting opportunities that opens up when using PowerPivot in conjunction with SharePoint 2010 is that of leveraging the PowerPivot cube-like interface of its inherent data model in order to extend the analytical arsenal of an organization.

Tableau is a great tool that can extend the reach of PowerPivot deployments due to its advanced visualization capabilities.

Its typical for data analysts to go over many iterations of a PowerPivot report (model) before they consider it good enough to share it with others.  At this point, though, Tableau 7.0 does not by default grab the latest and greatest version of the PowerPivot analytical database, the one with the must current data model and DAX calculations.

Lets walk through a typical scenario.  A data analyst creates a PowerPivot report and publishes it to SharePoint. He then wants to create additional visual analysis using Tableau against the data model that was loaded into PowerPivot.

A typical PowerPivot report URL location would look like this:

http://myServer/PowerPivot Gallery/MyPowerPivot.xlsx

image

In Tableau, he can then use this URL to connect to the cube-like interface of the underlying PowerPivot data model.  To do this, he uses the Microsoft Analysis Server option, and instead of typing a server name simply he uses the URL location of the PowerPivot workbook:

image

To be clear, you are not connecting to the Excel pivot table, but to the underlying PowerPivot model.  This normally contains more data and metadata than the actual Excel report rendered in the PowerPivot gallery.  At this point, he is able to create interesting Tableau analytics using PowerPivot as the source:

image

Now he wants to add additional DAX measures to extend the analysis. He adds the following expression to PowerPivot:

% of Total Sales:=[Sum of SalesAmount] /

CALCULATE([Sum of SalesAmount], ALL(DimSalesTerritory) )

And republishes to SharePoint.  By interacting with the newly published PowerPivot report by either selecting a new filter, invoking the workbook through a client via the SharePoint URL or simply by scheduling a data refresh, an updated version of the PowerPivot underlying model is stored in the dedicated instance of Analysis Services that SharePoint uses.

At this point, If you imagine being the data analyst in question, you may now attempt to re-open the Tableau workbook but you will get an error message:

image

And the report is wiped out.  Hmm not good!   Interestingly, other clients like SSRS, Report Builder and Excel can get the latest version of the PowerPivot model without an issue.  So how can we enable Tableau to leverage the PowerPivot model changes seamlessly?

The solution is fairly simple, though not very intuitive.  Prior to doing any changes to the PowerPivot model, we must save the Tableau connection:

image

This process creates an actual file that we can then edit in notepad.  We must remove the database name from the connection file (highlighted below) and re-save:

image

This is necessary as Tableau is creating a connection to a specific version of the PowerPivot model which is not valid anymore.  By removing the database name, SharePoint is able to point Tableau to the newest PowerPivot model available, which will then enable the report to work without errors and with all the changes to the underlying model.  After implementing the solution, the data analyst can now use this saved connection to create Tableau workbooks that will automatically update the metadata to match that one of the newest PowerPivot model.

image

In the scenario discussed, he would see the new DAX measure in the Tableau measures pane and use it as part of the visualization:

image

It is possible to save these connection files to Tableau server and share them with others, so newly created views will benefit from automatic synching with the most updated PowerPivot hosted models.

image

Keep in mind that one of the great benefits of connecting to PowerPivot workbooks hosted in SharePoint is that you get additional features like automatic data refresh and security, along with the ability to centralize one model that can serve many reports across multiple platforms and tools.

In my opinion, this technique allows data analysis to benefit from the best of both worlds when conducting rapid iteration self-service BI.

“Mark as Date Table”: When not to use it

In PowerPivot 1 (2008 R2 version), you had to specify an extra parameter in your Time Intelligence calculations when the relationship between the fact table and the date table was based on smart key integers:

=TOTALYTD(

SUM( FactResellerSales[SalesAmount] ) ,

DimDate[FullDateAlternateKey] ,

ALL(DimDate)

)

If you didn’t do that, the output would incorrectly show the figures only for the current month in context, instead of the YTD amount (or QTD, etc).  The only other way to avoid this behavior was creating the relationship between the tables using date types instead of integers, which is not generally the case when reading from relational databases.

In PowerPivot 2 (2012 version), a new option was added to the PowerPivot Window ribbon to help avoid this confusion and add consistency to the output of Time Intelligence calculations.   This button is called ‘Mark as Date Table’ and when using it, you can forget about the need to specify ALL in your time-driven calculations as it is not necessary.

image

image

Interestingly, though, the fix for one problem can be the genesis of another one.  Imagine for a moment we want to create a report in which we are displaying monthly and daily values, yet we want to clear the filter context at the daily level.  In other words, we want the daily values to display the monthly amount.

After we have configured the table as a ‘Date Table’, we are unable to get the proper value. As an example, lets compare these two DAX expressions. They clear the filter context on either the smart key or the date type on the calendar table:

Measure using Integer Type:=CALCULATE( SUM( FactResellerSales[OrderQuantity] ), ALL(DimDate[DateKey] ) )

and

Measure using Date Type:=CALCULATE( SUM( FactResellerSales[OrderQuantity] ), ALL(DimDate[FullDateAlternateKey] ) )

What we want to test is the ability to get monthly values at the day level.  The following matrix shows success/failure on the desired output:

  Measure using Integer Type Measure using Date Type
Relationship based on Integers (smart keys) Fails: filter context not affected Fails:  filter context removed completely
Relationship based on Dates Fails: filter context not affected Fails:  filter context removed completely

Interestingly, though, if we remove the demarcation as ‘Date Table’, we get the following:

  Measure using Integer Type Measure using Date Type
Relationship based on Integers (smart keys) Fails: filter context not affected Succeeds: Filter context set properly
Relationship based on Dates Fails: filter context not affected Fails:  filter context removed completely

One combination succeeds, and allows us to show the monthly values at the daily level when using the date type column on the row labels.  Unchecking the ‘mark as date table’ option of the calendar table and using smart keys to enable the relationships, we can then clear the filter context by using ALL over the date type column (FullDateAlternateKey):

image

If you need to absolutely go this way and remove the demarcation of date tables to enable this functionality then you will have to go back to using ALL as an extra parameter to time intelligence functions.  Although a bit annoying, at least is good to have the work around handy.