Target well-formed Data Models when using Power Query

Even though Power Query gives us the ability to generate impressive data transformations, it is up to the model author to define a well-formed Power Pivot Data Model that respects relationship principles. In the past, I blogged about three options to ‘merge’ data using Power Pivot:

  • Issuing a SQL-like query to Excel data sources (here)
  • Issuing a SQL query to relational sources (here)
  • Using Excel inherent capabilities (here)

All of the options above required applying Power Pivot data-modeling principles, which in essence meant creating two additional ‘lookup’ (dimension) tables to connect the existing fact data sets.

Using Power Query, the logic is similar, although we use the M language. In this blog entry, I will also be using the same small dataset which, even though it is tiny, exemplifies the typical issues analysts encounter when attempting to merge data (duplicate key records, absent lookup tables, missing values and so on).

The first step is to generate two queries and load them directly into the model:

[Images: the two source queries loaded into the Data Model]
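
To give an idea of the shape of these queries, here is a minimal sketch of the first one, assuming the source data lives in Excel tables named Table1 and Table2 (the table name and column types are assumptions based on the data described below):

let
    // Pull the Excel table into Power Query; "Table1" is a hypothetical table name
    Source = Excel.CurrentWorkbook(){[Name = "Table1"]}[Content],
    // Assign explicit types so the model receives clean columns
    Typed =
        Table.TransformColumnTypes(
            Source,
            { { "Product ID", type text }, { "Amount", type number }, { "Date", type datetime } }
        )
in
    Typed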

To properly establish relationships, we must now extract two lookup tables from the given data:

  • A date lookup containing all dates across both datasets
  • A product lookup containing all products across both tables, assigning an unknown value to missing categories

LOOKUP # 1 (Date Lookup)

The first one can be done with the following M query:

let
    // Cast the dates to text first: Table.FromList expects a list of text values
    Table1Dates =
        Table.FromList(
            Table.Column(
                Table.TransformColumnTypes( Table1, { "Date", type text } ), "Date"
            )
        ),
    // Note the trailing space in "Date ": that is the actual column name in Table2
    Table2Dates =
        Table.FromList(
            Table.Column(
                Table.TransformColumnTypes( Table2, { "Date ", type text } ), "Date "
            )
        ),
    // Stack both lists of dates and keep only distinct values
    CombinedTables = Table.Combine( { Table1Dates, Table2Dates } ),
    DuplicatesRemoved = Table.Distinct( CombinedTables ),
    // Cast back to a date/time type and rename the generated column to a proper key name
    ChangedType = Table.TransformColumnTypes( DuplicatesRemoved, { { "Column1", type datetime } } ),
    RenamedColumns = Table.RenameColumns( ChangedType, { { "Column1", "DateKey" } } )
in
    RenamedColumns

Once in the model, we can add a DAX calculated column to extract the month name we want to use for slicing and dicing.
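
The exact formula only appears in the screenshot below; here is a minimal sketch of what it could look like, assuming the lookup query was named DateLookup (the column name and format string are assumptions):

// hypothetical names; FORMAT with "MMMM" returns the full month name
DateLookup[MonthName]=
FORMAT( DateLookup[DateKey], "MMMM" )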

[Image: the month-name calculated column in the model]

LOOKUP # 2 (Product Lookup)

The second lookup table can be generated using Power Query with the following script:

let
    // Full outer join on Product ID; rename Table2's key first so both
    // key columns survive the join side by side
    Source =
        Table.Join(
            Table1, { "Product ID" },
            Table.RenameColumns(
                Table2, { "Product ID", "Product ID2" }
            ), { "Product ID2" },
            JoinKind.FullOuter
        ),
    // Drop the fact columns (again, note the trailing spaces in Table2's column names)
    RemovedColumns = Table.RemoveColumns( Source, { "Amount", "Date", "Amount ", "Date " } ),
    DuplicatesRemoved = Table.Distinct( RemovedColumns ),
    // Coalesce the two keys: fall back to Product ID2 when Product ID is null
    ConsolidateProductKey =
        Table.AddColumn(
            DuplicatesRemoved,
            "ProductKey",
            each if [Product ID] = null then [Product ID2] else [Product ID]
        ),
    RemoveOriginalProductKeys = Table.RemoveColumns( ConsolidateProductKey, { "Product ID", "Product ID2" } ),
    // Products present only in the dataset lacking a Category column get a default value
    AssignUnknownCategory =
        Table.AddColumn(
            RemoveOriginalProductKeys,
            "Custom",
            each if [Category] = null then "Uncategorized" else [Category]
        ),
    FinalTable = Table.RemoveColumns( AssignUnknownCategory, { "Category" } )
in
    FinalTable

With the following result:

[Image: the resulting product lookup table]

With this, we can finalize the Power Pivot Data Model using appropriate lookups and one final DAX measure:

[Image: the completed Data Model and final DAX measure]
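
The measure itself is only visible in the screenshot; a sketch of what it could be, assuming the goal is one total spanning both fact tables (the measure name and column references are assumptions, and Table2's amount column keeps the trailing space it has in the source):

// hypothetical sketch: a combined total across both fact tables
Total Amount :=
SUM( Table1[Amount] ) + SUM( Table2[Amount ] )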

The resulting model is well-formed and can slice two tables that by themselves could not be reconciled directly in the Data Model:

[Image: a PivotTable slicing both datasets through the shared lookup tables]

Is this the only way?

One can argue that, rather than developing a well-formed data model including dimension tables, it may be simpler to append one table to the other and have the Data Model consume it as a single table. In fact, generating a single table off the data shown is quite possible using Power Query.
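
For reference, a minimal sketch of that single-table append, where the renames that align the two schemas are assumptions:

let
    // Align Table2's column names (trailing spaces removed) with Table1's
    Table2Renamed = Table.RenameColumns( Table2, { { "Amount ", "Amount" }, { "Date ", "Date" } } ),
    // Stack the two tables; rows from Table1 get null in the Category column
    Appended = Table.Combine( { Table1, Table2Renamed } )
in
    Appended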

However, single-table data models have challenges that make their use less than ideal:

  • Calculations must rely on a single base level of granularity, which may not properly reflect the business model and may require workarounds and poorly performing computations.
  • Given the limitation outlined above, it may be necessary to create multiple copies of the data, each targeting a different granularity, with the unwanted consequence of having multiple versions of the same metadata.
  • Reporting tools using parameters must execute SQL DISTINCT-like expressions over dimension data to remove duplicates in parameter lists, most likely affecting performance and report usability.
  • Advanced modeling scenarios like many-to-many cannot be represented.

Finally, on closer examination, even appending won’t do the trick in cases like the one discussed, as it fails to properly consolidate dimensions. Using the standard appending functionality on the data used here, Power Query will leave two designations for Product A: one with a category assigned to it, and one without.

[Image: the appended table showing two entries for Product A, one with a category and one without]

An assumption could be that Product A *always* belongs to the same category ("C.A"). Data Analysts, as Subject Matter Experts, can confirm whether such assumptions are correct. In the example above we assumed it was, so further modeling and consolidation were needed. The problem with the source files was that one of the columns (Category, in this case) was omitted from Dataset #1. We ‘consolidated’ the product definitions via the M language, issuing a full outer join over the two datasets.

With this in mind, be careful with your approach to modeling Power BI solutions. Simplicity in Data Model development is important, but oversimplifying can harm more than help.

You can access the demo workbook here: PQ_and_data_modeling.xlsx 

Update 9/24/13: Added more details to the explanation of why appending tables should be avoided in cases like the one described.


DAX context propagation, inactive relationships and calculated columns

Marco Russo recently wrote an excellent blog post detailing the options you face when retrieving values from one table into another via DAX’s USERELATIONSHIP function. I found his post particularly interesting given that I had written a post about a similar subject some time ago, in which I reached some conclusions that I want to revisit.

One of the solutions he describes (although he doesn’t recommend it) uses the following formula signature:

CALCULATE (
    CALCULATE (
        VALUES ( <value on related column to retrieve> ), 
        <many-side table> 
    ),
    USERELATIONSHIP ( <one-side key>, <many-side key> ),
    ALL ( <one-side table> )
)

His recommended approach is instead to use LOOKUPVALUE. However, I found the calculation above interesting and decided to break it down to understand it a bit better. In doing so, I ran into areas of DAX that may be less than intuitive, yet worthy of a blog post.

Breaking it down

Here is the data model we use, which includes inactive relationships:

[Image: the data model, with FactInternetSales related to DimDate through an active (OrderDateKey) and an inactive (ShipDateKey) relationship]

Step 1: To test the behavior of the innermost CALCULATE, it is worth beginning from a simpler starting point:

FactInternetSales[CalculatedColumn1]=
COUNTROWS( DimDate )

[Image: the calculated column showing the total DimDate row count on every row]

Clearly, the number reflects the total count of records in the date table regardless of the row in context. This in itself can be non-intuitive for people starting to use DAX, as by default calculated columns evaluate under row context. However, aggregate functions (SUM, MIN, COUNTROWS, etc.), even in calculated columns, will by default operate under filter context: as there are no current filters in the filter context, we get the total count of rows across all date values.

Step 2: Context Transition

FactInternetSales[CalculatedColumn1]=
CALCULATE( COUNTROWS( DimDate ) )

[Image: the calculated column returning 1 on every row]

Row context gets transformed into filter context, and the only surviving row in the date table is the one that follows the active relationship in the data model.

Step 3: Context propagates for ALL rows in the fact table at once

FactInternetSales[CalculatedColumn1]=
CALCULATE( COUNTROWS( DimDate ), FactInternetSales )

Internally, DAX generates an extended table using left outer join logic to propagate context. Given that the setFilter parameter of the CALCULATE function operates on a filter context including all rows from the fact table, the resulting output is the total count of rows in the date table that have registered internet sales. Non-intuitively, context transition does not occur on a per-row basis, even though we use CALCULATE.

[Image: the calculated column showing the count of dates with internet sales on every row]

Step 3a (test) – Context propagates through a single column, but still under the active relationship:

FactInternetSales[CalculatedColumn1]=
CALCULATE( VALUES( DimDate[FullDateAlternateKey] ), FactInternetSales[ShipDateKey]  )

One may be inclined to think that using the key column of the inactive relationship as the setFilter parameter of CALCULATE will force context propagation to retrieve the value used by the inactive relationship. However, this is not the case: even though DAX retrieves only one value, it is still the one associated with the active relationship (OrderDate), given the internal extended table was already resolved.

[Image: the calculated column returning the date associated with the active relationship (order date)]

Step 3b (test) – CALCULATETABLE restores standard context transition behavior

FactInternetSales[CalculatedColumn1]=
CALCULATE( COUNTROWS( DimDate ), CALCULATETABLE(FactInternetSales ) )

[Image: the resulting calculated column values]

Wrapping CALCULATETABLE around the fact table used as the setFilter parameter allows row context transition to happen again, using the active relationship.

Step 4 – An outer CALCULATE propagates context using the inactive relationship, yet it collides with the current context transition using the active relationship

FactInternetSales[CalculatedColumn1]=
CALCULATE(
    CALCULATE(
        COUNTROWS( DimDate ), 
        FactInternetSales ),
    USERELATIONSHIP( DimDate[DateKey], FactInternetSales[ShipDateKey] )
)

[Image: the calculated column returning no values]

If we take a peek at the DAX query plan, we understand the reason there is no output here. A Profiler trace shows two VertiPaq storage engine scans using different join conditions:

SELECT
[DimDate].[DateKey], […]
FROM [FactInternetSales]
    LEFT OUTER JOIN [DimDate] ON [FactInternetSales].[OrderDateKey]=[DimDate].[DateKey]

and

SELECT
[…]
FROM [FactInternetSales]
    LEFT OUTER JOIN [DimDate] ON [FactInternetSales].[ShipDateKey]=[DimDate].[DateKey]
WHERE
    ([DimDate].[DateKey], [DimDate].[FullDateAlternateKey], […]) IN {(20080421, 39559.000000, …[60398 total tuples, not all displayed]}

When the formula engine attempts to tie the results of the two scans together, the resulting context is empty: no row has the same ship date and order date, making the two scan results incompatible.

Step 5 – Clearing the context on the active relationship key

FactInternetSales[CalculatedColumn1]=
CALCULATE(
    CALCULATE(
        COUNTROWS( DimDate ), 
        FactInternetSales ),
    USERELATIONSHIP( DimDate[DateKey], FactInternetSales[ShipDateKey] ),
    ALL( DimDate )
)

[Image: the calculated column now returning values resolved through the inactive (ship date) relationship]

Through the use of ALL, this time we force the query plan to remove the scan that uses a left outer join condition on the Order Date Key (the active relationship). As the only other join condition uses the Ship Date Key, it is the one that survives, and the value returned is in fact resolved through the inactive relationship (matching ship date).

Alternate Approaches for non-exact matches

As mentioned by Marco, use LOOKUPVALUE if your intent is to retrieve one value based on an exact matching condition. In some cases, though, you may want to use approximate matches, particularly when creating calculated columns as intermediate steps to measures, a common practice described in my prior post. In those cases, the approach using CALCULATE is still useful.
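
For reference, the exact-match alternative with LOOKUPVALUE would look roughly like this (a sketch using the column names of the model above, not necessarily Marco's exact formula):

// hypothetical column name; returns the date matched through the ship date key
FactInternetSales[ShipDate]=
LOOKUPVALUE(
    DimDate[FullDateAlternateKey],
    DimDate[DateKey], FactInternetSales[ShipDateKey]
)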

Two alternate solutions using a slightly more legible approach could be:

1) Using CALCULATETABLE

FactInternetSales[CalculatedColumn1]=
CALCULATE(
    VALUES( DimDate[FullDateAlternateKey] ),
    CALCULATETABLE(
        FactInternetSales,
        USERELATIONSHIP( DimDate[DateKey], FactInternetSales[ShipDateKey] ),
        ALL( DimDate ),
        DimDate[CalendarYear] > 2007
    )
)

By moving the USERELATIONSHIP into an inner CALCULATETABLE expression, it becomes immediately apparent that we are manufacturing a table context of our own choosing and setting it as a filter parameter of the outer CALCULATE. The third parameter of CALCULATETABLE uses a condition not based on exact matches.

2) Using FILTER

FactInternetSales[CalculatedColumn1]=
CALCULATE(
    VALUES( DimDate[FullDateAlternateKey] ),
    FILTER(
        DimDate,
        FactInternetSales[ShipDateKey] = DimDate[DateKey] &&
        DimDate[CalendarYear] > 2007
    )
)

Interestingly, the calculation using FILTER appears to use a very efficient query plan, with no extended tables generated (which would show up as VertiPaq SE scans using left outer joins). Given that the setFilter parameter uses DimDate instead of FactInternetSales, there is no need to build extended tables to propagate context, and the three scans are direct queries to tables in the data model:

SELECT
[FactInternetSales].[ShipDateKey]
FROM [FactInternetSales]

+

SELECT
[DimDate].[DateKey], [DimDate].[CalendarYear]
FROM [DimDate]

+

SELECT
[DimDate].[DateKey], [DimDate].[FullDateAlternateKey]
FROM [DimDate]
WHERE
    [DimDate].[DateKey] IN (20080105, 20080219, …[220 total values, not all displayed])

I am curious as to why the query plan did not register, for this last expression, the FILTER condition of ‘greater than’ (as it does when using the CALCULATETABLE expression). However, given the simpler query plan, I wonder whether the FILTER method might perform better than the CALCULATE method using USERELATIONSHIP, even in cases where only the exact match condition is needed.