Nurturing Model Growth: Prototyping ETL using Excel sources

The BISM Tabular framework’s extensibility does not only imply a move from personal PowerPivot models to corporate Tabular models but also an ability to ‘branch out’ processes into more sophisticated layers of ETL and/or reporting.

Over the last months I have witnessed PowerPivot models that ‘take a life of their own’, and through their growing popularity become more sophisticated as they evolve. As their data load facilities develop further, they offer the ability to prototype ETL over a target in-memory schema, while testing assumptions regarding the appropriate data model to use.

I call these ‘seed models’ ::  that is, individual PowerPivot models that have potential for organic growth beyond its original intent – all the way into enterprise Semantic Models.  As they grow, they themselves can become the corporate standard or instead be assimilated into pre-existing Tabular models. 

image

The process of nurturing model growth (from individual to enterprise level) is a fascinating task.  This blog post explores one aspect of model expansion – from Excel files (a very common source for data analysts) to SSIS based ETL. 

It is interesting to note this expansion happens at the pace that makes most sense for the organization given its current BI maturity, and at every stage of the process there is tangible benefits to the business.

GROWTH STAGE #1:  SELF SERVICE BI BASED ON ‘DATA DUMPS’

Here you find Self-Service BI scenarios in which analysts utilize extracts from multiple systems to generate their consolidated reporting.

Assume the following datasets:

ProductList.xlsx

image

 

ProductSales.xlsx

image

 

As you can see, the second workbook (ProductSales.xlsx) contains data that can be mapped back to the first one (ProductList.xlsx).  However, it is not just a fact sales table, it also contains dimension data (Product Name and Model). 

In the PowerPivot/Tabular world (just as in traditional data warehousing) it is recommended to structure data conforming logical entities. By executing a SQL query over these two Excel files it is possible to generate the necessary lookups to develop a user friendly dimensional model.

To do this, it is necessary to create a connection to the files outside of the PowerPivot window (this enables us to execute SQL queries over Excel files).

Go to Data –> Connections and click ‘Add to the Data Model’. When the next window appears, click on ‘Browse for More’

image

Find ProductList.xlsx and load it into the model.  Now follow the process again for the ProductSales.xlsx file, but this time do not add it to the data model, simply add the connection.

image

In the PowerPivot window, go to the Design tab and select ‘Table Properties’.  You will notice a prompt asking for a password, which you can simply ignore by clicking OK:

image

This prompt is not ideal as it can get on the way of automated refreshes.  To remove it, it is necessary to change the connection string,

From:

Provider=Microsoft.ACE.OLEDB.12.0;User ID=Admin;Data Source=C:\blog\ProductList.xlsx;Mode=Share Deny Write;Extended Properties="HDR=YES;";Jet OLEDB:System database=;Jet OLEDB:Registry Path=;Jet OLEDB:Engine Type=37;Jet OLEDB:Database Locking Mode=0;Jet OLEDB:Global Partial Bulk Ops=2;Jet OLEDB:Global Bulk Transactions=1;Jet OLEDB:New Database Password=;Jet OLEDB:Create System Database=False;Jet OLEDB:Encrypt Database=False;Jet OLEDB:Don’t Copy Locale on Compact=False;Jet OLEDB:Compact Without Replica Repair=False;Jet OLEDB:SFP=False;Jet OLEDB:Support Complex Data=False;Jet OLEDB:Bypass UserInfo Validation=False;Jet OLEDB:Limited DB Caching=False;Jet OLEDB:Bypass ChoiceField Validation=False

 

To simply:

Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\blog\ProductList.xlsx;Extended Properties="Excel 12.0 Xml;HDR=YES"

Remember, in Excel 2013, once the connection has been modified in the PowerPivot window it is not possible to edit it (in Excel 2010 this is possible).  In other words, be sure to implement the change prior to doing any PowerPivot model work.

Warning:  Automated data refresh using Excel sources in PowerPivot for SharePoint is not currently supported (see here).  However, the refresh will work and under your discretion of ensuring controlled access to the source files it is possible to generate a solution that provides much value to the business. 

Now click on ‘Switch to Query Editor’:

image

By joining both files, we are able to mimic the type of queries we commonly use as part of ETL processes to define target dimension tables.  This process involve defining a ‘driving’ table, commonly associated with the master list of dimension attributes, which is left joined to other tables which may contain additional attributes (a common pattern in ETL development).

The SQL query to use is the following:

SELECT
[ProductList$].*,
[ProductSales$].[Product Name],
[ProductSales$].[Product Model]  
FROM ([ProductList$]
    LEFT OUTER JOIN `C:\blog\ProductSales.xlsx`.[ProductSales$] 
ON
    [ProductList$].[Product ID] = [ProductSales$].[Product ID])
WHERE
    (
    [ProductSales$].[As of Date] =
        (SELECT MAX([ProductSales$].[As of Date]) FROM `C:\blog\ProductSales.xlsx`.[ProductSales$] )
    )

The use of this query pattern gives the PowerPivot data model an important artifact: a lookup table. Generating valid lookup tables is a big part of Tabular development, as they will keep DAX expressions simple and prevent unexpected context propagation behaviors.

This new lookup table will represent an SCD Type 1 dimension.  I argue this is a good way to start proving business value, in many situations a predecessor to enabling historical tracking.

By loading the ProductSales.xlsx table, we now arrive at the following dimensional schema:

image

generated SCD Type 1 dimensional schema

With the following result:

image

 

GROWTH STAGE #2:  MOVING TO RELATIONAL SOURCES

Once the model is proven popular and useful to the business,  the BI team may decide to take the ETL prototype query and standardize it in a SQL view:

CREATE VIEW [dbo].[v_ProductList] AS
SELECT       
        ProductList.*,
        ProductSales.[Product Name],
        ProductSales.[Product Model]
FROM           
        ProductList LEFT OUTER JOIN
        ProductSales ON ProductList.[Product ID] = ProductSales.[Product ID]
WHERE       
    ProductSales.[As of Date] =
    (
        SELECT MAX(ProductSales.[As of Date])
            FROM ProductSales
    )
GO

Notice the pattern used would be very similar to the one used to create a valid dimension (lookup). In the model, we can now change the connection to point to this view and get rid of the Excel source dependency. To do this, we must change the connection used to populate the tables from a file source to a SQL Server source.

Go to ‘Existing Connections’j, select ProductList.

image

By clicking ‘Edit’, it is now possible to change the connection type by clicking ‘Build’.  When the ‘Data Link Properties’ window opens, change the provider to ‘SQL Server Native Client 11.0’.

image

At this point, you can connect to the SQL Server source and specify the database that contains the extracted datasets.

image

Once the connection is mapped back to SQL Server, we must change the query definition that makes up the ProductList. This can be done in using the ‘Table Properties’ option of the model.

image

Given the maturity of the implementation, it may be advantageous to upgrade the PowerPivot model to BISM Tabular, while also developing more sophisticated ETL.

 

GROWTH STAGE #3:  ENABLING HISTORICAL TRACKING

The business may decide it would be beneficial to enable historical attribute tracking to create more sophisticated reporting.   At this point, we can use a variation of the query we have discussed in order to initialize a target data warehouse dimension “DimProduct”.

SQL Server Integration Services could now fulfill that exact requirement, but producing the more complex transformations to record attribute changes over time. 

 

image

Once the SCD Type 2 transform has been defined, it is important to re-map the Tabular model to use surrogate instead of natural keys as part of the relationship.  This can be simply done in the PowerPivot or Tabular model design interface:

image

generated SCD Type 2 dimensional schema

The result is a model that grew from simple file sources to a more sophisticated environment leveraging appropriate data transformations.

image

The process described is not a theoretical one but is based on real scenarios I have seen evolving without much effort, as each step fuels the project further. In my opinion, this makes up for great “BI”, as nurturing model growth allows us to stay relevant by developing close to the business, and in turn ensures accuracy and strong sponsorship on the projects implemented.

PivotStream and Mariner join forces to deliver PowerPivot solutions

I am excited to share the news that PivotStream, the world’s leading provider of hosted PowerPivot solutions, and Mariner, the company I work for in which I help device PowerPivot solutions and practices, have entered into a partnership agreement to provide services that leverage our complementary skills and expertise.

image

It is my hope this partnership will speed up the propagation of the the PowerPivot story as we help organizations take advantage of the strength behind business-driven data discoveries. Our goal is to help Excel analysts generate critical business impact in decision making, despite tightly I.T. controlled regulations on data production and consumption.

I am looking forward to help take our clients to the next level of business agility !!! :)

Configuring “drill to details” behavior in PowerPivot: Part 1

It has been said there is no way to change the default drill through behavior in PowerPivot.  However, if we look closely we discover that we can actually exercise a bit more flexibility than normally assumed when using this feature.

The first thing to understand is that drill through retrieves data from whichever table the measure has been assigned to.

This is an interesting feature – yet, not very intuitive –:  as it is possible to assign a DAX measure to any table in the model (even disconnected ones) without affecting the output of the measure at all, one can allocate measures to tables specifically designed for drill through.

Lets look at an example using Adventure Works.

image

[Sum of SalesAmount] is an implicit measure.  As such, it was allocated to the table with numeric values (FactResellerSales). Notice this table is on the many-side of the context propagation.  When executing a drill through, we get an ‘exploded’ view of that table showing raw data:

image

Also notice that many technical columns are retrieved, which may not provide any value to the analyst (ResellerKey, for example) .  Instead, lets create a custom table to support a more customized drill through view, with only the columns we are interested in.  I will use the Table Import Wizard instead of writing straight SQL, as I believe this is something analysts will be able to configure on their own without the need to code:

image

By specifying the use of ‘Group and Aggregate’, we will get a list of the distinct product and order date keys in our fact table. As we might not interested in the original granularity as we are simply supporting a drill through action in order to show what dimension members make up a specific sales amount, a list of product and date members is sufficient in this case.

image

We can source the desired columns from the lookup tables by using RELATED. Or, to save memory space or for model usability, you may decide not to bring those columns on the visible lookup tables and only retrieved them from the source to populate the custom drill through table.

All columns should be hidden (do not hide the table, only the columns.  Hiding the table will hide the measure we will assign to it which is not what we want).

An explicit DAX measure is created at this point:

[Sum of Sales]:=SUM( FactResellerSales[SalesAmount] )

and assigned to the ‘drill through sales’ table:

image

As only the measure is visible on this table, the pivot table field list will not show the regular table icon but one indicating the summation sign:

image

When clicking on the cell we clicked before, this time we get a more ‘customized’ drill through:

image
Note the keys connecting to the related tables are hidden, they are never retrieved on the drill through view.  This is good, as they are technical and not needed for analysis.

The order of the columns depends on the order in which you added the RELATED columns on your custom drill through table.  Simply moving the column to another position won’t change it on the drill through view.  To change order, it is necessary to re-create the order you wish to see.

In order to custom drill through views to work, the default context propagation (from the one-side to the many-side) must be respected. As such, if you want to provide a small drill through experience from the visible lookup table you should snowflake the model instead of connecting both tables (visible lookup and custom drill through) to the fact table separately:

image      image

Note: in Excel 2013, there appears to be no way to change the default behavior of 1000 rows retrieved (this was not the case in Excel 2010).

Developing child-aware KPIs in PowerPivot and BISM Tabular

Sometimes you want to see if a parent level has violated a target value based on a comparison at the child level.  For example, you want to browse a dashboard showing Product Categories and visually indicate which of those categories had Product Subcategories that violated a designated sales quota.  Note that the quota is not assigned at the Product Category level, but the Subcategory level instead .

So we first must be sure to create a DAX expression that will decisively compute at the child level desired.  In the example above, the target quotas would look like:

image

I am only showing a few of them but I hope the point is clear (also, for simplicity, I have not defined goal amounts per year, which is certainly doable).

For any given subcategory, we have a determinate ‘goal’ or target sales quota.  As this table is now just a lookup in the model, we can connect it to the dimension with appropriate granularity (DimProductSubcategory)

image 

The expression needed to see what we are comparing against at the Subcategory level is:

Goal:=If (
  ISFILTERED ( DimProductSubcategory[EnglishProductSubcategoryName] ),
  If (
    Sum ( FactResellerSales[SalesAmount] ),
    Sum ( SubcategoryTargetAmount[Goal] )
  )
)

The output of which is:

image

At this point we could think about using built-in KPI capabilities.  In this scenario, however, this would not be helpful as KPIs, by default, compute at the level of granularity displayed and they also include subcategories even if the is no sales or goal data for them:

image 

In the screenshot above, I removed the ISFILTERED condition of the Goal Amount DAX expression so it will show a value at the parent Category level. Notice that the status is green as the comparison – at that level – is favorable.  This is *not* what we want:  instead, it should be flagged as RED when any child level violated the goal amount (which is the case of the Accessories category shown). 

In addition to that, and as mentioned before, note that the KPI is computed for all subcategories, whether they have sales and goals amount or not: again, not ideal.

Instead, we will be creating our own custom KPI.  As such, there is a need to compute the ‘status’ value on our own  through a DAX expression:

KPI:=If (
  ISFILTERED ( DimProductCategory[EnglishProductCategoryName] ),
  SumX (
    Values ( DimProductSubcategory[EnglishProductSubcategoryName] ),
    If (
      [Sum of Sales Amount],
      If ( [Sum of Sales Amount] < [Sum of Goal Amount], – 1, 0 )
    )
  )
)

The expression always computes at the Subcategory level (even at the Category level).  It then rolls-up the amount of children that violated the goal (each of those will be assigned a value of –1).

When consuming the model through a pivot table, we can then use conditional formatting to mimic a regular KPI:

image

Notice that the rule is applied, not to a cell range, but to a measure (KPI).  Also, the value assignment per icon may look a bit strange as it is designed to work with three value bands whereas here we only have two (negative or zero). 

The final output properly display parent level alerts based on children level violations:

image

Its interesting to point out that event though there is 4 children that complied to the KPI rule and other 4 that did not, the parent level category does not “net out” the result.  This is an important distinction and the reason why the DAX expression uses 0 for positive values instead of 1.

Grouping by an Aggregation in DAX

Business users really like the option of comparing category members against the average for all. Let’s review how to accomplish this in DAX.

image

Suppose you what to find out which months had an above average sales performance on the data above (tables come from AdventureWorksDW).  We must first create a new table in the model, to hold the categories we want. In PowerPivot, these values can imported as a linked table:

image 

The resulting model is the following:

image

As sales data is per day, but we want to find out the monthly average, we must use an expression that computes this aggregate at the correct level of granularity:

image

With this, we can now use a measure leveraging a conditional DAX statement to categorize the monthly output as needed:

image

Notice here we also compute at the monthly grain by use of the SUMX function iterating over the list of unique month names. This is important in order to allow for proper aggregation at the subtotal level.

image

The data model allows now to create an interesting visualization, using color as an indicator, to emphasize those months that are below average (to graph the dotted line, I added one more value to the linked table to allow for display of the computed average value itself):

image

Grouping on Quartiles

Another interesting example of is grouping data by quartiles.  In order to do this, I modified Colin Banfield’s quartile DAX expression to dynamically compute based on the “utility” linked table which has no relationships with other tables in the data model, and contains the needed multipliers to compute at each percentile value.

image

A final expression allocates the sales amount to the appropriate quartile column, and ensure the computations happens at the product level:

image

Here is the result:

image

Notice that subtotals are computed on quartiles appropriate for the current category in the filter context.  Same applies to the grand total.

In Power View, I cross referenced the sales amount (on X axis) with two additional, quartile based metrics: count of sold products per category (Y axis) and % sold products over all available category products (bubble size). 

You can see that the “Component” product category had lots of sold products (out the entire category) on any given quartile, however “Bikes” had more products with sales on the 4th quartile than it did on the second or third, and overall was the most profitable category.

image

Leveraging DAX query tables for staging Data Transformations

When designing a data warehousing solution, it is typical to allocate certain data transformations to the ETL layer, particularly those that would be easier or more efficient to do outside of the cube.  At the same time – prior to Excel 2013 release – developing multi-step data transformations in a Data Model was, whenever possible, a factor that increased the complexity (and lowered maintainability) of DAX formulas.

Interestingly though, using DAX query tables - a feature Kasper de Jonge blogged about some time ago here – we now have the ability to generate materialized intermediate tables that can be consumed back into the Data Model, allowing us to generate a process that includes not only a semantic layer but a degree of automated data transformations.

So, as an example, here is what we are going to do:

image

To be clear, the output of the transform (the intermediate or *staging* table) also lives in the Data Model.  Nevertheless, those of you familiar with the data warehousing process will notice a similarity to a typical integration process. Also note that I used linked tables here for simplicity, but other sources are possible as well.

The goal is to take the table below (‘Check-Ins’) and generate a report that breaks down total full hours versus partial hours worked on a daily basis.

image 

As you notice, data is not currently in a way in which is easy to count how many full or partial work hours where recorded.  It would be easier to un-pivot this data, and cross join it with a reference table to break it down by hours. As such, the following generic and re-usable lookup table (‘Time Ranges’) should be added to the Data Model:

image

After importing them as linked tables, we then generate a place-holder table from any of the tables in the model.  This can be done, from Excel, under Data –> Existing Connections – > Tables Tab:

image

At this point, a DAX transformation query can be used to generate a new output.  Right clicking on the table generated gives us the option to edit the DAX query used to generate the intermediate (or staging) table. We are interested in seeing, for each person in the source table, the break down of full and partial hours.  Here is the query (thanks to Laurent C for the ideas on this particular data transformation):

image

                                image

image 

                                                                                   image

image

This table breaks down the full and partial hours worked per person.  It can now be re-imported into the Data Model, and enhanced with additional calculated columns:

CalculatedColumn1 ( FullHours)

=IF([CheckedIn] = "12/30/1899 1:00:00 AM",1)

CalculatedColumn2 (PartialHours)

=IF([CheckedIn] <> "12/30/1899 1:00:00 AM",1)

Out of which two fully additive measures can then be generated:

image 

With data from this intermediate table, the final report is now easy to create:

image

DAX query tables are great but they don’t – by themselves – offer a way to solve an important question:  The sequencing needed to compute the output.  In addition to that, we have the issue of the amount of data refreshes needed:

image

Coding against the sheet object model, however, we can consolidate the refreshes into one while controlling the order of execution required by the calculation.  This is what makes the solutions so similar to an ETL process. We get one button control that can:

1) import all changes from the linked table sources

2) refresh the intermediate DAX table query and

3) reflect the changes on the report, in one pass while controlling the other of execution.  Here is the code:

image

New data can be added to the linked table source, and by clicking on a button (with the VBA RefreshReport macro assigned to it), one can perform the sequential refresh needed for the Data Model to consumes itself in the staging table.

With the particular scenario used here, the whole computation could have been done in a single DAX expression, however, this approach would prevent reusability of the staging table, having to recomputed it again for each measure as it is not materialized.

[Notice Jason and Brent have been added to the source linked table]

image

                                image 

image

An outstanding question remains on how to move these ‘Personal BI’ initiatives into a wider Enterprise model.  It is still necessary to consider BISM Tabular models strip out the Excel dependency of imported PowerPivot models, yet this technique requires that exact dependency to be present.

This article is a companion to Jason Thomas’ most recent blog entry, which uses this technique to generate a data model that allows for cross-dimension filtering.

Running Product in DAX: Calculating Portfolio Returns

Recently a friend of mine asked me how to calculate investment rates of return in PowerPivot.  In order to do this, one must be able to run over a list of values using a multiplication. Currently, DAX does not have a ProductX or MultiplyX function to evaluate an expression in a specified row context followed by a final pass to multiply all computed values. I dealt with this issue in MDX years ago when attempting to do the same.  Although in that case I used the Multiply() .Net stored procedure to achieve the goal within the expression, I still recall seeing how Mosha pointed out a way to mimic a running multiplication using the following formula:

a * b = e^ ( LN(a) + LN(b) ) 

In DAX, we can leverage the same computation.  In simple terms, one can define a calculated column as:

image 

=POWER( 10,
    SUMX( 
        FILTER( Table1,
                Table1[Id] <= EARLIER( Table1[Id] )
            ),
        LOG( Table1[Values] )
    )
)

Notice we can leverage the SUMX expression over a table that grows based on the current row context, and calculate the LOG output which is summed at a second pass.  By raising the final scalar output to the power of 10, we achieve the running product effect:

image

Let’s now apply this technique to solve the question related to investment returns. Drawing from the common financial VAMI definition, we see that we must find a way to apply the logic below when determining returns over time:

Previous Rate of Return x ( 1 + Current Rate of Return )

At any given date, this formula will give us the rate at which the portfolio is performing. Notice that:

  • Although the original definition stipulates monthly returns, it is common to use it with daily returns as well.
  • The formula requires the use of a running multiplication
    I have placed the sample Excel 2013 file for this blog entry here, in case you want to follow the formulas by yourself.  The idea is we will build a hypothetical portfolio and calculate its rate of return.  For this, I downloaded closing prices from 1/3/2012 to 12/18/2012 for three stocks, and uploaded them into a PowerPivot model.

For simplicity, I did not account for dividends and I used the same date (1/3/2012) to establish a position and there were not subsequent buy orders.  Those items can surely be dealt with more DAX logic, but it is not necessary to simply illustrate the point of running multiplications.

I created a linked table with the info below. The quantity of each order serves as a parameter when using portfolio allocation to compute returns:

image

To first calculate the actual market value of the position, we use a measure defined as:

[Position]:=SUMX(
    VALUES(Prices[Ticker] ) ,
    CALCULATE(
        VALUES( Parameters[Quantity] ),
        FILTER(Parameters, Parameters[Ticker] =
EARLIER(Prices[Ticker] )  )
    )  *
    CALCULATE( MAX(  Prices[Close]  ), LASTDATE( Prices[Date] )  )
)

As we want the Portfolio Total (row total) to reflect the correct amount, we must calculate the value at the grain of the ticker symbol.  Additionally, as there is no actual relationship with the parameter table we must establish one through a calculation.  The result multiplies the order quantity for any given ticker symbol times the closing price for the trade date.  As we also want the column total to generate a valid dollar value, we use LASTDATE in the CALCULATE statement that fetches the closing price.  This is because MAX will get the maximum value for all days that make up the grand total row, but we want is the actual value for only the last date. A partial screenshot of the result shows the following:

image

We now want to compute the Daily Delta which is actually the daily rate of return.  For this it is necessary to get the last position value on the most current trade date (before today).  The position value, or market value, for trade date minus one is:

[Market Value T-1]:=CALCULATE(
    [Position],
    FILTER(
    ALLEXCEPT( Prices, Prices[Ticker] ),
    Prices[Date] =
        CALCULATE( MAX( Prices[Date] ),
        FILTER( ALLEXCEPT( Prices, Prices[Ticker] ),
            Prices[Date] < MAX( Prices[Date] )
            )
        )
    )
)

As we cannot be sure what T-1 will be, we must calculate the MAX date before today, and then retrieve the position value on that date:

image

The Daily Delta can now be easily calculated:

[Daily Delta]:=IFERROR( ([Position] – [Market Value T-1])/[Market Value T-1], BLANK() )

image

Notice at this point the Grand Total reflects the actual portfolio daily return, which is not a straight sum or average of the individual ticker symbol returns but instead is calculated at the total market value for the portfolio.

We are finally ready to apply the running product logic in order to determine compounded rates of return:

[Comp ROR]:=POWER( 10 ,
    SUMX(
        FILTER(
            ALL( Prices[Date] ),
            Prices[Date] <= MAX( Prices[Date] )
        ) ,
        LOG( 1 + [Daily Delta] )
    )
) – 1

image

For any given symbol or for the entire portfolio, we can now see what our percent gain or loss was at any given point during the life of the position. Charting this will help better understand the overall picture of our investments. In Excel 2013, it could look like this:

image

The entire report above is built using only the DAX functions discussed, as well as built-in Excel functionality (Moving average trendlines and cube functions are of great help when telling the story of investment performance over an Excel dashboard).  For example, to get the most current report date I used:

="(As of " & CUBEMEMBER("ThisWorkbookDataModel","[Prices].[Date].[All].LastChild") & ")"

Similarly, to get the portfolio return value (14.20 %), I simply queried the model from an worksheet cell with the following formula:

=CUBEVALUE("ThisWorkbookDataModel","[Measures].[Comp ROR]")

Finally, one of the beauties of Excel 2013 PowerPivot models is that there is no need to reopen the PowerPivot window when updating a linked table value.  Simply refreshing the dashboard through the Data tab will work to generate newly computed values.  As such, what-if scenario analysis is greatly simplified.  For example, and as you can see above from the Symbol % Allocation chart, YHOO was only a small part of the portfolio market value, however it had a great run at the end – which can be verified in the Symbol Returns chart.  If I changed the parameters of my report, and allocated a much bigger order to that symbol as in:

image

The returns will be now more interesting:

image

Also, you would notice the three moving averages pointing in one direction: up :-)

Thinking in DAX: Non-linear Cognition at Play

Over the last year, I have had the opportunity to teach PowerPivot and DAX to a good number of technical and non-technical people in a variety of industries.  After each training, I recount what went right and what went wrong, and I have identified some clues that offer some insight on the way people assimilate this technology and ultimately decide whether or not to link themselves to it.

Perhaps a chart can help explain this better.  Let’s step aside from DAX for a moment, and think about a more traditional subject area which is typically taught in school: World History.  

When you are learning World History, each milestone builds on the prior one and there is an incremental process which can be graphed as a curve:

image 

At any give point in the curve you are one step ahead than the prior one.  This, of course, with the exception of very last phase when you have arrived to a theoretical moment of complete assimilation of all knowledge.

If you are learning, for example, about the European Union it is likely you will be building on earlier topics related to World War II. As you follow the chain of topics, you can realistically stop in any of them without harm.

DAX does not  follow this learning pattern.  First of all, even for the some of the most basic Time Intelligence functions like “Running Total since Inception”, there is a need to understand more advanced concepts of filter context.  Due to this, the pattern can instead be charted as the following:

image

DAX can appear simple (and easy) at first due to its similarities with Excel functions.  Fully additive measures can be created quickly. In some cases, PowerPivot even creates them automatically (such is the case of implicit measures).   And you may quickly hit the sudden understanding that DAX generates portable calculations that can be sliced and diced by other data attributes.

At that point, an analyst using PowerPivot may feel empowered and attempt to generate something a little more complex.  Something, for example, like a “Year over Year Growth” ratio.  As there are no built-in functions to get the answer directly, there is a need to understand how to build your own; for first time users, though, the most likely experience is that one getting stuck and experiencing a bit of frustration.

And here lies the power – and the current weakness – of DAX based tools: they are – for the most part – a blank canvas. To develop some of the advanced, non-additive analytical measures you will want to use, there is a need to understand quite a bit of how the engine interprets incoming computations.   But for those that take the time and interest to learn how the language actually works, doors open to a huge analytical arsenal made up of DAX functions and patterns that can be used to develop virtually any kind of calculation needed.

The flat horizontal lines on my “DAX Learning Pattern” chart happens when the person is cognitively stuck in the “I don’t get it! I just don’t see it!” phase.  During that time, they may feel no knowledge is being acquired. They may be working on a DAX problem for a long time with no feeling of being close to a solution.

The best way I can explain this is through a Gestalt example:

image

What do you see?  An old woman with a big nose?  Or a young woman turning to the side?  

Either way, once you see the first one it will take a bit of work – and concentration – to identify the other one.   You may stay looking at the picture for some time but when you finally see it, it will be so obvious you can’t help but seeing it it now.

In DAX, this can happen when learning about the many powers of the CALCULATE function, or when fully assimilating the fact filter context propagate in one direction by default, or even when understanding how to change the granularity of a calculation using an AggX function.

In essence, what is required here is non-linear thinking:  rather than straight step by step logic, what advances our level of DAX is a sequence of cognitive reframing scenarios in which the solution is assimilated based on sudden understanding – many times associated with a loud ‘AHA!’  -, rather than through paced and gradual knowledge acquisition.

Also note that not all horizontal/vertical lines in the DAX learning pattern chart have the same size.  This is because after a deep reframing that leads to better understanding, there may be a small gain in direct formula applications.  In other cases, a less dramatic reframing can have huge impact in formula applications.  Or vice versa. In a similar way, you may be stuck for a short or long time, when trying to grasp how to get a formula to work, or when analyzing a how a working formula actually computes.

Let’s see an example.  Many of you are already familiar with DAX but for the sake of this example, imagine you are learning about the most relevant functions and you hear their definition for the first time.  In that scenario, lets define a few of them:

AVERAGEX/SUMX – Takes a table and iterates through each row calculating an expression.  As a final pass, all row expression output values are averaged or added (depending on the function selected).

VALUES – Returns a one table column removing duplicates over the column that has been used as a parameter.

That is clear and relatively simple.  When testing those functions (in this case, using AdventureWorksDW), we get the Total Product Cost by using:

[Total Product Cost]:=SUMX(
       FactResellerSales,
       FactResellerSales[OrderQuantity] *
       FactResellerSales[ProductStandardCost]
)

Also, I could count the products with sales on any given month with the next expression:

[Count of Products With Sales]:=COUNTROWS(
   VALUES( FactResellerSales[ProductKey] )
)

In the fact table, we have many sales for any given product.  By using VALUES we get the list of unique products with sales. At this point, the implementation of those functions directly reflects the definition given above and the calculation output is straight forward.

Let say somebody asks now to generate a [Total Product Cost] calculation in which, at the monthly level is the default value (sum) but at the yearly level it should be the average of all monthly values

For people learning DAX, this could quickly translate into the flat line on the DAX learning pattern chart. Interestingly, after the example given above – they already know all they need to know in order to get this done.  But it takes some time for the idea to really ‘click’.

Here is the calculation:

AVERAGEX(
          VALUES( DimDate[EnglishMonthName] ),
          [Total Product Cost]
)

If you are new to this type of calculation, you may experience the same thing as when you where looking at the old lady picture above, and focusing hard trying to find the young lady in the very same outline.  You may stay there for a while, but then it finally happens: AHA!.    If you create a list of unique month values at the year level, you can take each individual output – which we arrived at by summing – and then average at that level.  However, at the month level, the average will divide by 1 (the number of unique values at that level), so we keep the output SUM value.

Once the sudden re-framing and aha! moment occurs, you can’t stop seeing the pattern every time you look at the calculation.

In the end, one calculation can be interpreted by the xVelocity engine in multiple different ways, even without a single change on the expression. This is because the interaction of query and filter context can generate different output values using data across many tables at multiple granularities.

Some people seem to be pre-wired for this type of thought process.  To others, this doesn’t come very naturally but can definitely be achieved if there is persistence (that is a big if, considering many people in the business world have much more to do than learning a new tool/technology).

What I have noticed is that non-technical people that end up achieving some level of mastery of DAX is because at some point they are able to shift their attention away from the ‘immediate satisfaction’ of solving the problem at hand and dedicate a few extra minutes to experiment how functions behave under different conditions.  In other words, they become interested in the language itself, not just as an intermediary vehicle.

Not every analyst in the business world or technical person assigned BI responsibilities will have the time and interest to make this shift.  Interestingly, however, some of the people I have seen that most closely relate to this experimentation attitude are some of the subject matter experts that in many cases end up providing QA input for data warehousing projects.  In that sense, I believe the conditions for this technology to flourish are there even when it requires a non-linear thought process that may be foreign to some business users.

Running Total Techniques in DAX

Running Total (whether sum, average or any other aggregate) is one of the most common calculations used when analyzing business as it offers insights in relation to data trends. This article will describe some of the most common scenarios when developing running total calculations.

Note: The following examples use data from the PowerPivot model ‘Contoso Sample DAX Formulas’ available for download here.

Scenario # 1:  Year to Date (as a measure)

Tabular models offers built in DAX calculations that serve as shortcut for generating running total across a time dimension.

For example, the following calculation computes the YTD running total for any given expression over any particular year:

TOTALYTD( SUM( FactSales[SalesAmount] ) , DimDate[Datekey] )

image

Notice that the running total aggregate automatically resets itself at the beginning of each year.  In reality, this calculation is a less verbose variation of the a fundamental expression using CALCULATE:

CALCULATE( SUM( FactSales[SalesAmount] ), DATESYTD( DimDate[Datekey] ) )

Using CALCULATE has the added advantage that you can further modify the current filters in context when generating the running total calculation.  For example, the following DAX measure will compute the running total including only for products which had a price of under $5.00.

CALCULATE(
                      SUM( FactSales[SalesAmount] ),
                      DATESYTD( DimDate[Datekey] ),
                      DimProduct[UnitPrice] < 5
)

image

Scenario # 2:  Year to Date (as a Calculated Column)

As a calculated column over a table displaying aggregate values at the monthly level, we cannot leverage either TOTALYTD nor DATESYTD functions directly (as these require a column of dates). However, we can leverage the EARLIER function when comparing the current row in context against all other rows in the table: (This table is called FactMonthlySales)

CALCULATE(
    SUM( FactMonthlySales[Sales Amount] ),
    FILTER(
        ALLEXCEPT(
            FactMonthlySales,
            FactMonthlySales[Calendar Year]
        ) ,
    FactMonthlySales[Month] <= EARLIER( FactMonthlySales[Month] )
    )
)

image

Notice the Month has been prefixed with the Month Number.  This is necessary as the expression relies on these numbers when performing the comparison necessary to generate the running total. 

Scenario # 3:  Running Total Since Inception (as a Measure)

Another common calculation is the running sum since the beginning of the dataset.  Here is the DAX formula:

CALCULATE(     SUM( FactSales[SalesAmount] ),
    FILTER(
        ALL( DimDate) ,
        DimDate[Datekey] <= MAX( DimDate[Datekey] )
    )
)

Notice we should use MAX instead of a more intuitive LASTEDATE function.  This is because in this context, LASTDATE would evaluate to the last date of the entire dataset, instead of the last date of any given month.

The running total output can then be sliced by different categories to generate interesting comparisons. Below is the running sales amount total comparing orders associated to promotions of any type versus orders with no promotion.

image

Notice how the gap widens over time:  our promotion dollars are having a tangible impact in sales.

Notice also that, unlike traditional Excel formulas, the calculation does not need to change in order to arrive to this output:  We can keep slicing data by different attributes and the calculation will automatically recalculate based on the new filters (by territory, by product type, etc). This is what makes DAX formulas portable – an property that is not new for Business Intelligence professionals but can be very enlightening to Excel pros.

Scenario # 4: Running Total for Selected Years (as a Measure)

Your analysis may require a running total but only for selected years (or any other user selected date attribute). In this case the expression would be a very similar one:

CALCULATE( SUM( FactSales[SalesAmount])  ,
             FILTER(
                ALLSELECTED( DimDate),
                DimDate[Datekey] <= MAX( DimDate[Datekey] )
            )
)

By using the ALLSELECTED() function, the computation now generates a running total only for the years defined by the user:

image

Scenario # 5:  Running Total across all years (as a Measure)

A useful analysis when comparing trends for multiple years is to benchmark any given year against the average across all years. If we try using built-in YTD functions, however, the grand total will only reflect the most current year which is not what we need in this case:

image

Instead, we need to create a DAX formula that will compute – for the grand total – at the year level of granularity.  Only then we can achieve a real running average across all years:

[Rolling Total across All Years]:
=CALCULATE(SUM( FactSales[SalesAmount] )  ,
FILTER(
    SUMMARIZE(
                    all(DimDate),
                    DimDate[Calendar Month],                    
                    "date anchor", date("1999",left(DimDate[Calendar Month],2), "01")
                ), countrows ( filter (
    SUMMARIZE(
                    DimDate,
                    DimDate[Calendar Month],                    
                    "date anchor", date("1999",left(DimDate[Calendar Month],2), "01")
                ), earlier([date anchor]) <= [date anchor] ) ) )
)

This DAX measure uses a calculated table summarized at the monthly level. It is then able to execute this calculation for all the years in context:

image

The expression relies on a pre-existing column (DimDate[Calendar Month) that appends the month number to the month name (like the one used in scenario # 2), with format: “01 – January”.  This can be easily achieved in a calculated column if not present in the underlying data source.

Finally, we must divide the grand total by the year count in order to generate a valid running average.  For denominator, we can use the following DAX expression:

[Year Count]:
=COUNTROWS(
    FILTER(
        CROSSJOIN(
            VALUES(DimDate[CalendarYear] ) ,
            VALUES( DimDate[CalendarMonthLabel] )
        ) ,
    SUM(FactSales[SalesAmount])  <> BLANK() 
    )
)

With this formula as denominator, we can now generate a grand total row we can use to benchmark all years against:

[Rolling Total]:=[Rolling Total Across All Years] / [Year Count]

image

 

This allows for comparison and identification of yearly trends against overall average values, helping to narrow time periods in which performance was particularly good or bad:

image

Running totals are always interesting metrics in analytical reporting, and DAX allows for great degree of manipulation of calculation context which enables very custom aggregates over a Time dimension.

Profiling data in PowerPivot: comparing two tables

On a prior post, I described the process in which one could use PowerPivot to find duplicates when exploring a dataset.  Another common scenario when first diving into new data is finding common records across two different tables.   

Level: Easy

Problem:  You have two datasets and want to compare attribute values across them.  In concrete terms, you want to understand if a particular key value is present in the other dataset or not.

Relevancy: Comparing datasets is a common issue when modeling solutions in PowerPivot or BISM Tabular, as you must determine which table can and should be used as a lookup in your model.  For lookup tables to be valid dimensions, they must contain all members across tables and only contain unique values on the key column. The process of comparing tables can assist you with selecting the appropriate table to use as lookup.

Scenario: The following tables where sent to you as lists of orders. You want to know which table includes all orders.

image

Solution

One important test is to verify how many unique records are on each table.  This will allow us to quickly understand if there are duplicates on either one.

For this, two DAX measures can be used (two for each table):

CountRowsTable1:=COUNTROWS(Table1)

DistinctCountOrdersTable1:=DISTINCTCOUNT(Table1[Orders])

CountRowsTable2:=COUNTROWS(Table2)

DistinctCountOrdersTable2:=DISTINCTCOUNT(Table2[Orders])

Here are the results:

Table1
image

 

Table2
image

 

We now see there appears to be a duplicate value on Table1. We can now use two calculated columns (one each table) to further investigate:

TABLE 1

=CALCULATE( COUNTROWS( Table2 ),

               FILTER( Table2, Table2[Orders] = Table1[Orders] ) )

image

 

TABLE 2

=CALCULATE( COUNTROWS( Table1 ),

               FILTER( Table1, Table1[Orders] = Table2[Orders] ) )

 image

This calculation counts the rows on the other table, but only those rows that survived the filter applied:  an equality condition based on the Order number in the current row context.

With these results we can now see all values on Table1 are represented on Table2.  However, looking at the results on Table2 we see one value is not represented on Table1: Order # 100.  Not only that, but Order # 116 appears twice.

If we switch back to Table1 in the PowerPivot window, we can confirm this by filtering the column:

image

With this knowledge we can now be certain to select Table2 as our lookup (dimension equivalent) in the model, as it is the only one that has a comprehensive list of all order values and only unique entries.

In case neither table had a full list or order values, other techniques like the ones described here, here or here could be the applied to the same data in order to develop a valid PowerPivot (or BISM Tabular) lookup table, and in turn, a maintainable model.

Follow

Get every new post delivered to your Inbox.

Join 45 other followers