Running Product in DAX: Calculating Portfolio Returns

Recently a friend of mine asked me how to calculate investment rates of return in PowerPivot.  To do this, one must be able to run over a list of values using multiplication. Currently, DAX does not have a ProductX or MultiplyX function that evaluates an expression in a specified row context and then makes a final pass to multiply all computed values. I dealt with this issue in MDX years ago when attempting the same thing.  Although in that case I used a Multiply() .NET stored procedure to achieve the goal within the expression, I still recall Mosha pointing out a way to mimic a running multiplication using the following identity:

a * b = e^ ( LN(a) + LN(b) ) 
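
For example, 4 × 25 = e^( LN(4) + LN(25) ) = e^( 1.386 + 3.219 ) = e^4.605 = 100.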

In DAX, we can leverage the same computation (using base-10 logarithms instead of natural ones).  In simple terms, one can define a calculated column as:

image 

=POWER( 10,
    SUMX( 
        FILTER( Table1,
                Table1[Id] <= EARLIER( Table1[Id] )
            ),
        LOG( Table1[Values] )
    )
)

Notice we can leverage the SUMX expression over a table that grows based on the current row context, calculating the LOG of each value, which is then summed in a second pass.  By raising 10 to the power of that final sum, we achieve the running product effect:

image

Let’s now apply this technique to solve the question related to investment returns. Drawing from the common financial VAMI (Value Added Monthly Index) definition, we see that we must find a way to apply the logic below when determining returns over time:

Previous Rate of Return x ( 1 + Current Rate of Return )

At any given date, this formula will give us the rate at which the portfolio is performing. Notice that:

  • Although the original definition stipulates monthly returns, it is common to use it with daily returns as well.
  • The formula requires the use of a running multiplication.

I have placed the sample Excel 2013 file for this blog entry here, in case you want to follow the formulas yourself.  The idea is that we will build a hypothetical portfolio and calculate its rate of return.  For this, I downloaded closing prices from 1/3/2012 to 12/18/2012 for three stocks and uploaded them into a PowerPivot model.

For simplicity, I did not account for dividends, I used the same date (1/3/2012) to establish each position, and there were no subsequent buy orders.  Those items can surely be handled with more DAX logic, but they are not necessary to illustrate the point of running multiplications.

I created a linked table with the info below. The quantity of each order serves as a parameter when using portfolio allocation to compute returns:

image

To first calculate the actual market value of the position, we use a measure defined as:

[Position]:=SUMX(
    VALUES(Prices[Ticker] ) ,
    CALCULATE(
        VALUES( Parameters[Quantity] ),
        FILTER(Parameters, Parameters[Ticker] =
EARLIER(Prices[Ticker] )  )
    )  *
    CALCULATE( MAX(  Prices[Close]  ), LASTDATE( Prices[Date] )  )
)

As we want the Portfolio Total (row total) to reflect the correct amount, we must calculate the value at the grain of the ticker symbol.  Additionally, as there is no actual relationship with the parameter table, we must establish one through a calculation.  The result multiplies the order quantity for any given ticker symbol by the closing price for the trade date.  As we also want the column total to generate a valid dollar value, we use LASTDATE in the CALCULATE statement that fetches the closing price.  This is because MAX would return the maximum value across all days that make up the grand total row, but what we want is the actual value for only the last date. A partial screenshot of the result shows the following:

image

We now want to compute the Daily Delta, which is actually the daily rate of return.  For this it is necessary to get the position value on the most recent trade date before the current one.  The position value, or market value, for trade date minus one is:

[Market Value T-1]:=CALCULATE(
    [Position],
    FILTER(
    ALLEXCEPT( Prices, Prices[Ticker] ),
    Prices[Date] =
        CALCULATE( MAX( Prices[Date] ),
        FILTER( ALLEXCEPT( Prices, Prices[Ticker] ),
            Prices[Date] < MAX( Prices[Date] )
            )
        )
    )
)

As we cannot be sure what T-1 will be, we must calculate the MAX date before today, and then retrieve the position value on that date:

image

The Daily Delta can now be easily calculated:

[Daily Delta]:=IFERROR( ( [Position] - [Market Value T-1] ) / [Market Value T-1], BLANK() )

image

Notice at this point the Grand Total reflects the actual portfolio daily return, which is not a straight sum or average of the individual ticker symbol returns but is instead calculated from the total market value of the portfolio.
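
As a side note, on engines that include the DIVIDE function (Excel 2013 and SSAS 2012 SP1 onward), the guard against a blank or zero denominator in the Daily Delta formula can be written more compactly. This is only an optional variation, not the formula used in the sample workbook:

[Daily Delta]:=DIVIDE( [Position] - [Market Value T-1], [Market Value T-1] )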

We are finally ready to apply the running product logic in order to determine compounded rates of return:

[Comp ROR]:=POWER( 10 ,
    SUMX(
        FILTER(
            ALL( Prices[Date] ),
            Prices[Date] <= MAX( Prices[Date] )
        ) ,
        LOG( 1 + [Daily Delta] )
    )
) - 1

image

For any given symbol or for the entire portfolio, we can now see what our percent gain or loss was at any given point during the life of the position. Charting this will help us better understand the overall picture of our investments. In Excel 2013, it could look like this:

image

The entire report above is built using only the DAX functions discussed, along with built-in Excel functionality (moving average trendlines and cube functions are of great help when telling the story of investment performance on an Excel dashboard).  For example, to get the most current report date I used:

="(As of " & CUBEMEMBER("ThisWorkbookDataModel","[Prices].[Date].[All].LastChild") & ")"

Similarly, to get the portfolio return value (14.20%), I simply queried the model from a worksheet cell with the following formula:

=CUBEVALUE("ThisWorkbookDataModel","[Measures].[Comp ROR]")

Finally, one of the beauties of Excel 2013 PowerPivot models is that there is no need to reopen the PowerPivot window when updating a linked table value.  Simply refreshing the dashboard through the Data tab will generate newly computed values.  As such, what-if scenario analysis is greatly simplified.  For example, as you can see above from the Symbol % Allocation chart, YHOO was only a small part of the portfolio market value; however, it had a great run at the end, which can be verified in the Symbol Returns chart.  If I change the parameters of my report and allocate a much bigger order to that symbol, as in:

image

The returns now become more interesting:

image

Also, you will notice the three moving averages pointing in one direction: up 🙂


Thinking in DAX: Non-linear Cognition at Play

Over the last year, I have had the opportunity to teach PowerPivot and DAX to a good number of technical and non-technical people in a variety of industries.  After each training, I recount what went right and what went wrong, and I have identified some clues that offer insight into the way people assimilate this technology and ultimately decide whether or not to commit to it.

Perhaps a chart can help explain this better.  Let’s step aside from DAX for a moment, and think about a more traditional subject area which is typically taught in school: World History.  

When you are learning World History, each milestone builds on the prior one and there is an incremental process which can be graphed as a curve:

image 

At any given point in the curve you are one step ahead of the prior one.  This, of course, with the exception of the very last phase, when you have arrived at a theoretical moment of complete assimilation of all knowledge.

If you are learning, for example, about the European Union, it is likely you will be building on earlier topics related to World War II. As you follow the chain of topics, you can realistically stop at any of them without harm.

DAX does not follow this learning pattern.  First of all, even for some of the most basic Time Intelligence calculations like “Running Total since Inception”, there is a need to understand more advanced concepts of filter context.  Due to this, the pattern can instead be charted as the following:

image

DAX can appear simple (and easy) at first due to its similarities with Excel functions.  Fully additive measures can be created quickly; in some cases, PowerPivot even creates them automatically (as is the case with implicit measures).  And you may quickly reach the sudden understanding that DAX generates portable calculations that can be sliced and diced by other data attributes.

At that point, an analyst using PowerPivot may feel empowered and attempt to generate something a little more complex.  Something, for example, like a “Year over Year Growth” ratio.  As there are no built-in functions to get the answer directly, there is a need to understand how to build your own; for first-time users, though, the most likely experience is that of getting stuck and feeling a bit of frustration.

And here lies the power – and the current weakness – of DAX based tools: they are – for the most part – a blank canvas. To develop some of the advanced, non-additive analytical measures you will want to use, there is a need to understand quite a bit of how the engine interprets incoming computations.   But for those that take the time and interest to learn how the language actually works, doors open to a huge analytical arsenal made up of DAX functions and patterns that can be used to develop virtually any kind of calculation needed.

The flat horizontal lines on my “DAX Learning Pattern” chart happen when the person is cognitively stuck in the “I don’t get it! I just don’t see it!” phase.  During that time, they may feel no knowledge is being acquired. They may be working on a DAX problem for a long time with no feeling of being close to a solution.

The best way I can explain this is through a Gestalt example:

image

What do you see?  An old woman with a big nose?  Or a young woman turning to the side?  

Either way, once you see the first one it will take a bit of work – and concentration – to identify the other one.   You may stare at the picture for some time, but when you finally see it, it will be so obvious you can’t help but see it now.

In DAX, this can happen when learning about the many powers of the CALCULATE function, when fully assimilating the fact that filter context propagates in one direction by default, or when understanding how to change the granularity of a calculation using an AggX function.

In essence, what is required here is non-linear thinking:  rather than straight step-by-step logic, what advances our level of DAX is a sequence of cognitive reframing scenarios in which the solution is assimilated through sudden understanding – many times accompanied by a loud ‘AHA!’ – rather than through paced and gradual knowledge acquisition.

Also note that not all horizontal/vertical lines in the DAX learning pattern chart have the same size.  This is because after a deep reframing that leads to better understanding, there may be only a small gain in direct formula applications.  In other cases, a less dramatic reframing can have a huge impact on formula applications.  Or vice versa. In a similar way, you may be stuck for a short or long time when trying to grasp how to get a formula to work, or when analyzing how a working formula actually computes.

Let’s see an example.  Many of you are already familiar with DAX, but for the sake of this example, imagine you are learning about the most relevant functions and you hear their definitions for the first time.  In that scenario, let’s define a few of them:

AVERAGEX/SUMX – Takes a table and iterates through each row calculating an expression.  As a final pass, all row expression output values are averaged or added (depending on the function selected).

VALUES – Returns a one-column table, removing duplicates from the column that has been passed as a parameter.

That is clear and relatively simple.  When testing those functions (in this case, using AdventureWorksDW), we get the Total Product Cost by using:

[Total Product Cost]:=SUMX(
       FactResellerSales,
       FactResellerSales[OrderQuantity] *
       FactResellerSales[ProductStandardCost]
)

Also, I could count the products with sales on any given month with the next expression:

[Count of Products With Sales]:=COUNTROWS(
   VALUES( FactResellerSales[ProductKey] )
)

In the fact table, we have many sales for any given product.  By using VALUES we get the list of unique products with sales. At this point, the implementation of those functions directly reflects the definitions given above and the calculation output is straightforward.

Let’s say somebody now asks us to generate a [Total Product Cost] calculation in which the monthly level shows the default value (the sum), but the yearly level shows the average of all monthly values.

For people learning DAX, this could quickly translate into the flat line on the DAX learning pattern chart. Interestingly, after the example given above, they already know all they need to know in order to get this done.  But it takes some time for the idea to really ‘click’.

Here is the calculation:

AVERAGEX(
          VALUES( DimDate[EnglishMonthName] ),
          [Total Product Cost]
)

If you are new to this type of calculation, you may experience the same thing as when you were looking at the old woman picture above, focusing hard trying to find the young woman in the very same outline.  You may stay there for a while, but then it finally happens: AHA!  If you create a list of unique month values at the year level, you can take each individual output – which we arrived at by summing – and then average at that level.  At the month level, however, the average divides by 1 (the number of unique values at that level), so we keep the summed value.

Once the sudden re-framing and aha! moment occurs, you can’t stop seeing the pattern every time you look at the calculation.

In the end, one calculation can be interpreted by the xVelocity engine in multiple different ways, even without a single change to the expression. This is because the interaction of query and filter context can generate different output values using data across many tables at multiple granularities.

Some people seem to be pre-wired for this type of thought process.  To others, this doesn’t come very naturally but can definitely be achieved if there is persistence (that is a big if, considering many people in the business world have much more to do than learning a new tool/technology).

What I have noticed is that the non-technical people who end up achieving some level of mastery of DAX do so because at some point they are able to shift their attention away from the ‘immediate satisfaction’ of solving the problem at hand and dedicate a few extra minutes to experimenting with how functions behave under different conditions.  In other words, they become interested in the language itself, not just as an intermediary vehicle.

Not every analyst in the business world or technical person assigned BI responsibilities will have the time and interest to make this shift.  Interestingly, however, some of the people I have seen who most closely relate to this experimentation attitude are the subject matter experts that in many cases end up providing QA input for data warehousing projects.  In that sense, I believe the conditions for this technology to flourish are there, even when it requires a non-linear thought process that may be foreign to some business users.

Running Total Techniques in DAX

A running total (whether a sum, average, or any other aggregate) is one of the most common calculations used when analyzing a business, as it offers insights into data trends. This article describes some of the most common scenarios encountered when developing running total calculations.

Note: The following examples use data from the PowerPivot model ‘Contoso Sample DAX Formulas’ available for download here.

Scenario # 1:  Year to Date (as a measure)

Tabular models offer built-in DAX functions that serve as shortcuts for generating running totals across a time dimension.

For example, the following calculation computes the YTD running total for any given expression over any particular year:

TOTALYTD( SUM( FactSales[SalesAmount] ) , DimDate[Datekey] )

image

Notice that the running total aggregate automatically resets itself at the beginning of each year.  In reality, this calculation is a less verbose variation of a more fundamental expression using CALCULATE:

CALCULATE( SUM( FactSales[SalesAmount] ), DATESYTD( DimDate[Datekey] ) )

Using CALCULATE has the added advantage that you can further modify the current filters in context when generating the running total calculation.  For example, the following DAX measure will compute the running total including only products which had a price under $5.00:

CALCULATE(
    SUM( FactSales[SalesAmount] ),
    DATESYTD( DimDate[Datekey] ),
    DimProduct[UnitPrice] < 5
)

image

Scenario # 2:  Year to Date (as a Calculated Column)

As a calculated column over a table displaying aggregate values at the monthly level, we cannot leverage either the TOTALYTD or DATESYTD functions directly (as these require a column of dates). However, we can leverage the EARLIER function to compare the current row in context against all other rows in the table (this table is called FactMonthlySales):

CALCULATE(
    SUM( FactMonthlySales[Sales Amount] ),
    FILTER(
        ALLEXCEPT(
            FactMonthlySales,
            FactMonthlySales[Calendar Year]
        ) ,
    FactMonthlySales[Month] <= EARLIER( FactMonthlySales[Month] )
    )
)

image

Notice the Month has been prefixed with the month number.  This is necessary because the expression relies on these numbers when performing the comparison that generates the running total.

Scenario # 3:  Running Total Since Inception (as a Measure)

Another common calculation is the running sum since the beginning of the dataset.  Here is the DAX formula:

CALCULATE(     SUM( FactSales[SalesAmount] ),
    FILTER(
        ALL( DimDate) ,
        DimDate[Datekey] <= MAX( DimDate[Datekey] )
    )
)

Notice we should use MAX instead of the more intuitive LASTDATE function.  This is because in this context, LASTDATE would evaluate to the last date of the entire dataset, instead of the last date of any given month.

The running total output can then be sliced by different categories to generate interesting comparisons. Below is the running sales amount total comparing orders associated with promotions of any type versus orders with no promotion.

image

Notice how the gap widens over time:  our promotion dollars are having a tangible impact on sales.

Notice also that, unlike traditional Excel formulas, the calculation does not need to change in order to arrive at this output:  we can keep slicing the data by different attributes and the calculation will automatically recalculate based on the new filters (by territory, by product type, etc.). This is what makes DAX formulas portable – a property that is not new for Business Intelligence professionals but can be very enlightening to Excel pros.

Scenario # 4: Running Total for Selected Years (as a Measure)

Your analysis may require a running total but only for selected years (or any other user-selected date attribute). In this case the expression would be very similar:

CALCULATE( SUM( FactSales[SalesAmount])  ,
             FILTER(
                ALLSELECTED( DimDate),
                DimDate[Datekey] <= MAX( DimDate[Datekey] )
            )
)

By using the ALLSELECTED() function, the computation now generates a running total only for the years defined by the user:

image

Scenario # 5:  Running Total across all years (as a Measure)

A useful analysis when comparing trends for multiple years is to benchmark any given year against the average across all years. If we try using the built-in YTD functions, however, the grand total will only reflect the most current year, which is not what we need in this case:

image

Instead, we need to create a DAX formula that will compute – for the grand total – at the year level of granularity.  Only then can we achieve a real running average across all years:

[Rolling Total Across All Years]:=CALCULATE(
    SUM( FactSales[SalesAmount] ),
    FILTER(
        SUMMARIZE(
            ALL( DimDate ),
            DimDate[Calendar Month],
            "date anchor", DATE( "1999", LEFT( DimDate[Calendar Month], 2 ), "01" )
        ),
        COUNTROWS(
            FILTER(
                SUMMARIZE(
                    DimDate,
                    DimDate[Calendar Month],
                    "date anchor", DATE( "1999", LEFT( DimDate[Calendar Month], 2 ), "01" )
                ),
                EARLIER( [date anchor] ) <= [date anchor]
            )
        )
    )
)

This DAX measure uses a calculated table summarized at the monthly level. It is then able to execute this calculation for all the years in context:

image

The expression relies on a pre-existing column (DimDate[Calendar Month]) that appends the month number to the month name (like the one used in Scenario # 2), with the format “01 - January”.  This can easily be achieved in a calculated column if it is not present in the underlying data source.
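
If such a column does need to be created, a minimal sketch of a calculated column on the date table could look like the one below. The column names DimDate[MonthNumberOfYear] and DimDate[EnglishMonthName] are assumptions based on a typical date table and may need to be adjusted to your model:

=FORMAT( DimDate[MonthNumberOfYear], "00" ) & " - " & DimDate[EnglishMonthName]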

Finally, we must divide the grand total by the year count in order to generate a valid running average.  For the denominator, we can use the following DAX expression:

[Year Count]:=COUNTROWS(
    FILTER(
        CROSSJOIN(
            VALUES( DimDate[CalendarYear] ),
            VALUES( DimDate[CalendarMonthLabel] )
        ),
        CALCULATE( SUM( FactSales[SalesAmount] ) ) <> BLANK()
    )
)

With this formula as denominator, we can now generate a grand total row we can use to benchmark all years against:

[Rolling Total]:=[Rolling Total Across All Years] / [Year Count]

image

 

This allows for comparison and identification of yearly trends against overall average values, helping to narrow time periods in which performance was particularly good or bad:

image

Running totals are always interesting metrics in analytical reporting, and DAX allows for a great degree of manipulation of the calculation context, which enables highly customized aggregates over a time dimension.

Profiling data in PowerPivot: comparing two tables

In a prior post, I described the process by which one can use PowerPivot to find duplicates when exploring a dataset.  Another common scenario when first diving into new data is finding common records across two different tables.

Level: Easy

Problem:  You have two datasets and want to compare attribute values across them.  In concrete terms, you want to understand if a particular key value is present in the other dataset or not.

Relevancy: Comparing datasets is a common issue when modeling solutions in PowerPivot or BISM Tabular, as you must determine which table can and should be used as a lookup in your model.  For lookup tables to be valid dimensions, they must contain all members across tables and contain only unique values in the key column. The process of comparing tables can assist you with selecting the appropriate table to use as a lookup.

Scenario: The following tables were sent to you as lists of orders. You want to know which table includes all orders.

image

Solution

One important test is to verify how many unique records are on each table.  This will allow us to quickly understand if there are duplicates on either one.

For this, four DAX measures can be used (two for each table):

CountRowsTable1:=COUNTROWS(Table1)

DistinctCountOrdersTable1:=DISTINCTCOUNT(Table1[Orders])

CountRowsTable2:=COUNTROWS(Table2)

DistinctCountOrdersTable2:=DISTINCTCOUNT(Table2[Orders])

Here are the results:

Table1
image

 

Table2
image

 

We now see there appears to be a duplicate value on Table1. We can use two calculated columns (one for each table) to investigate further:

TABLE 1

=CALCULATE( COUNTROWS( Table2 ),
    FILTER( Table2, Table2[Orders] = Table1[Orders] ) )

image

 

TABLE 2

=CALCULATE( COUNTROWS( Table1 ),
    FILTER( Table1, Table1[Orders] = Table2[Orders] ) )

 image

This calculation counts the rows on the other table, but only those rows that survived the filter applied:  an equality condition based on the Order number in the current row context.

With these results we can now see all values on Table1 are represented on Table2.  However, looking at the results on Table2 we see one value is not represented on Table1: Order # 100.  Not only that, but Order # 116 appears twice.

If we switch back to Table1 in the PowerPivot window, we can confirm this by filtering the column:

image

With this knowledge we can now be certain to select Table2 as our lookup (dimension equivalent) in the model, as it is the only one that has a comprehensive list of all order values and only unique entries.

In case neither table had a full list of order values, other techniques like the ones described here, here or here could be applied to the same data in order to develop a valid PowerPivot (or BISM Tabular) lookup table and, in turn, a maintainable model.

Automate lookup table maintenance when using multiple Excel sources in a PowerPivot model

When developing a PowerPivot model, one of the most important tasks is deciding which tables will serve as lookups.  The urgency is due to the fact that those tables define important elements such as the default granularity of aggregation as well as the direction of filter context propagation (this last one explained here).  Without valid lookups in your model, developing a PowerPivot report can be a bit frustrating, as you constantly hit walls that arise when breaking fundamental rules of its calculation engine.

I explained here a solution for creating lookup tables when the model uses data that exists in Excel worksheets.  The solution works well to satisfy one-time ad hoc reporting requests, or when building a BISM Tabular prototype.

However, what if you want to avoid having to manually maintain the values that make up those lookup tables?  By using a slight variation of the data merging method detailed by Debra Dalgleis here, we can achieve this automation. In this case, though, we won’t be merging files with identical structures but instead selecting common columns across datasets with the explicit objective of generating lookup tables.

Suppose we have two Excel file sources:

ORDERS1.XLSX

image

and ORDERS2.XLSX

image

Note that we have two files with different structures (column names and total number of columns).  If we wanted to tie both tables together in a PowerPivot model (for example, using the common Order Date column on both sides) we would get the famous error:

image

As already mentioned, what we need is to create a valid lookup table which serves as an intermediary between these two base tables (base, as neither one can serve as a lookup). In this case, instead of using the ‘remove duplicates’ feature in Excel, we are going to leverage the ability to use SQL syntax when querying multiple Excel files.

Add a connection in the Excel window (not the PowerPivot window) by browsing to the Excel file source:

image

In the PowerPivot window, open the connection to the workbook by going to ‘Existing connections’, ‘Workbook Connections’:

image

When opening that workbook connection, change the friendly name of the in-memory table to ‘Date Lookup’.

Open the ‘Table Properties’ menu on the PowerPivot window, switch from ‘Table Preview’ to ‘Query Editor’ and change the query to:

SELECT [Order Date]  FROM [Sheet1$]
UNION
SELECT [Order Date] FROM `C:\temp\Orders2.xlsx`.[Sheet1$]

With the result of this query, we can now designate a valid lookup table that is used to connect all other tables appropriately:

image

We could now generate DAX artifacts to consolidate values across the model.  For example, the following calculated column can compute an Order Count across all dates (in both original tables):

=CALCULATE(
    COUNTROWS( VALUES( Orders1[Order Number] ) )
    +
    COUNTROWS( VALUES( Orders2[Invoice Number] ) )
)

image

Note at this point that unlike the technique described by Debra,  the structure of the files here was not identical: one dataset listed order numbers whereas the other one listed invoice numbers. Using a combination of custom SQL syntax and DAX, we are able to achieve the structure needed to consolidate both lists of values into a comprehensive list that serves as a perfect candidate for a lookup (dimensional) table in our model.

This method provides a way to better automate PowerPivot reports that get data from outside of traditional corporate repositories. Many systems can be automated to generate Excel ‘data dumps’.  Traditionally, these have been a challenge when dealing with PowerPivot models with clearly defined dimensional schemas.  This technique, however, provides a viable method of maintaining the values of lookup tables without the need for manual intervention: as new data is added to either file, the load query automatically captures all new values from both sides and, in doing so, regenerates the lookup table.

Profiling data in PowerPivot: Detecting duplicates

Generally – when you are given an unexplored dataset – your first task should be to identify some of its properties in order to make a more informed decision regarding the accuracy of reporting you can achieve with it.  Many Business Intelligence professionals have robust tools at their disposal that can be used to deeply explore a dataset; however, data analysts can use the inherent capabilities of PowerPivot to get the job done.

One of the most common data profiling assessments is identifying duplicate values.  This is important as we want to decide which sets of data can be good candidates for generating lookup tables in our model, and this is of particular relevance when querying transactional sources that do not have clearly defined dimensions.

For example, say you want to load a product table into your model. In this case you are actually querying the enterprise data warehouse and the table you are using stores data for a slowly changing dimension of type II.  For this reason it can have legitimate reasons for having duplicate product names: it is storing the product attributes as they have changed over time:

image

Notice the highlighted product (AWC Logo Cap) has three entries on that table.  Over time, the product had different standard costs associated with it, and the data warehousing team wanted to keep all history associated with past records.

If we intend, however, to mesh this data with another dataset that somebody sent to us (which did not come from the data warehouse and only has the product name on it – not the product key), we would want to know:

a) Are there duplicates on this table? We want to ensure this table can be used to generate a valid lookup table in the model, and PowerPivot requires unique values on the key column of a lookup table when establishing relationships.

b) If there are duplicate values, are they a data quality problem or do they exist for legitimate reasons?

In our example, we know there are no data quality issues, and we could easily generate a valid lookup out of this table simply by going to the table import wizard and selecting the [EnglishProductName] field while specifying a ‘Grouped by’ aggregation (using the same techniques detailed on this blog entry), or by issuing a T-SQL DISTINCT clause on the underlying query.

If the duplicates are based on data quality issues, we could instead ask the data source owner to fix them directly, even before it makes it to PowerPivot.

So now that we have determined that it is worthwhile identifying which records are duplicate, we need to know how to do this quickly using DAX.

On the dataset above, this can be done using the following DAX calculation on a new column:

=CALCULATE( COUNTROWS( DimProduct ),
    ALLEXCEPT( DimProduct, DimProduct[EnglishProductName] ) )

In other words, we want to count all rows on the table by establishing a filter context in which only the column we are testing is on it:

=CALCULATE( COUNTROWS( 'Table' ),
    ALLEXCEPT( 'Table', 'Table'[Column Tested for Duplicates] ) )

image

It is easy then to filter that column to only detect the products with more than one entry on that table. Notice how the column filter also quickly tells us there are products with 2 entries and others with 3 entries:

image

I used DAX’s ALLEXCEPT function, as we only wanted to test for duplicates over one product attribute ([EnglishProductName]).  Had we been detecting duplicates on a one-column table (one with only the column we are testing), we could have avoided this:

=CALCULATE( COUNTROWS( 'DimProduct' ) )

image

This is because on a one-column table, there is no need to exclude other attributes that can lead to duplicates.  On the other hand, when other columns exist, there is always the risk of having attributes with different values for the same dimension member (products, in this case).  An example would be the one mentioned above in which the product standard cost has changed over time.  As such, we must remove all those attributes from the filter context when we want to narrow our investigation to only duplicate names.

Once you have identified the problematic records, it will be easier to communicate the issue to the data source owner and more rapidly identify a course of action.  Even if you decide to use this table as a lookup in your data model, you will now be aware of potential future issues that you can proactively plan for.

Help us promote SQL Saturday 174 and get free training in return!

If you intend to attend the SQL Saturday ‘BI’ Edition in Charlotte, NC on October 27, you may be the lucky winner of an *Annual Plus* subscription from PluralSight, a leading developer training company (valued at $499).

How?  Help us promote the event by sending an email to your friends and colleagues encouraging them to attend.  Please be sure to cc SQLSaturday174@SQLSaturday.com so we know you are spreading the word (be sure to direct your contacts to register on our event website @ http://www.sqlsaturday.com/174/eventhome.aspx ).

Send the email between now and midnight on Sept. 30th and you will be entered in a drawing to win the free training.  We’ll do the drawing at the SQL Saturday closing remarks.

[your name can only be entered once and you must be present to win]

Thank you and I hope to see you there! 🙂

image

VLOOKUP conditional flow equivalent in DAX

Recently I was browsing an Excel forum and someone asked an interesting question: how can one write a formula in which a value must be found in a lookup table, and if none is found, instruct the expression to retrieve the value from another lookup table?

In Excel, this can be done using a formula like this one:

=IFERROR(VLOOKUP(A1,TableA,2,FALSE),VLOOKUP(A1,TableB,2,FALSE))

Why would we want to use this kind of operation? Consider, for example, consolidating a table of account balances.  Given the existence of multiple ‘account types’, not all accounts might be stored in the same table.  For example:

image

The behavior in Excel relies on the fact that an unmatched VLOOKUP condition will return an error (#N/A).  In DAX, however, an unmatched RELATED() function simply returns an empty value.  As such, an equivalent DAX expression would not yield the appropriate values.

This is the Tabular data model we are working with:

image

The equivalent DAX calculated column would yield the following (in the ‘Total List of Accounts’ table):

=IFERROR( RELATED( 'Credit Cards'[balance] ), RELATED( 'Checking Accounts'[balance] ) )

image

In other words, only one relationship gets resolved.  Armed with the knowledge that RELATED() returns empty values for unmatched elements, we are tempted to use ISBLANK() in the expression:

=IF(
    ISBLANK( RELATED( 'Credit Cards'[balance] ) ),
    RELATED( 'Checking Accounts'[balance] ),
    RELATED( 'Credit Cards'[balance] )
)

However, this yields the same result as the one shown above.  In my opinion, this is a bug, and I have opened a Connect item about it (you can vote for it here).

So how can this be solved?    As a calculated column, we can simply add the output of both RELATED() calls to get the result we need (an unmatched RELATED() returns a blank, which does not affect the sum):

=RELATED( 'Credit Cards'[balance] ) + RELATED( 'Checking Accounts'[balance] )

image

If you suspect you may have an identifier that is present in both tables, it will be necessary to understand which business rule should be used to prefer one over the other.  If, as an arbitrary example, we wanted to prefer credit cards over checking accounts, the expression would be:

=IF(
    CALCULATE( COUNTROWS( 'Credit Cards' ) ),
    RELATED( 'Credit Cards'[balance] ),
    RELATED( 'Checking Accounts'[balance] )
)

If a measure were needed instead – perhaps because you need the balance to be used within an extended expression leveraging the query context – the formula would only need to use a function that generates row context, like SUMX:

=SUMX(
    'Total List of Accounts',
    IF( CALCULATE( COUNTROWS( 'Credit Cards' ) ),
        RELATED( 'Credit Cards'[balance] ),
        RELATED( 'Checking Accounts'[balance] ) )
)

image

PowerPivot and Beyond: Expanding your analytical repertoire

One of those PowerPivot features that is not very well known is that PowerPivot can be used as a database serving many other tools besides Excel.

I blogged about it a little bit here when I mentioned that PowerPivot can be consumed through other tools like Tableau.  However, I think it is worth expanding on the topic a bit, as this feature can be very useful for people migrating from “Personal BI” to “Team BI” or even “Corporate BI”.

Imagine a scenario in which you have developed a PowerPivot workbook and have then published it to a PowerPivot Gallery in SharePoint.  Your report has proven to be a success and its popularity has been rising across your company.   A business unit manager now wants to distribute this report to his team members, but filtered to only the data that pertains to each of them.  The PowerPivot workbook has a large amount of data, demanding the 64-bit version of PowerPivot.  The manager as well as his team members cannot download the workbook and use it locally, as they only have the 32-bit version of PowerPivot installed and, due to corporate policies, they do not expect to be upgraded soon.

One option is for the manager to manually copy and paste the pivot table displayed in the browser into a separate Excel file.  He would have to do this every day, for each team member, as he can only send the portion of the data that belongs to that specific employee.  Not a great solution.

The other option would be to create a separate PowerPivot workbook for each of the team members with only his data.  Not a great solution either as the data model and DAX calculations would have to be replicated, and if you need to change one of them one day, you would have to remember to change it in each of the replicas.

What to do?  I want to point out that you can read a PowerPivot workbook that has been published to SharePoint from a great number of other tools besides Excel. Some of these tools can enhance the PowerPivot implementation, giving us features to solve the scenario in question.

For this example, let’s stick with Microsoft reporting tools.  Besides Excel itself, you can read PowerPivot data from Report Builder, SQL Server Reporting Services and PerformancePoint.  All of them have the capability to read from published PowerPivots.  SQL Server Reporting Services (SSRS), for example, will give you additional capabilities that will allow you to leverage your PowerPivot workbook but offer more advanced reporting layouts and delivery mechanisms, this last feature being what we need for our example.

When you read PowerPivot from another tool, you are not reading the pivot table in the workbook but the entire data model available in the PowerPivot window. This is true even for Excel itself, which is just another client to the PowerPivot database.

Let’s continue with the scenario given.  This is how my PowerPivot workbook looks in SharePoint once I have published it:

image

So far so good.  This is what most of the business users want to see and interact with.  

But other than the nice Silverlight interface, something else happens when the PowerPivot workbook is published and someone interacts with it (via slicers or filters), or when an automatic data refresh is scheduled:  a real database is generated in the form of a SQL Server Analysis Services BISM Tabular model.  You can verify this by opening the dedicated SQL Server Analysis Services instance which the PowerPivot add-in for SharePoint uses (this instance is a requirement for the add-in to work).  In the case of my published report, I see the following through SQL Server Management Studio (SSMS):

image

The highlighted database is quite literally my published workbook, in database form. If you happen to know some MDX you can then open a new query window and query the PowerPivot database (called a “model”) using this language.  Note that you can be querying, in MDX, a measure that was developed in DAX – like in the case below.  Additionally, you can create your own MDX calculated members as part of the query if necessary.

image

If you feel more familiar with DAX, though, you can query the PowerPivot model with it instead of MDX:

image
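
For reference, a DAX query against the published model is simply an EVALUATE statement that returns a table. Below is a minimal sketch; the measure name [Sum of SalesAmount] and the DimDate[CalendarYear] column are assumptions that would need to match whatever actually exists in the published workbook:

EVALUATE
ADDCOLUMNS(
    VALUES( DimDate[CalendarYear] ),
    "Total Sales", [Sum of SalesAmount]
)
ORDER BY DimDate[CalendarYear]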

The point here is that PowerPivot is now stripped of its Excel dependency and can be queried by many other tools (when SharePoint renders the workbook in the browser, it uses this SQL Server Analysis Services model with an Excel Services front end; the Excel desktop application is not needed anymore).

Going back to our scenario, we will be building a solution in SQL Server Reporting Services that takes a parameter and delivers a scheduled report based on filtered data coming from the PowerPivot model.

If you are not familiar with SQL Server Reporting Services (SSRS), I would recommend you start with Report Builder 3.0, which is a more data-analyst-centric version of the full-blown SSRS product, yet it leverages the same underlying technology.

When connecting SSRS to PowerPivot, remember to always do it through the URL that hosts the workbook in SharePoint.  This is a very important point: when you enhance your PowerPivot workbook and republish it, a new cached copy will be created in SQL Server Analysis Services.  By using the URL instead of a direct connection to the model, you will be certain to always consume the ‘latest & greatest’ version of the workbook.

In Visual Studio 2010, you can create a new SSRS data source by selecting the Analysis Services provider:

image

Click on Edit and use the SharePoint location of the PowerPivot workbook to select the Model:

image

With the connection to PowerPivot now created, you can use the MDX query editor to drag and drop the elements needed for the report.  In this case, I have created a parameter based on employee email, as I will be using this to generate a filtered version of the report for each employee I am interested in:

image

Below is how the final PowerPivot-connected SSRS report looks.  The currency amounts shown are associated only with the employee in question (filtered by his email account, selected with the dropdown at the top of the report):

image 

We can now deploy the report to the report server:

image

In order to schedule an automated run that goes through a number of email values and delivers the report with filtered data, we must configure an SSRS data-driven subscription.  Notice that by doing so we are now using capabilities beyond those of a pure PowerPivot implementation.

There are many websites that explain the process of generating an SSRS subscription, so I won’t go into a lot of detail here, but I do want to show how the process looks in general terms.

What we want to do at this point is select the report and click on ‘Manage’.  That will give you access to a set of features built into the SSRS framework. 

image

First, select ‘data sources’.  On it, be sure to specify the account that will be used to query the PowerPivot model:

image

After that, click on ‘Subscriptions’ and then on ‘New Data-Driven Subscription’.  Following that, give your new subscription a name and specify a data source that contains recipient information.  This data source does *not* need to be in PowerPivot; it can be a table in a SQL Server database that specifies where and how to deliver the automatic snapshot.

In my case, I decided to create a file on a network share instead of using email delivery.  As such, the query specified next in the wizard should contain that information, as well as specify how the report parameter (employee email address) should be set on each iteration of the subscription:

image

Here is how my SSRS_Subscription table looks:

image

In other words, I want to create two files:  one called ‘JavierSnapshot’, which is the report filtered for the email address amy0@adventure-works.com, and another one called ‘LeeSnapshot’, which should filter the report for garrett1@adventure-works.com.  These are just examples; in your case you may want to use a specific output file mapping, for example, mapping sales managers to sales territories.

The SSRS subscription wizard now instructs us to map the output of the table with the chosen delivery attributes (file delivery in this case), as well as mapping output columns with report parameters:

image

At the end, the wizard asks you for the schedule on which you want this process to execute. 

Here is what the output looks like, in my case, after the subscription has automatically executed:

image

In summary, each PDF was generated by the SSRS report querying the PowerPivot model, but only filtering for a specific employee and automatically delivering the output to a folder in the network.  Pretty useful functionality that can enhance the options available to your self-service environment.

Determining lost or acquired customers in DAX

I was asked recently how to determine which customers were acquired or lost from last month to the current month.  Here is a small sample of what a fact table would look like, registering customers that ordered over time:

image

Three interesting questions that can be asked are:

  • What is the number of returning customers?
  • Which customers were added in the last month?
  • Which customers were lost in the last month?

We can solve these questions by extending the pattern outlined by Chris Webb here.  But before we apply the pattern, it is necessary to generate a real date column.  This can easily be done by creating a calculated column with the following simple formula:

=[Month]

after which we change the data type to ‘date’, which in effect converts the text value into a date field:

image

We can now apply a pattern similar to the one Chris describes:

Returning Customers:=IF( HASONEVALUE( Table1[Date] ),
    IF( DATEADD( VALUES( Table1[Date] ), -1, MONTH ) <> BLANK(),
        COUNTROWS(
            CALCULATETABLE(
                DISTINCT( Table1[Customer Name] ),
                CALCULATETABLE(
                    DISTINCT( Table1[Customer Name] ),
                    DATESBETWEEN( Table1[Date], BLANK(),
                        DATEADD( VALUES( Table1[Date] ), -1, MONTH )
                    ),
                    ALL( Table1 )
                )
            )
        )
    )
)

The result is a DAX measure that will give us the accurate count of customers that had purchased in the past and have decided to buy from us again (Mary Evans and John Rogers):

image

This expression stacks two filtered tables comprising two different contexts:  the outer DISTINCT over customer names gives us the current customers, while the inner CALCULATETABLE gives us the list of customers through the previous month.  In DAX, the parameters used in a CALCULATE or CALCULATETABLE statement are intersected, and the resulting output is the set of common rows across both filtered tables.

Let’s move on now to the question of how to determine which customers were acquired this month.  We can use a similar technique, but rather than leveraging the default intersect behavior when modifying the filter context via CALCULATE or CALCULATETABLE, we must now generate an ‘except’ condition between two sets of data:  we want to count all customers that purchased this month except the ones that had purchases last month:

Acquired Customers:=IF(HASONEVALUE( Table1[Month] ),
    IF( DATEADD( VALUES(Table1[Date]) , -1, MONTH) <> BLANK(), 
    CALCULATE(
        COUNTROWS(VALUES( Table1[Customer Name] ) ),  
           FILTER(
                VALUES(Table1[Customer Name]), 
                    Table1[Customer Name] <>
                    CALCULATETABLE ( VALUES( Table1[Customer Name] ),
                        FILTER( all(Table1[date]), Table1[Date] =
                            DATEADD( VALUES(Table1[Date]) , -1, MONTH)
                        ), ALL(Table1[Month]) 
                    )         
                )

            )            
        )
    )

It is normally assumed that we pass a single value when building a FILTER condition; however, ‘except’ conditions can be generated by comparing two filtered tables as well. The result is the following:

image

Notice that [Returning Customers] + [Acquired Customers] = [Total Customers].  So why not just subtract the [Returning Customers] measure from the total count of distinct customers to arrive at the [Acquired Customers] value? I would say that most of the time that is what you should do. However, what if we want to see exactly who those new customers were?  The expression above can be translated into a DAX query, and whether we use SSMS, DAX Studio or a reporting tool like SSRS, we can then visualize the exact customers that made up that count:

image 

resultset viewed in DAX Studio
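
For reference, such a query could look like the sketch below. It lists the customers present in one month but not in the previous one; the two hard-coded dates are placeholders for the months being compared, and it uses CONTAINS instead of the table-comparison trick used in the measure. Reversing the two dates would list the lost customers instead:

EVALUATE
FILTER(
    CALCULATETABLE( VALUES( Table1[Customer Name] ),
        Table1[Date] = DATE( 2012, 3, 1 ) ),
    NOT CONTAINS(
        CALCULATETABLE( VALUES( Table1[Customer Name] ),
            Table1[Date] = DATE( 2012, 2, 1 ) ),
        Table1[Customer Name], Table1[Customer Name]
    )
)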

Now let’s solve the final question.  For this we use the same technique as the one outlined above, but reversing the parameters:  we want to know which customers had sales last month except the ones that purchased this month. This will give us the count of lost customers:

Lost Customers:=IF(
    HASONEVALUE( Table1[Date] )   ,
    COUNTROWS ( FILTER ( CALCULATETABLE ( VALUES( Table1[Customer Name] ),
                FILTER( ALL(Table1[date]), Table1[Date] =
                DATEADD( VALUES( Table1[Date] ), -1, MONTH )
                ), ALL(Table1[Month]) 
                ) ,
            Table1[Customer Name] <>            
            CALCULATETABLE( VALUES ( Table1[Customer Name]  ) )           
        )
  )
)

The result is the following:

image

Here, again, we can view the exact customers we lost through a modified version of the measure above applied as a DAX query, which could serve as a drilldown view of the totals seen above:

image

resultset viewed in DAX Studio