Nurturing Model Growth: Prototyping ETL using Excel sources

The BISM Tabular framework’s extensibility does not only imply a move from personal PowerPivot models to corporate Tabular models but also an ability to ‘branch out’ processes into more sophisticated layers of ETL and/or reporting.

Over the last months I have witnessed PowerPivot models that ‘take a life of their own’, and through their growing popularity become more sophisticated as they evolve. As their data load facilities develop further, they offer the ability to prototype ETL over a target in-memory schema, while testing assumptions regarding the appropriate data model to use.

I call these ‘seed models’ ::  that is, individual PowerPivot models that have potential for organic growth beyond its original intent – all the way into enterprise Semantic Models.  As they grow, they themselves can become the corporate standard or instead be assimilated into pre-existing Tabular models. 

image

The process of nurturing model growth (from individual to enterprise level) is a fascinating task.  This blog post explores one aspect of model expansion – from Excel files (a very common source for data analysts) to SSIS based ETL. 

It is interesting to note this expansion happens at the pace that makes most sense for the organization given its current BI maturity, and at every stage of the process there is tangible benefits to the business.

GROWTH STAGE #1:  SELF SERVICE BI BASED ON ‘DATA DUMPS’

Here you find Self-Service BI scenarios in which analysts utilize extracts from multiple systems to generate their consolidated reporting.

Assume the following datasets:

ProductList.xlsx

image

 

ProductSales.xlsx

image

 

As you can see, the second workbook (ProductSales.xlsx) contains data that can be mapped back to the first one (ProductList.xlsx).  However, it is not just a fact sales table, it also contains dimension data (Product Name and Model). 

In the PowerPivot/Tabular world (just as in traditional data warehousing) it is recommended to structure data conforming logical entities. By executing a SQL query over these two Excel files it is possible to generate the necessary lookups to develop a user friendly dimensional model.

To do this, it is necessary to create a connection to the files outside of the PowerPivot window (this enables us to execute SQL queries over Excel files).

Go to Data –> Connections and click ‘Add to the Data Model’. When the next window appears, click on ‘Browse for More’

image

Find ProductList.xlsx and load it into the model.  Now follow the process again for the ProductSales.xlsx file, but this time do not add it to the data model, simply add the connection.

image

In the PowerPivot window, go to the Design tab and select ‘Table Properties’.  You will notice a prompt asking for a password, which you can simply ignore by clicking OK:

image

This prompt is not ideal as it can get on the way of automated refreshes.  To remove it, it is necessary to change the connection string,

From:

Provider=Microsoft.ACE.OLEDB.12.0;User ID=Admin;Data Source=C:\blog\ProductList.xlsx;Mode=Share Deny Write;Extended Properties="HDR=YES;";Jet OLEDB:System database=;Jet OLEDB:Registry Path=;Jet OLEDB:Engine Type=37;Jet OLEDB:Database Locking Mode=0;Jet OLEDB:Global Partial Bulk Ops=2;Jet OLEDB:Global Bulk Transactions=1;Jet OLEDB:New Database Password=;Jet OLEDB:Create System Database=False;Jet OLEDB:Encrypt Database=False;Jet OLEDB:Don’t Copy Locale on Compact=False;Jet OLEDB:Compact Without Replica Repair=False;Jet OLEDB:SFP=False;Jet OLEDB:Support Complex Data=False;Jet OLEDB:Bypass UserInfo Validation=False;Jet OLEDB:Limited DB Caching=False;Jet OLEDB:Bypass ChoiceField Validation=False

 

To simply:

Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\blog\ProductList.xlsx;Extended Properties="Excel 12.0 Xml;HDR=YES"

Remember, in Excel 2013, once the connection has been modified in the PowerPivot window it is not possible to edit it (in Excel 2010 this is possible).  In other words, be sure to implement the change prior to doing any PowerPivot model work.

Warning:  Automated data refresh using Excel sources in PowerPivot for SharePoint is not currently supported (see here).  However, the refresh will work and under your discretion of ensuring controlled access to the source files it is possible to generate a solution that provides much value to the business. 

Now click on ‘Switch to Query Editor’:

image

By joining both files, we are able to mimic the type of queries we commonly use as part of ETL processes to define target dimension tables.  This process involve defining a ‘driving’ table, commonly associated with the master list of dimension attributes, which is left joined to other tables which may contain additional attributes (a common pattern in ETL development).

The SQL query to use is the following:

SELECT
[ProductList$].*,
[ProductSales$].[Product Name],
[ProductSales$].[Product Model]  
FROM ([ProductList$]
    LEFT OUTER JOIN `C:\blog\ProductSales.xlsx`.[ProductSales$] 
ON
    [ProductList$].[Product ID] = [ProductSales$].[Product ID])
WHERE
    (
    [ProductSales$].[As of Date] =
        (SELECT MAX([ProductSales$].[As of Date]) FROM `C:\blog\ProductSales.xlsx`.[ProductSales$] )
    )

The use of this query pattern gives the PowerPivot data model an important artifact: a lookup table. Generating valid lookup tables is a big part of Tabular development, as they will keep DAX expressions simple and prevent unexpected context propagation behaviors.

This new lookup table will represent an SCD Type 1 dimension.  I argue this is a good way to start proving business value, in many situations a predecessor to enabling historical tracking.

By loading the ProductSales.xlsx table, we now arrive at the following dimensional schema:

image

generated SCD Type 1 dimensional schema

With the following result:

image

 

GROWTH STAGE #2:  MOVING TO RELATIONAL SOURCES

Once the model is proven popular and useful to the business,  the BI team may decide to take the ETL prototype query and standardize it in a SQL view:

CREATE VIEW [dbo].[v_ProductList] AS
SELECT       
        ProductList.*,
        ProductSales.[Product Name],
        ProductSales.[Product Model]
FROM           
        ProductList LEFT OUTER JOIN
        ProductSales ON ProductList.[Product ID] = ProductSales.[Product ID]
WHERE       
    ProductSales.[As of Date] =
    (
        SELECT MAX(ProductSales.[As of Date])
            FROM ProductSales
    )
GO

Notice the pattern used would be very similar to the one used to create a valid dimension (lookup). In the model, we can now change the connection to point to this view and get rid of the Excel source dependency. To do this, we must change the connection used to populate the tables from a file source to a SQL Server source.

Go to ‘Existing Connections’j, select ProductList.

image

By clicking ‘Edit’, it is now possible to change the connection type by clicking ‘Build’.  When the ‘Data Link Properties’ window opens, change the provider to ‘SQL Server Native Client 11.0’.

image

At this point, you can connect to the SQL Server source and specify the database that contains the extracted datasets.

image

Once the connection is mapped back to SQL Server, we must change the query definition that makes up the ProductList. This can be done in using the ‘Table Properties’ option of the model.

image

Given the maturity of the implementation, it may be advantageous to upgrade the PowerPivot model to BISM Tabular, while also developing more sophisticated ETL.

 

GROWTH STAGE #3:  ENABLING HISTORICAL TRACKING

The business may decide it would be beneficial to enable historical attribute tracking to create more sophisticated reporting.   At this point, we can use a variation of the query we have discussed in order to initialize a target data warehouse dimension “DimProduct”.

SQL Server Integration Services could now fulfill that exact requirement, but producing the more complex transformations to record attribute changes over time. 

 

image

Once the SCD Type 2 transform has been defined, it is important to re-map the Tabular model to use surrogate instead of natural keys as part of the relationship.  This can be simply done in the PowerPivot or Tabular model design interface:

image

generated SCD Type 2 dimensional schema

The result is a model that grew from simple file sources to a more sophisticated environment leveraging appropriate data transformations.

image

The process described is not a theoretical one but is based on real scenarios I have seen evolving without much effort, as each step fuels the project further. In my opinion, this makes up for great “BI”, as nurturing model growth allows us to stay relevant by developing close to the business, and in turn ensures accuracy and strong sponsorship on the projects implemented.

Advertisements