Denormalizing tables in Power Query

Given that a great number of Power Query data load and transform scripts will end up populating an Excel Data Model, it is important to keep in mind that a dimensional approach is preferred when shaping data.  One of the steps we want to follow, whenever possible, is denormalizing tables so that they more properly reflect logical entities.

Using the Power Query UI, it is easy enough to load tables and merge them.  Unfortunately, relying solely on the UI can generate a rather inefficient process involving multiple M scripts.

Let’s say, for example, that we want to combine the Adventure Works tables DimProduct, DimProductSubcategory, and DimProductCategory to create a consolidated Product table containing a few product attributes as well as the subcategory and category names.

If you start querying table by table, you end up with three workbook queries.  As we are not yet ready to load into the Data Model, I have deselected that option.

We still need two additional scripts to denormalize.  One will consolidate DimProduct into DimProductSubcategory (we can call the output DimProductJoinDimProductSubcategory).  By selecting the Merge option in the toolbar, we are presented with a window that allows us to merge two resultsets by specifying a matching column.  In this case, we start by merging DimProduct with DimProductSubcategory via the ProductSubcategoryKey.
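
For reference, the Merge dialog records this operation as an M query of its own.  Below is a sketch of the kind of script it generates (the step names, the new-column name, and the join kind are illustrative and may vary by Power Query version):

let
    /* nest the matching DimProductSubcategory rows into a new column */
    Source = Table.NestedJoin(
        DimProduct, {"ProductSubcategoryKey"},
        DimProductSubcategory, {"ProductSubcategoryKey"},
        "NewColumn", JoinKind.LeftOuter),

    /* flatten the nested column into regular columns */
    Expanded = Table.ExpandTableColumn(
        Source, "NewColumn",
        {"EnglishProductSubcategoryName", "ProductCategoryKey"})
in
    Expanded

Here DimProduct and DimProductSubcategory refer to the two existing workbook queries by name.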

Unfortunately, as this window only allows us to merge two datasets at once, we have to take one more step in order to merge DimProductJoinDimProductSubcategory with DimProductCategory and get all the data we need in one table.

As the result of this process contains columns from all three tables, we must remove the columns we don’t need.
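
In M terms, the UI records this removal as a Table.RemoveColumns step over the merged result; the inverse, Table.SelectColumns, names only the columns we keep.  A sketch contrasting the two (the merged-query name and the drop list are illustrative):

let
    /* output of the two merges above; the name is illustrative */
    Merged = DimProductJoinDimProductSubcategoryJoinDimProductCategory,

    /* UI style: drop the unwanted columns (list abbreviated) */
    Removed = Table.RemoveColumns(Merged,
        {"ProductSubcategoryKey", "ProductCategoryKey"}),

    /* keep-list style: name exactly the columns we want */
    Kept = Table.SelectColumns(Merged,
        {"ProductKey", "EnglishProductName", "EnglishProductSubcategoryName",
         "EnglishProductCategoryName", "Color", "ModelName"})
in
    Kept

M evaluates lazily, so only the step returned after in is computed; the unused Removed step is shown purely for comparison.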

Finally, we end up with the table we were looking for: “Product”.

Though it was fairly easy to do, it’s a relatively inefficient process that requires refreshing four intermediate queries that add little to no value as independent queries.

Instead, we are better off performing the denormalization in a single M script, which runs more efficiently because it:

  • requires one query refresh instead of five.
  • selects only the required columns up front, instead of selecting all columns and then removing the unwanted ones.

To do so, we can start instead with a blank query and, on it, replicate the steps followed before, this time using a systematic, step-by-step approach that consolidates everything into a single script and does not require maintaining separate queries:

let
    Source = Sql.Database("localhost", "AdventureWorksDW2012"),
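
    /* load DimProduct, keeping only the columns needed for the final table plus its join key */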
    DimProduct = Source{[Name="DimProduct"]}[Data],
    DimProduct.Result = Table.SelectColumns (DimProduct,
                {"ProductKey",
                "EnglishProductName",
                "Color",
                "ModelName",
                "ProductSubcategoryKey" } ),

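    /* load DimProductSubcategory: the subcategory name plus both join keys */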
    DimProductSubcategory = Source{[Name="DimProductSubcategory"]}[Data],
    DimProductSubcategory.Result = Table.SelectColumns ( DimProductSubcategory,
                {"EnglishProductSubcategoryName",
                "ProductSubcategoryKey",
                "ProductCategoryKey"}),

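    /* load DimProductCategory: the category name plus its join key */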
    DimProductCategory = Source{[Name="DimProductCategory"]}[Data],
    DimProductCategory.Result = Table.SelectColumns ( DimProductCategory,
                {"EnglishProductCategoryName",
                "ProductCategoryKey"}),

    /* Join DimProduct to DimProductSubcategory */
    DimProductJoinDimProductSubcategory = Table.Join(DimProduct.Result,
               "ProductSubcategoryKey", DimProductSubcategory.Result , "ProductSubcategoryKey" ) ,  

    /* Join Product and Subcategories to Categories */
    Product = Table.Join( DimProductJoinDimProductSubcategory, "ProductCategoryKey",
               DimProductCategory.Result, "ProductCategoryKey" ),

    /* Select Final Result */
    Product.Result = Table.SelectColumns (Product, {
        "ProductKey",
        "EnglishProductName",
        "EnglishProductSubcategoryName",
        "EnglishProductCategoryName",
        "Color",
        "ModelName" } )  

in
    Product.Result

The result is highly efficient and more easily maintained.  Another way in which this can be done, if there are existing relationships among the tables, is by ‘expanding columns’.  These are columns that are loaded automatically and represent a link to another table that has a relationship with the table we are loading; they appear in the UI as extra columns in the query preview.

Following a similar approach to the one outlined before, we can devise a single M script to do all the work.  This time, we will expand the columns without the need to explicitly join tables, achieving the same denormalized result:

let
    Source = Sql.Database("localhost", "AdventureWorksDW2012"),

    /* load DimProduct */
    qryProd = Source{[Name="DimProduct"]}[Data],

    /* retrieve Subcategory Name */
    qrySubCat = Table.ExpandRecordColumn(
        qryProd,
        "DimProductSubcategory",
        {"EnglishProductSubcategoryName", "DimProductCategory" }),

    /* retrieve Category Name */
    qryCat = Table.ExpandRecordColumn(
        qrySubCat,
        "DimProductCategory",
        {"EnglishProductCategoryName"}),

    /* select columns of interest */
    Product = Table.SelectColumns (qryCat , {
        "ProductKey",
        "EnglishProductName",
        "EnglishProductSubcategoryName",
        "EnglishProductCategoryName",
        "Color",
        "ModelName" } )   

in
    Product

The Table.ExpandRecordColumn function allows you to navigate pre-established relationships.  Notice that the first parameter is the table we are working with and the second parameter is the column we will be expanding.  If we are on the ‘many’ side of the relationship, we will see a ‘Value’ link, which indicates there is a single value to be returned from the lookup table; otherwise, we will see a ‘Table’ link.
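
For the opposite direction, when the relationship column holds a nested table (the ‘Table’ case), Table.ExpandTableColumn is the counterpart function.  A minimal sketch, assuming the same source and that the navigation column carries the name of the related table:

let
    Source = Sql.Database("localhost", "AdventureWorksDW2012"),

    /* load DimProductCategory; each row carries a nested table of its subcategories */
    qryCat = Source{[Name="DimProductCategory"]}[Data],

    /* expand the nested table: one output row per category/subcategory pair */
    qrySubCat = Table.ExpandTableColumn(
        qryCat,
        "DimProductSubcategory",
        {"EnglishProductSubcategoryName"}),

    /* keep only the columns of interest */
    Result = Table.SelectColumns(qrySubCat,
        {"EnglishProductCategoryName", "EnglishProductSubcategoryName"})
in
    Result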

Finally, notice that when expanding the subcategory values we not only retrieve EnglishProductSubcategoryName (a column we want in the final resultset), but also a column called ‘DimProductCategory’.  This is a column with a reference to the DimProductCategory table, which is required to further expand the resultset, allowing us to jump one more step and retrieve EnglishProductCategoryName.

5 Responses to Denormalizing tables in Power Query

  1. Pingback: Power Query for Excel – on the rise | Microsoft Excel and Access Experts Blog

  2. Colin Banfield says:

    Hi Javier,

    “Given that a great number of Power Query data load and transform scripts will end up populating an Excel Data Model…”

    The major issue is that currently, any Data Model loaded from PQ data is permanently confined to the desktop, i.e. cannot be published and refreshed in SharePoint. I consider this restriction a significant limitation, as the model cannot be easily shared. I’m interested in your thoughts on this issue.

    Given that the major goal of PQ is to clean and transform ‘source data’ (and not pristine data from a data warehouse), would it not be better to illustrate the denormalization process using the AdventureWorks database? Then the second method you describe (expanding table columns) would make use of the native PQ UI and thus be a straightforward self-service solution.

    AdventureWorks DW is nothing more than a bad design with an unnecessary snowflake product dimension. You’re entirely correct that, if used as a data source, this snowflake should be denormalized.

    • javierguillen says:

      Hi Colin! Thanks for your comments.

      Yes, PQ based models are currently bound to the desktop – except Power BI models, of course. And given that I have heard all this functionality will make it back to on-prem environments at some point, I think it is worth studying now…

      You are correct that a ‘true’ case for denormalization would use AdventureWorks instead of AdventureWorks DW. However, as the first one is a bit less known, I opted for the second option. I hope the point of the technique explained still comes through.

      Expanding columns can indeed be done using the native PQ UI, but there is still a difference from the M approach outlined. By carefully selecting the columns to be ultimately displayed in the dataset (through Table.SelectColumns), performance can improve significantly. In comparison, the UI prefers selecting all columns (reflecting all fields plus relationship metadata), with only the option of subsequently removing the undesired ones, which in my opinion leads to a non-optimal process.

  3. Colin Banfield says:

    I think that the point of the technique explained still comes through. I gather that the UI approach was expressly designed for self-service users with no background in table joins and the like. It’s unclear, though, whether these users would know to eliminate unnecessary columns without some guidance, as provided in your post.

    • javierguillen says:

      Indeed; both DAX and M are amazing languages, but it remains to be seen whether they will be fully adopted by the non-technical analyst community. I am an optimist though! :)
