Converting a Flat Table to a Good Data Model in Power Query
When loading a wide Excel table into Power BI, we end up with a suboptimal data model. What can we do to create a good data model? What even is a "good" data model? Let's dig into it.

Introduction
Just another day at the office: a client called me to fix something in his Power BI report. I looked at the data and found one wide table with 30, 40, or more columns.
I asked the natural question: "What is the source for this table?" The answer was: "Excel; what else?"
"Of course", I thought.
My next question: "Can we build a good Data model out of it?"
My client: "Why?"
And here we are.
Ideally, I would import the source file into a relational database and convert the data into a proper data model. Unfortunately, my clients rarely want to pay for something that, at first glance, doesn't benefit them.
But why do I want a good data model? A flat table works well, doesn't it?
Again, at first glance, yes. There is no problem with a few thousand rows.
But as soon as the amount of data increases, problems can build up.
Here is the short version of an answer to the "Why?" question:
So, why a good data model?
There are many reasons to want a good data model.
The two short answers are efficiency and usability.
When we separate the types of data into distinct tables and connect them with relationships, Power BI can work more efficiently.
Moreover, it is possible to reduce the size of the Power BI file by removing redundancies.
I strongly recommend reading the SQLBI article on this topic. You can find the link in the References section below to get a thorough answer to this question.
Second, someone who consumes the data model can find all columns (a.k.a. attributes) grouped in separate tables instead of searching for them in one long, alphabetically sorted list.
So, now that we know why, the next question is, what is a good Data model?
The answer is: A Star Schema.
What the heck is a Star Schema?
You can jump to the next section if you already know what a Star Schema is.
My original Data model is the following:

The central table is the large one in the middle of the diagram, named "Azure_UsageDetails". Let's call it the "original table".
The other two tables are Date tables used for reporting over time.
When you look closely at the original table, you can find some interesting columns, like:
- BillingProfile
- MeterName
- SubscriptionName
What these columns have in common is that, while the table has around 55'000 rows, they contain only a small number of distinct values.
This means that the Cardinality is low.
Moreover, these columns describe my data. They do not contain measured values, like "Quantity" or "UnitPrice".
The goal is to move these columns into separate tables, called Dimensions, to get shorter tables containing only the distinct values, or combinations of values, from these columns.
Look at the following diagram:

Now, you can understand why it is called a Star Schema.
We could call it a Squid Schema, but nobody would understand us.
As you can see, each Dimension table has an ID column connected to the corresponding ID column in the central table.
The ID columns should always be of the integer data type, never text columns.
The table at the center of the star schema is called a "Fact"-table. It only contains values for each event, transaction, or data point.
Let's start
OK, now we need to perform some preparation steps:
- Find columns with a low cardinality.
- Find columns that should be grouped.
- Define names for each Dimension table.
- Replace all empty cells.
- Duplicate the table with the data.
First, open Power Query.
Second, we enable the profiling of the data:

Power Query shows the Profiling only for the first 1000 rows. It can be a good idea to enable the profiling of the entire data set.
On the bottom left corner, click on the text "Column profiling based on top 1000 rows" and switch to the option "Column profiling based on entire data set":

Depending on the amount of data, it will need some time to load the whole data set and calculate the profiling.
When I look at my data, I find the first three candidates here:

Out of these, I can create three Dimensions:
- BillingProfiles
- Subscriptions
- Meters
As you can see, I give plural names to the Dimension tables, each containing one or more instances of that entity.
Now, look at the columns starting with "Meter":

I notice two crucial details:
- The Cardinality of the four columns is different.
- I have a lot of empty rows for the MeterRegion column (76 %).
Based on the Cardinality (Number of distinct values), I reckon I can build a Hierarchy for the Meters.
- Top-Level: Meter Region
- Second-Level: Meter Category
- Third-Level: Meter Subcategory
- Leaf: Meter Name
The Fact table will have a relationship to the Meter Name column. More precisely, the ID column will be created based on the leaf level, but from the distinct combination of all four columns. The reason for the combination is that a Meter Name might appear multiple times, assigned to different Meter Subcategories.
Second, we must replace the empty rows in the Meter Region column with something meaningful to avoid having a hierarchy node at the top level called (Null).
To achieve this, I right-click on the column name and click on "Replace Values".
In the dialog for this function, I enter null as the value to find and "Empty" as the replacement value:

The result is the following, with no empty rows for the MeterRegion column:

Next, I have to go through all columns intended to be Dimension columns and replace the blank values.
I find these columns by examining the "Empty" line in the Profiling Area (see the red-marked lines).
You might use another word instead of "Empty" as the replacement text for the empty rows. For example, "None".
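In the formula bar, the generated Replace Values step looks roughly like the following sketch (the step and column names depend on your query; adapt them to your model):

```
// Replace null with "Empty" in the MeterRegion column only.
// #"Previous Step" stands for whatever step precedes this one in your query.
#"Replaced Value" = Table.ReplaceValue(
    #"Previous Step",
    null,                    // value to find
    "Empty",                 // replacement value
    Replacer.ReplaceValue,   // exact-value replacer
    {"MeterRegion"}          // columns to search
)
```

Note that replacing null requires Replacer.ReplaceValue; Replacer.ReplaceText only works on text content.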
The next step is needed to avoid duplicate column names when building the Dimension tables:
I rename all columns ending with "Id" so that they end with "GUID" instead (except the InvoiceID, as this column will stay where it is).
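Instead of renaming each column by hand, a single Table.TransformColumnNames step can do this. A sketch, assuming the columns all literally end with "Id" and the invoice column is named "InvoiceId" (adjust the condition to your exact column names):

```
// Rename every column ending in "Id" to end in "GUID" instead,
// leaving the invoice column untouched.
#"Renamed To GUID" = Table.TransformColumnNames(
    #"Previous Step",
    each if Text.EndsWith(_, "Id") and _ <> "InvoiceId"
         then Text.Start(_, Text.Length(_) - 2) & "GUID"
         else _
)
```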
Next, as my data is cleaned up, I can Duplicate my table and start building my Dimension tables.
I right-click on the Azure_UsageDetail table and click on Duplicate:

Now, I rename the table to Azure_UsageDetail_Copy.
But I don't need this copy of the data in my data model. So, I turn off the loading of the table:

With this option, I can use this table as the source for all subsequent operations without loading it into Power BI.
Now, I can use the Copy of the original table to build my Dimension tables.
A simple Dimension table
The first Dimension will be for the Subscriptions.
I need the following steps:
- Create the Table referencing the copied table.
- Remove all other columns.
- Remove all Duplicates.
- Add an ID column.
After these steps, I must add the column to the original table:
- Merge both tables.
- Expand the ID column.
- Remove the column for the Subscription from the original table.
Right-click on the copied table and click on "Reference" (see the References section below for the difference between duplicating and referencing a table).
Now, I rename the table to Subscriptions (double-click on the table name).
I can use either the "Choose columns" or the "Remove other Columns" feature to remove all but the two Subscription columns.
I select both columns (with Shift-click) and right-click:

The next step is to remove all duplicates:

Now, I have a small table with two columns and two rows.
To add an ID column, I use the Index feature:

Then, I rename the newly created "Index" column to "SubscriptionId".
Now, the Subscriptions Dimension table is completed.
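The whole Subscriptions query then consists of only four generated steps. A sketch of the resulting M code, assuming the two columns are named SubscriptionName and SubscriptionGUID:

```
let
    // Reference (not duplicate) the disabled copy of the original table
    Source = Azure_UsageDetail_Copy,
    // Keep only the columns that describe a subscription
    #"Removed Other Columns" = Table.SelectColumns(Source, {"SubscriptionName", "SubscriptionGUID"}),
    // One row per distinct combination of the remaining columns
    #"Removed Duplicates" = Table.Distinct(#"Removed Other Columns"),
    // Surrogate key, starting at 1
    #"Added Index" = Table.AddIndexColumn(#"Removed Duplicates", "SubscriptionId", 1, 1, Int64.Type)
in
    #"Added Index"
```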

After this, I must replace the existing "Subscription" columns in the original table with the new ID column:
I use the "Merge Queries" feature to join the two tables together:

It is essential to select both columns to ensure that the correct rows are assigned.
You can check the proper assignment in the information line at the bottom: When both numbers are identical, all is OK.
To get the Id column, I must expand the merged table:

As a result, I get an additional column with the matching SubscriptionId.
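In M, the merge-expand-remove sequence looks roughly like this (column names as above; adapt them to your model):

```
// Left-outer join the original table to the Subscriptions dimension on both columns
#"Merged Queries" = Table.NestedJoin(
    #"Previous Step",
    {"SubscriptionName", "SubscriptionGUID"},
    Subscriptions,
    {"SubscriptionName", "SubscriptionGUID"},
    "Subscriptions",
    JoinKind.LeftOuter
),
// Pull only the surrogate key out of the nested table
#"Expanded Subscriptions" = Table.ExpandTableColumn(#"Merged Queries", "Subscriptions", {"SubscriptionId"}),
// The descriptive columns now live in the dimension, so drop them here
#"Removed Columns" = Table.RemoveColumns(#"Expanded Subscriptions", {"SubscriptionName", "SubscriptionGUID"})
```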
I repeat the same steps for all "simple" Dimension tables.
But sometimes, I need to add more steps when building a Dimension table.
Something more complex
During the conversion of the Data model, I noticed that a bunch of columns belong together:
- All "Billing" columns
- ChargeType
- Frequency
- PublisherType
- PricingModel
- InvoiceNo
All these columns are tied to the invoicing topic.
Therefore, I decided to group them into one Dimension table.
The first steps are the same as above:
- Create a referenced table.
- Remove all unneeded columns.
- Add the Index column for the ID.
- Rename the columns to be more user-friendly.
But then I noticed that I can use the Billing Period column to extract the Billing Month and then remove the original column. This will benefit my reporting.
So, I added a Custom column with the following formula:
Date.MonthName([Billing Period]) & "-" & Text.From(Date.Year([Billing Period]))
This is the result:

To ensure that this column can be sorted according to the month number, I added one further Custom column:
(Date.Year([Billing Period])*100) + Date.Month([Billing Period])
I can use the Sort by Column feature in Power BI to sort the Billing Month column by this new BillingMonthNum column.
The next step is to set the correct Data type (Text for the Billing Month and "Whole Number" for BillingMonthNum column).
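Both custom columns can also be added with their types declared directly in Table.AddColumn, which saves the separate "Changed Type" step. A sketch:

```
// Human-readable month label, e.g. "January-2023"
#"Added Billing Month" = Table.AddColumn(
    #"Previous Step",
    "Billing Month",
    each Date.MonthName([Billing Period]) & "-" & Text.From(Date.Year([Billing Period])),
    type text
),
// Sortable key, e.g. 202301 — used for "Sort by Column" in Power BI
#"Added BillingMonthNum" = Table.AddColumn(
    #"Added Billing Month",
    "BillingMonthNum",
    each (Date.Year([Billing Period]) * 100) + Date.Month([Billing Period]),
    Int64.Type
)
```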
Another way to add a sorting column is through Ranking.
For example, look at the Invoice No column:

Imagine that we want to add a sorting mechanism to this column.
An Index column will not work, as we would get a continuous number regardless of the content.
But, as you can see, we have several rows with null as Invoice No.
Therefore, using the Index to sort this column in Power BI is impossible, as we would have the same content but with different numbers.
We can solve this by using the following expression for a new Custom column:
Table.AddRankColumn(#"Changed Type","InvoiceNoRank", "Invoice No",[RankKind=RankKind.Dense])
- "Changed Type" is the name of the previous step (the input table).
- "InvoiceNoRank" is the Name of the new column.
- "Invoice No" is the column over which the Ranking will be calculated.
- The last parameter [RankKind=RankKind.Dense] is the most important.
With [RankKind=RankKind.Dense], I tell the function that rows with the same invoice number must receive the same rank and that the numbering must be continuous, with no gaps.
You can get more details in the official documentation of the Table.AddRankColumn function.
The resulting column looks like this:

You can see that all rows with an empty Invoice No get the Ranking 1. The subsequent invoice numbers are 2, 3, 4, etc.
When merging this table to the fact table, I select all columns that can be used to identify each row uniquely.
In this case, using the columns Billing Period and Invoice No is enough.
Last Steps – Complete the Data model
Finally, I can complete the Data model.
But before I jump to Power BI, I remove all the obsolete columns from the Original table in Power Query and rename the Table to "FactAzureUsageDetails".
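The cleanup is again a single generated step: one Table.RemoveColumns call listing every descriptive column that now lives in a dimension table. A sketch with example column names (yours will differ):

```
// Drop the descriptive columns that were replaced by ID columns
#"Removed Obsolete Columns" = Table.RemoveColumns(
    #"Previous Step",
    {"SubscriptionName", "SubscriptionGUID", "MeterName", "MeterCategory", "MeterRegion"}
)
```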
After loading the data into Power BI, I must check the automatically created relationships to ensure all of them are between the ID columns. Sometimes, Power BI creates relationships between the name columns, which is suboptimal.
As I already created a report based on the old data model, I must fix all the visuals and measures that used the columns now moved to the dimension tables.
Final Result
This is the final Data model:

This is precisely what I wanted to achieve.
Interestingly, the new pbix file is slightly larger than the original: about 800 kB now versus 750 kB before. The raw data is around 20 MB in size.
I have made such a change before, and the resulting pbix file was way smaller than the original.
I think my data volume here is so small that the additional structure causes the pbix file to grow slightly.
Conclusion
The effort to modify my Data model was not small. But it was not nightmarishly high. In any case, it is worth exploring this approach to have a better solution in Power BI instead of having a table with dozens of columns, which are very hard to navigate.
Imagine you can group all columns for one topic into one Dimension table. Even when these columns are sorted alphabetically, you can easily find them. Compare this to when those columns are all in the original table.
In my opinion, this topic deserves more attention. I even see solutions in Power BI built by colleagues of mine using very wide tables. My first comment is always: "Why do you work with such a table? Why don't you have a Star Schema in your data model?"
And the answer is always: "Because the data comes in this form and I don't have time to modify the structure".
But as soon as performance issues start to appear, my first step is to analyze and optimize the Data model.
I recommend using DAX Studio with VertiPaq Analyzer (SQLBI Tools Page) to get Statistics about the Data model and find potential issues.
You can get a short introduction on how to use this combination here:
How to use Vertipaq Analyzer with DAX Studio for Power BI Model analysis – FourMoo | Power BI |…
Or search for "How to use VertiPaq Analyzer" in your favorite search engine.
I'm thinking about writing an article on this topic soon. Please let me know in the comments if you are interested in such a piece.

References
The Data is from my private Azure Subscription. I download the monthly consumption from the Azure Portal and import the data into an Azure SQL database.
An SQLBI article on why we should have a Star Schema as a Data model:
See here for the difference between the Duplicate and Reference feature in Power Query:
Reference vs Duplicate in Power BI; Power Query Back to Basics
You can support my work, which I do during my free time, through
https://buymeacoffee.com/salvatorecagliari
Or scan this QR Code:

Any support is greatly appreciated and helps me find more time to create more content for you.
Thank you a lot.