During the preparation for delivering a Power BI training session for a client, I was looking at the Tabular Data Model that was their data source, and I was struggling with a long running query. It was taking 1 minute and 24 seconds to return the values after chewing through 2 million rows of data. So I had a go at optimising it… and got it down to 6 seconds.
Wow! When the query ran in that short of time, I went ‘No way… that’s not right, what have I done wrong?’, but no, after checking it was right.
Let’s start at the beginning…
Number of Transactions:= CALCULATE ( DISTINCTCOUNT('Transactions'[TransactionKey]) , FILTER('Transactions','Transactions'[Valid From Date] <= LASTDATE('Transaction Calendar'[Calendar Date])) , FILTER('Transactions'[Valid To Date] >= LASTDATE('Transaction Calendar'[Calendar Date])) )
So here is the offending bit of code, basically doing a distinct count of transactions. In this case a transaction has a start and end date when it is valid, and we want to see the total number of transactions between a specific date range, set by a slicer/visual on the Power BI report.
So after the base line test, it took 1m24s to return the values. Once that baseline was established, I could see the improvement (or not) of any changes that I made.
So after looking through the Data Model, I noticed that the formatting of the date on the data table was set as dd/MM/yyyy and on the Transaction table it was set as custom. I set those to the same dd/MM/yyyy formats and reran the query.
This change dropped the time from 1m24s to 1m14s as it doesn’t have to covert formats between the two key columns in the background.
Next change I made was moving the separate filter statements into one by using ‘&&’.
Number of Transactions:= CALCULATE ( DISTINCTCOUNT('Transactions'[TransactionKey]) , FILTER('Transactions','Transactions'[Valid From Date] <=LASTDATE('Transaction Calendar'[Calendar Date]) && 'Transactions'[Valid To Date] >=LASTDATE('Transaction Calendar'[Calendar Date])) )
A small change, that took it from 1m14s to 1m4s, saving 10 more seconds. OK with a few small changes we’ve found a saving of 20 seconds, however the biggest bottleneck is the distinct count. If I change the DISTINCTCOUNT to a SUM, the Power BI Report returns values in mere seconds, but the totals don’t make sense in the context of the transactions.
So what to do about this bottle neck? Can we do DISTINCTCOUNT without doing DISTINCTCOUNT?
After hitting a search engine and consulting my worn copy of Power Pivot and Power BI by Rob Collie & Avichal Singh, yes there is another way of doing it.
SUMX comes to the rescue.
Number of Transactions:= CALCULATE ( SUMX( DISTINCT('Transactions'[TransactionKey]), 1) , FILTER('Transactions','Transactions'[Valid From Date] >=LASTDATE('Transaction Calendar'[Calendar Date]) && 'Transactions'[Valid To Date] <=LASTDATE('Transaction Calendar'[Calendar Date])) )
This change to SUMX is the big change in dropping the time down to less than 6 seconds.
How?
Well the SUMX and the rest of the ‘X’ functions SUMX, MINX, MAXX, AVERAGEX, COUNTX and COUNTAX are iterators. So it goes through each row in that table or table expression. In this case it is a table expression DISTINCT(‘Transactions'[TransactionKey]). Next for each of the rows in that returned table expression, it then evaluates the FILTER context. So in this case if the row is within the Valid From/To Date range, it returns a value of 1, then SUM’s all those 1’s up.
So it breaks down the DISTINCTCOUNT into distinct then count (well sum, but I hope you get what I mean).
Wow, massive performance boost! Time for a well earned cup of tea and a fancy biscuit, some thing with layers in it!
Wow, I have looked everywhere for a solution! This works great!
LikeLike
Wow, this is great! Thank you for sharing, this is definitely a great performance boost.
LikeLike