Azure Databricks and CosmosDB: Tips on playing nice together

DBrick love Cosmos

I’ve been working on a project and been using CosmosDB as a data source and extracting the data from it into Azure Databricks. Once there I apply some data and analytics magic to it. So this post is about some of the assumptions that I had about it, and some of the issues that I came across.

For setting up Databricks to get data from CosmosDB, the place to go is the Azure CosmosDB Spark connector site. I’m not going to go through installing it, as the read me and guidance on the GitHub does a good job and it is straight forward to do. There is also a tutorial here, that deals with ‘On time flight performance’, it covers how to connect, and then process data but doesn’t cover any of the issues you may have, as its using a nice, conformed data structure. In the real world, we have all sorts of weird design options, taken over the life time of a product and/or service.Read More »

Theme ggplot2 to Power BIs visual style

For a recent project, I’ve had to hit the R scripting and use the R Visuals to plug a gap in Power BI (PBI). Even though PBI is very capable, it does not have the full range of statistical formulas that Excel has. So I’ve had to build Linear Regression formulas in DAX, and also calculate in Power Query some coefficients using R. For the visuals, again I hit some limits with Power BI. I needed to use a measure as an axis, I also needed to show a polynomial trend line, so had to use the R visual and the ggplot2 library to display the data.

I’ve not used R much, I’ve been on a SQL Bits training day about it, and one of my colleagues, was quite good at it (However they have since moved on), so it was a nice move out of my comfort zone and I get to learn something too!

Note: If you want to follow this blog post you’ll need Microsoft R Open and install the ggplot2 library. Also useful is the ggplot2 reference web site

In this example I’ve started with a blank PBI file, and used the ‘Enter Data’ function to create a column with single value in it. You don’t need to do this if you are using you own data, I just need something to drag into the R visual to use as a data set. I’ll actually be using one of the inbuilt R data sets for the visual. You can download the example PBI file from my GitHub

So lets start with the basic set up in the R Visual.

library(ggplot2)

#Base Chart uses the iris dataset
chart <-    ggplot(iris, aes(x = Petal.Width, y = Sepal.Length)) + 
            geom_point() + 
            stat_smooth()      

#display
chart

Which renders the following chart. (Note the rounded corners in the visual, thanks to the Feb PBI Desktop update)

Lets break this down. First it calls ggplot, using the ‘iris’ data set and assigns the Petal.Width and Sepal.Length to the relevant axis.

ggplot(iris, aes(x = Petal.Width, y = Sepal.Length))

Plots the points on the chart, don’t miss this bit out, I did, and could not understand why I wasn’t seeing any data

geom_point()

Add the trend line and the shading.

stat_smooth()

So far so good. But it does not fit the Power BI style, and will look a bit out of place along side the base PBI ones. However we can fix that. First we are going to add two variables called BaseColour and LightGrey and assign them a hex colour value so they can be used without having to recall the hex values. So the base code will look like:

library(ggplot2)

#BaseGrey
BaseColour = "#777777"
LightGrey = "#01B8AA"

#Base Chart
chart <-    ggplot(iris, aes(x = Petal.Width, y = Sepal.Length)) + 
            geom_point() + 
            stat_smooth()      

#display
chart

Nothing should change in the visual. So next I’m going to update the stat_smooth function, to remove the shaded area and change the colour of the line.

stat_smooth(col = LightGrey, se=FALSE)

‘col’ assigns the colour, ‘se’ removes the shaded area.

For the next set of updates to the visual, we are going to update the theme setting, specifically the following:

  • Axis Text
  • Axis Ticks
  • Panel Grid
  • Panel Background

So lets start with the text. if you poke around the PBI formatting settings you’ll come across the default font family, size and colour in the standard PBI visuals. So next we are going to add the following:

First set the ‘axis.text’ elements to the size, font and use the variable ‘BaseColour’

axis.text = element_text(size=11, family=”Segoe UI”, colour=BaseColour)

remove the axis ticks with

axis.ticks = element_blank()

and set the axis text, setting the colour using the BaseColour variable

axis.title = element_text(size=11, family=”Segoe UI”, colour=BaseColour)

but will need to wrap it up in the ‘theme’ stuff, so the code now looks like:

library(ggplot2)

#BaseGrey
BaseColour = "#777777"
LightGrey = "#01B8AA"

chart <-    ggplot(iris, aes(x = Petal.Width, y = Sepal.Length)) + 
            geom_point() + 
            stat_smooth(col = LightGrey, se=FALSE) 

#Build theme
chart <- chart + theme(      axis.text = element_text(size=11, family="Segoe UI", colour=BaseColour)
                         ,   axis.ticks = element_blank()
                         ,   axis.title = element_text(size=11, family="Segoe UI", colour=BaseColour)     
                        )

#display
chart

Which hopefully should be getting close to the PBI look:

So just the grid line and backdrop to sort out, which will be added to the theme set up as:

Set the grid for the ‘y’ axis.

panel.grid.major.y = element_line( size=.1, color=BaseColour )

Set the grid for the ‘x’ axis, basically get rid of it using the element_blank()

panel.grid.major.x = element_blank()

And set the backdrop to blank as well

panel.background = element_blank()

which should mean that the code should now look like:

library(ggplot2)

#BaseGrey
BaseColour = "#777777"
LightGrey = "#01B8AA"

chart <-    ggplot(iris, aes(x = Petal.Width, y = Sepal.Length)) + 
            geom_point() + 
            stat_smooth(col = LightGrey, se=FALSE) 

#Build theme
chart <- chart + theme(      axis.text = element_text(size=11, family="Segoe UI", colour=BaseColour)
                         ,   axis.ticks = element_blank()
                         ,   axis.title = element_text(size=11, family="Segoe UI", colour=BaseColour)     
                         ,   panel.grid.major.y = element_line( size=.1, color=BaseColour )  
                         ,   panel.grid.major.x = element_blank()   
                         ,   panel.background = element_blank()  
                        )

#display
chart

And it should match the look of the default Power BI style.

So with a few bits of code added to the theme, you can change the look of an R visual, to match the default PBI theme, or what every you want it to look like. I’m not a R expert in any measure, and there maybe a better way of doing this, but its got me started, and hopefully you too.

Update: The BBC released a theme for R, along with a cook book on how to do visuals, I may use this as a base for my next project.

Power Query – Combining Columns in M

In my Power BI courses I always recommend some books and sites that will help them learn DAX or M. Most of the time i’m presenting to new users to Power BI, and having to get them to move from the Excel muscle memory that they have and also to show them that sometimes the code-less approach is not always the best way.

One example presented itself the other day. A user was combining some Excel sheets and wanted to create a rough dimension based on the values in one of the columns, then use it as a filter or axis. They hit the internet and came across a few post on how to do this. A few recommend this approach:

1 – Append the two tables together as new query
2 – Remove the columns you don’t need
3 – Remove duplicates

Well that is sort of fine, as you can’t specify a single column in the table append, and it can all be done in the interface. However is has a bit of overhead in the 1 and 2 stages, so can you do it so you only reference the relevant columns and remove some of the overhead. The answer is ‘Yes’, but you need to code it in M. Using the Blank Query function i created this M code:

let
   Source = Table.Combine(
                            {
                                Table.SelectColumns(#"table name 1",{"column name"})
                            ,   Table.SelectColumns(#"table name 2",{"column name"})
                            ,   Table.SelectColumns(#"table name 3",{"column name"})
                            }
                        ),
    #"Removed Duplicates" = Table.Distinct(Source)
in
    #"Removed Duplicates"

So lets look at the M code behind it.
So rather that use a query as a reference, you can declare the table and column that you want with the following

 Table.SelectColumns(#"table name 1",{"column name"})

You wrap these up with a table combine to a SQL Union the columns together using the following

Table.Combine

than run a remove duplicates to do a SQL Union All. You don’t have to duplicate the whole tables/queries just to select the data, then trim it down to the columns that you want. As far as I can see you can’t do this in the interface, it only via M.

Power BI – Parameters & Data Sources

This blog post comes from a question raised at a Power BI training session, and deals with updating data source information, for example moving from development to production, or folder or file locations moving. The question was asked as I was showing the attendees the ‘Load from Folder’ function in Power BI, so you can dump monthly files into a folders and load them together. The question was  ‘What if the location changes? do you have to rebuild the report?’. I mentioned that you can edit the M code to change the location, but the users was a bit worried that it was a bit to ‘techy’ to do that, so  I suggested a parameter that you can use. They where more than happy with that approach it  turns out.

So how to use parameters? Well lets fire up Power BI and enter Power Query, but clicking on ‘Edit Queries’

And then select the New Parameter option, which should bring up the following…

Now, there are a few options here, you can have a list of values, or items based on a query, but for this example, we just need the basic version.

Type = Text
Current Values = The folder location that we are going to use to pull the files in

Click OK and how do we use it when loading a folder?

Select the folder import then select the little drop down for the folder path.

Select the parameter (Or start the process from there, by creating a new one).

Click OK, and away it goes as if you typed in the folder location normally.

So if you need to change the location, you don’t have to edit the M code, just update the parameter, by clicking on the manage parameter selection, where we created the new parameter. Nice and easy, well I think so.

So parameters can be used in a number of data sources, for example SQL Server connections

So switching around DEV, UAT and Live servers should be a slightly less painful process, than editing the M code.

Power BI – How to clear all slicers

ClearAllBookMark

I was delivering my standard Power BI training course, and there was a question from one of the users which was ‘Is there a way of clearing all the slicers you have selected, so if you are doing something you can go back to the start’, my answer was ‘No… wait, yes… you could use a bookmark to do it’

So how to clear all slicers? First of all in Power BI Desktop you need to go to view and add the Bookmarks Pane. Then make sure that all the slicers are clear, or in the default state that you want them. In this case I made sure that everything was deselected. Once that is done, Click ‘Add’ on the bookmarks pane, that will create the Bookmark, I’ve renamed mine to ‘Default’

BookmarksPane

Next, add a button to your report, and allocate the ‘Default’ Bookmark to the Action setting. I’ve selected the ‘Reset’ button and added the text ‘Click to reset’ on mine. Next select some items, once you are done and want to go back to the start, click on the button (Ctrl-Click in Power BI Desktop), that will apply the default Bookmark, and in my case clear all the slicers! Nice quick and easy solution.

 

 

 

 

Your Brain & Report Design Part 2 – Reports and Cognitive Burden

Not done a blog post for a few months, sorry work getting in the way a bit.

In Part One I talked about the Intuitive (System 1) and the Attentive (System 2) processes, and the separation and focus of these mental processes. Here is a quick reminder of the functions of both parts.

System 1 (Intuitive) System 2 (Attentive)
Fast Slow
Unconscious Conscious
Automatic Effortful
Everyday Decisions Complex Decisions
Error Prone Reliable

Cognitive Burden

In an number of texts about report design, the phase Cognitive Burden is used. With the outline of System 1 and 2, cognitive burden is impact of switching between the two functions and making it harder for the System 2 Attentive process to work out what is going on. So in some regards you can use bits of System 1 to help with System 2, and also focus System 2 get to the underlying understanding of the information a bit quicker.

The main outline of this approach is that of simplification… reduce distraction, reduce clutter, remove anything that is not needed, and this is done in a number of ways:

  • Layout
  • Ordering
  • Colours
  • Visualisation Selection

Also to consider will be ‘Framing the Answer’, how best to present a value, with less abstraction of the how you view it, (which sounds trickier than it is) which I’ll cover  in another post. So in this post I’ll be covering the first two, Layout and Ordering.

Layout – Arranging the space

How to leverage System 1 in layout of reports? Well you use some thing that is easy, intuitive and you are doing it right now… reading. The process of reading is from left to right, top to bottom. However this is approach is biased to the English language in my case, so if you are creating reports in Japanese, you’ll be doing it right to left.

When designing reports, I don’t tend to add the logo of the company that it is for. This is purely a personal approach, I don’t see the need to remind the reader of the company that they work for and it also takes up space on the report, that could be best served for information. So if I do have to add a logo or other graphics or even a title, I tend to put them in the top right. Some times if I can get away with it, the logo goes down the bottom right, nicely out of the way.

So in layout terms I would place what you want the viewer to see first on the left side or in the top left, so they get that first information hit, if it is a particularity important value or report that you want them to see first.

Stop and Search - Power BI

In the above case the three card visuals are the first thing I want the viewer to see, then the Map to supply context, then the rest of the charts.

The next technique I use is to layer the report into three sections based on the granularity of the data, so cards would be at the top, high level data, then charts, then table data.

Stop and Search 2 - Power BI

So as the user moves down the report there is a reveal of information, and you can see the break down of the details as you move though the report.

So, in the above reports, you have used the learned System 1 process of reading and used it on a report. Now you can layout a report, next I’ll be talking about the importance of selecting the right visual.

 

Your Brain & Report Design Part 1 – The Attentive vs the Intuitive Process

image11I do a number of Power BI workshops, were over a few days I guide people through the features of Power BI, what it is good at, what its limitations are, and after a training exercise on a basic data set to familiarise the clients with Power BI Desktop, we get started on report building with their own data. In that phase of the workshop, it is always interesting to see what they build, particularly how they lay it out and what colours they use.

After some, lets say, very interesting design choices, I started adding in a section into the workshop slides on report design best practice, mainly based on the guidance in the book ‘Information Dashboard Design by Stephen Few‘ and ‘Storytelling with Data by Cole Knaflic‘. In these books, and a few others, it talks about the ‘attentive process’, and how by incorrectly choosing visuals you make the information presented hidden, obscured or you just can’t see the pattern.Read More »