The trouble with twins

Sadly there hasn’t been much love shown to Data Quality Services (DQS) in the last few releases of SQL Server, and I don’t think there will be in the coming SQL Server vNext release.

DQS is slow, it’s cumbersome, its interface is terrible, it does not scale well, and it takes a number of headache-inducing workarounds to get it to a level of decent performance.

However, it works, and it can work well when it does. But as with all things data, when it interacts with people it hurts. Some background to the project: DQS is matching people from separate data sources, none of which share a common identifier or reference, so we are running matching on the data. We can add synonyms, so that a name like David can also be matched as Dave, or John as Jon or Jonathan; if one system uses one form of the name, it will still match with the data from the other.
Each match comes with a level of confidence, expressed as a percentage; in fact the process uses not just the name but a number of other factors, such as date of birth and address. But one thing it breaks down on is a certain type of twin.
What sort? Twins with identical, or nearly identical, names.
Identically named twins, surely no one does that? Sorry, yes they do. I have encountered and confirmed (at least four sets in one area) twins given the same name, without even a middle name to differentiate them.
As for nearly identical names, again yes: for example Jane and Jade, only one letter apart. So when matching them DQS returns a high confidence score, in the high 90s. What to do? The confidence threshold can’t simply be changed, as it would then ignore or create people when it shouldn’t.
Ahhhh people!

I want to dislike something!

Update: Since this post was created back in 2014, Facebook have added some new ‘Like’ options

Businesses need data. Data becomes information, and from that decisions are made. Whether you make a good or a bad decision depends on the data that you have.
In November 2012 a particular brand got over 61 million ‘Likes’, but as high as that figure is, what information and data can we get from it, and can we make an informed decision from it?
Social media platforms can be an effective way of getting feedback, marketing and advertising, but like most tools they are only as effective as the way they are used, the assumptions that you make and the questions that you ask. We also have to remember that Twitter was founded and Facebook opened up to everyone only in the last six years, and the now ubiquitous ‘Like’ button has only been around since April 2010.
The effect of social media on marketing and user engagement has profoundly altered the landscape between the corporation/company and the consumer. Feedback can be virtually instantaneous and hard for the company to manage. This year the Olympic coverage by NBC in America was advertised as ‘as good as being there’, but it was decried by those watching it, with time delays, little live coverage and pre-recorded segments. The Twitter hashtag #nbcfail went viral, used to refer not just to the Olympics but to other programmes on its network, and its brand was tarnished. Also in 2012, the short film ‘Kony 2012’ went viral in a more positive way and raised awareness of the child soldiers in a militia led by Joseph Kony in parts of Uganda, Congo and Sudan.
There have been many high-profile instances of this sort of positive and negative viral feedback, made possible by the ease with which social networks let their users communicate, but that same ease of use is also their fundamental weakness.
A simple test also indicates something strangely revealing: actual engagement depends on the question, and people look for simple answers to simple questions. In a completely unscientific test, one of our employees posted the following two status updates within seconds of each other on their Facebook profile:

1: What shall we do about the gap between the rich and poor in the world today???
2: Someone just brought in Crispy Cream donuts into the office…Get in!!!!

Which got the first reply? Answer: 2 (within 5 minutes of posting)
Which got the most likes? Answer: 2
Which got the most comments? Answer: 2

But to be fair, is this a judgement of the subject matter or just their friends?
Using the ‘Like’ button as an example, it is far too easy for the user to click it, and it is not a good measure of user engagement. Charities in particular have noticed this: a campaign can get a large number of ‘Likes’, but when the donations are averaged out against them, the ratio can be as low as 10p per ‘Like’. Data from the ‘Just Giving’ web site shows that over a 17-day period only about 6% of visitors who came to the site from Facebook ‘Like’ links actually donated. It indicates that it is easy to push the ‘Like’ button, but not to engage. Using the ‘Kony 2012’ example, how many people today can say what happened after they clicked ‘Like’?
But back to the original statistic at the start… in November 2012 a particular brand got over 61 million ‘Likes’.
What information can that tell us? In this case, nothing of real value. You may have figured out that the 61 million ‘Likes’ refers to the 2012 United States Presidential election. It is only when you consider that Barack Obama got 61,170,405 votes (50.5% share) to Mitt Romney’s 58,163,977 (48% share) that you can actually form an informed opinion. In this case the population of America is very much polarised politically, and no matter who won, roughly half the people would not like the winner. In effect, 58,163,977 people pushed the ‘Dislike’ button for Obama.
It gets even harder when you look at the effectiveness of the marketing campaign. The campaign to re-elect Obama cost about $930,000,000, or about $15 per vote. Or was it? We have to remember that the most important people were the undecided voters. Those who would have voted Democrat or Republican anyway do not count in the calculation, as their votes are a given and no positive or negative marketing would have made an impact on them. In America, for the 2012 election, about 23% of registered voters classed themselves as swing/floating voters, with 39% committed Democrats and 37% committed Republicans who, as mentioned, will vote for their party regardless. So to get from 39% of voters to 50.5% actually cost the Obama campaign just over $300 per vote. A back-of-the-envelope calculation across the combined Democratic and Republican campaigns gives a total spend of $639 for each swing voter. These figures are a very rough approximation, but they indicate that in the whole population of voters, only 23% can be classed as actively engaged.
So we can draw a rough picture of how users interact:

1. Expect a low engagement rate
2. Only expect simple answers to simple questions
3. Target your market selectively

Current customer and market research by Adobe has highlighted an interesting feature request: users want a ‘Dislike’ button. Why? Maybe because the user wants a choice, to be able to say ‘No, this doesn’t appeal to me’. Maybe only having a ‘Like’ button is a bit corporately sycophantic: too many ‘Yes’ men driving the functions of the business towards disaster. How many people would have pushed the ‘Like’ button if Captain Smith of the Titanic had posted ‘Just told the engine room full steam ahead….icebergs be damned!’?
So we can add the following to the list:

4. Users would like the choice to ‘Like’ things or not

Currently companies such as Starbucks & Amazon are having a bit of an image issue with the British public due to their low payment of corporation tax. What would be interesting to see is a comparison of their ‘Likes’ over this period. Have they gone down? Have they stayed the same? What would also be an interesting thought experiment is what would have happened if there were a ‘Dislike’ button. Would a user be more likely to press that option, rather than not respond at all? If we are measuring results and success accurately then companies need the fullest information they can get, and ‘Dislike’ may help provide that.

MDX and Sum columns

To be honest this post is just to remind myself on how to do stuff!
I was trying to work out the throughput of people going through the stages of a program and got stuck on how to SUM a column in MDX.

Current Program | Next Program | No of People | Sum of People | Percentage To Next Stage
Program 1       | Program 2    | 152          | 310           | 49.0%
Program 1       | Program 3    | 68           | 310           | 21.9%
Program 1       | Program 4    | 47           | 310           | 15.2%
Program 1       | Program 5    | 33           | 310           | 10.6%
Program 1       | Program 6    | 10           | 310           | 3.2%
                | Total        | 310          | N/A           | 100.0%
This is what I was looking for: the percentage of people moving on to the next stages. How did I get the sum of ‘No of People’? Well, I used the MDX SUM function with the AXIS function thrown in as well.
MEMBER [Measures].[Sum of People] AS SUM(AXIS(1), [Measures].[No of People])
So the ‘Percentage To Next Stage’ was easy:
MEMBER [Measures].[Percentage To Next Stage] AS [Measures].[No of People] / [Measures].[Sum of People], FORMAT_STRING = 'Percent'
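For context, here’s a rough sketch of those two calculated members in a complete query. The cube, dimension and attribute names below ([Program Flow], [Program].[Current Program], [Program].[Next Program]) are made up for illustration, so swap in your own:

WITH
MEMBER [Measures].[Sum of People] AS
    SUM(AXIS(1), [Measures].[No of People])
MEMBER [Measures].[Percentage To Next Stage] AS
    [Measures].[No of People] / [Measures].[Sum of People],
    FORMAT_STRING = 'Percent'
SELECT
    { [Measures].[No of People],
      [Measures].[Sum of People],
      [Measures].[Percentage To Next Stage] } ON COLUMNS,
    NON EMPTY
      [Program].[Current Program].[Program 1]
      * [Program].[Next Program].[Next Program].MEMBERS ON ROWS
FROM [Program Flow];

Because the denominator is built from AXIS(1), it re-totals over whatever ends up on ROWS, which is why the percentages always add up to 100% for the set you are actually looking at.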

SSIS – XML, Foreach Loop and sub folders

SSIS normally has most things you need out of the box to get the job done; however, sometimes a client comes along with a data source that requires a bit of out-of-the-box thinking. In this case it was a load of CSV files in a folder with over 30 sub-folders, each file named with a time stamp and containing a snapshot of their infrastructure. The requirement also needed to scale, so the client could add other folders with the same sub-folder structure for other parts of their infrastructure. So you would have something like this:
ABCD
   Network
      12345678.csv
   Space
      12345678.csv
   Sessions
      12345678.csv
   …and so on for 32 other folders
Right, not a problem: I’ll just set up an SSIS Foreach Loop container, tick the ‘Traverse subfolders’ option and then set up the data connections.

No, sadly the ‘Traverse subfolders’ option doesn’t work like that. Nuts. So after a quick search I found this link at Joost van Rossum’s blog, which uses a bit of C# to get the folder list and generate an XML document with the folder list in it. You can then use the ‘Foreach NodeList Enumerator’ in the Foreach Loop to get the files.

Well, sort of: the code on that website only gets the folder structure, not the full path of each file. It was, however, a good starting point, and it looked like it could be adapted to get the full file list, which could then be passed on to the data flow logic in the Foreach container. Now, I’m TSQL through and through; my C# is poor, but slowly getting better. I did, however, manage to hack the code so it gets the file list. So here it is, please use, improve and share.

#region Namespaces
using System;
using System.Data;
using System.IO;
using System.Xml;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;
using System.Collections.Generic;
#endregion

namespace ST_b87894259c434eeca3da339009a06fdf
{
/// <summary>
/// ScriptMain is the entry point class of the script. Do not change the name, attributes,
/// or parent of this class.
/// </summary>


// Use this for SQL Server 2012 and above
[Microsoft.SqlServer.Dts.Tasks.ScriptTask.SSISScriptTaskEntryPointAttribute]
// Use the below for SQL Server 2008, comment out the above
// [System.AddIn.AddIn("ScriptMain", Version = "1.0", Publisher = "", Description = "")]
public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
{

#region VSTA generated code
enum ScriptResults
{
Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
};
#endregion

// Variables for the xml string
private XmlDocument xmldoc;
private XmlElement xmlRootElem;

public void Main()
{
// Initialize the XML document
xmldoc = new XmlDocument();

// Add the root element:
xmlRootElem = xmldoc.CreateElement("", "ROOT", "");

// Add Subfolders as Child elements to the root element
GetSubFolders(Dts.Variables["User::FeedsFilePath"].Value.ToString());

// Add root element to XMLdoc
xmldoc.AppendChild(xmlRootElem);

// Fill SSIS variable with XMLdoc
Dts.Variables["xmldoc"].Value = xmldoc.InnerXml.ToString();

Dts.TaskResult = (int)ScriptResults.Success;
}

// Builds the XML: adds a "File" element for every file found under the parent folder
private void GetSubFolders(String parentFolder)
{
// Get the full list of files in the parent folder and all of its subfolders
var allfiles = DirSearch(parentFolder);

foreach (var filePath in allfiles)
{
XmlElement xmlChildElem;
XmlText xmltext;

// Create a child element "File" holding the full file path,
// e.g. E:\SomeFolder\Network\12345678.csv
xmlChildElem = xmldoc.CreateElement("", "File", "");
xmltext = xmldoc.CreateTextNode(filePath);
xmlChildElem.AppendChild(xmltext);

// Add child element to root element
xmlRootElem.AppendChild(xmlChildElem);
}

}
// Recursively gets the full list of files in a folder and all of its subfolders
private List<string> DirSearch(string sDir)
{
List<string> files = new List<string>();

foreach (string f in Directory.GetFiles(sDir))
{
files.Add(f);
}
foreach (string d in Directory.GetDirectories(sDir))
{
files.AddRange(DirSearch(d));
}
return files;
}
}
}
So that code now gets the full path for each file and adds it to the XML document, so you will end up with something like this:
<ROOT>
<File>E:\SomeFolder\Network\12345678.csv</File>
<File>E:\SomeFolder\Space\12345678.csv</File>
<File>E:\SomeFolder\Sessions\12345678.csv</File>
</ROOT>
That gets passed to the Foreach Loop. So now I have this list of files, how do I load them? Each file type has a data connection set up based on the example file the client submitted; once the columns and metadata were set up, the connection string could be moved to an expression driven by the current file from the XML.
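For reference, this is roughly how I’d expect the Foreach Loop and the connection to be wired up; treat the property values, the XPath and the variable mapping below as my assumptions about this package rather than gospel:

Foreach Loop Editor – Collection:
    Enumerator                  = Foreach NodeList Enumerator
    DocumentSourceType          = Variable
    DocumentSource              = User::xmldoc
    EnumerationType             = NodeText
    OuterXPathStringSourceType  = DirectInput
    OuterXPathString            = //File
Variable Mappings:
    User::SourceFolder          = Index 0   (holds the full file path for the current iteration)

Flat File Connection Manager – Expressions:
    ConnectionString            = @[User::SourceFolder]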
So the overall process looks like this; don’t worry, it looks terrifying but it is quite simple!

Here’s a bit of detail around the process:
(1) – The C# script is run to feed the Foreach container
(2) – EXP – Get Feeds Filename. This takes the file path ‘E:\SomeFolder\Network\12345678.csv’ and removes the ‘E:\SomeFolder\’ part, returning just ‘Network\12345678.csv’. This is used in later steps to move the file to a completed or failed folder, both of which are set as project variables, so we simply append the ‘Network\12345678.csv’ part with an expression to move the file to the correct folder, for example ‘E:\SomeFolder\Processed\Network\12345678.csv’ (see the expression sketch after this list).
(3) – To route the right file to the right data flow task, an expression (FINDSTRING(@[User::SourceFolder], "\\Network\\", 1) > 0) and precedence constraint (Success) are used to trigger the correct data flow, and of course each data flow’s source connection is controlled by the User::SourceFolder variable. The last two steps, as mentioned, move the completed CSV file to a processed or failed folder, using precedence constraints set to a logical ‘Or’ so that after each data flow the file is moved to the right place. The process loops and loads until it reaches the last file in the XML and then stops. It’s a lot more manageable than creating 35 individual packages to load the data and get around the ‘Traverse subfolders’ issue. After testing it works very well, and the client can add folders with the same folder structure for other parts of their infrastructure and the process will accommodate them. The only pain was setting up all the precedence constraints!
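As a sketch of what those expressions can look like (I’m assuming the base folder lives in User::FeedsFilePath without a trailing backslash, and I’ve invented User::ProcessedFolder as the name of the completed-folder variable, so adjust to your own names):

Relative part, e.g. Network\12345678.csv:
    REPLACE(@[User::SourceFolder], @[User::FeedsFilePath] + "\\", "")

Destination for a successful load, e.g. E:\SomeFolder\Processed\Network\12345678.csv:
    @[User::ProcessedFolder] + "\\" + REPLACE(@[User::SourceFolder], @[User::FeedsFilePath] + "\\", "")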

Deploying SSIS Packages SQL Server 2008

Bit old school, this: having been spoilt by SQL Server 2012 and the changes to SSIS it brought with it, deploying stuff in Business Intelligence Development Studio (BIDS) is less user friendly than I recall. I had to deploy about 20 SSIS packages to the SSIS instance, and the import dialog in SSMS is OK for one or two packages, but I wanted to do them all without repetition. So it was time for old-school command line stuff using DTUTIL, which is still around even in the coming SQL Server 2016.
for %I in (*.dtsx) do dtutil /FILE "%I" /COPY SQL;"/SSIS_Folder/%~nI" /DESTSERVER localhost
In this example I’ve already created a folder called ‘SSIS_Folder’ under the localhost/Stored Packages/MSDB location, and I’m running a cmd window from the SSIS package location. Let’s break it down:
This bit just loops through all items in the folder that have a ‘.dtsx’ extension: good old ‘for’, ‘in’ & ‘do’. ‘%I’ is just the variable that holds the file name.
for %I in (*.dtsx) do
The next bit is the good stuff: this calls dtutil and copies the package held in the variable ‘%I’ into the folder SSIS_Folder on the destination server ‘localhost’. The ‘%~nI’ modifier expands to the file name without its extension, so the deployed package is named after the file.
dtutil /FILE "%I" /COPY SQL;"/SSIS_Folder/%~nI" /DESTSERVER localhost
Nice and easy, and it completes in seconds. If you need to deploy the same files again you can run the command again and it will ask you to confirm each overwrite with a Y/N prompt; to suppress this prompt and automatically overwrite the package, add ‘/Q’ (for Quiet) to the end of the command, as shown below.
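For example, the same loop with the prompts suppressed would look something like this:
for %I in (*.dtsx) do dtutil /FILE "%I" /COPY SQL;"/SSIS_Folder/%~nI" /DESTSERVER localhost /Q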

Long live the PC

There has been a lot of talk recently of this now being the ‘Post PC’ era, but is it true, is the PC dead? And what does the current move away from PCs to tablets have in common with the Galapagos Islands? That last question may seem a little strange, but there is a connection, as you will see.
The end of the PC is like one of those conspiracy theories about the end of the world: always upon us or just around the corner, but it never actually happens. Back in the early 2000s it was imagined that the house PC would be a small black humming box that sits under the stairs and connects to everything. The buzzword at the time was ‘Convergence’, which basically meant your PC, DVD, TV, games consoles, phone & internet would all be replaced with some sort of super device. But that hasn’t happened, as each maker wanted it to ‘Converge’ on their device, which has led to PCs with TV tuners and TVs with internet access. A couple of things stopped ‘Convergence’: a lack of connection standards and agreement between companies. Content was restricted, as media creators and distributors refused to release items in different formats to keep costs down, also wanting to supply their content through their own portals. This is changing now, with Hulu, Netflix and iTunes providing easily accessible portals across a wide range of devices.
If we look at the basic design of the PC as ‘Monitor, Keyboard, Mouse’, the form factor has tried to change over the past 20 years but has fundamentally stayed the same. The movement from the Desktop to the Laptop kept the same format. The movement from Laptops to Netbooks was the same; nothing essentially changed with the base configuration. Sure, the mouse got changed to a touch-pad pointing device, but nothing really moved forward; it just got smaller over the years.
How does this relate to the Galapagos Islands? Well, like the PC form factor, animals such as the Galapagos finch adapted from one form and evolved into many variants, each exploiting the particulars of their environment.
The IT marketing people have appropriated the word ‘Ecosystem’ from the ecologists, but missed another important word they use: ‘Niche’. In the ecological and evolutionary world, one size does not fit all, as Darwin found with the finches and tortoises on the Galapagos Islands. They all evolved from a common ancestor and have over time radiated out into distinct types, and in the current PC market this expansion of devices into ‘Niches’ explains a lot. Netbooks are not expected to be made in 2014, as tablet sales such as the Surface/iPad/Nexus have destroyed the ‘Small & Portable’ Niche that they once held; they have been out-evolved and out-competed into extinction.
So tablets have successfully filled the Niche that Netbooks once held, but how and why? Well, there have been a few things that have helped. The first and most important is the success of the PC itself: everyone had one and wanted access to the internet, school work, home accounts or games. People became reliant on the PC as prices dropped and speed, capacity and user friendliness increased. That increase in speed and capacity was also a hindrance: if you think about the desktop, most people do not need an 8-core processor, 8 gigs of RAM and 2 terabytes of storage space if all they are doing is checking their e-mails, watching YouTube and playing Angry Birds. The Netbook was under-powered, trying to be a more portable version of the laptop, struggling with screen size and resolution while trying to keep the full-size keyboard functions intact. It was cut down in size, but was still trying to be a laptop offering the full range of functionality.
The second factor is related to the first, and that is a decoupling of the content creator and the content consumer. Power users will still require more than a tablet to produce good content; they will need more power, more storage, more than the limited functions of a tablet-and-App combination can provide. Some people do not even need a tablet; a smartphone can cope with day-to-day use of checking emails and looking things up on the web.
The third factor was the development of operating systems built specifically for tablets. Keeping with the ecological metaphor, before a great evolutionary leap there are a few dead ends; the best known was the attempt by Microsoft in the early 2000s to produce a Tablet PC, which was basically a laptop with a folding screen and a basic update to Windows XP to make it sort of usable with a stylus. Apple’s iOS was a game changer, stripping out anything held over from the desktop versions of the OS and reducing it down to the fundamentals needed for touch-based devices and for the consumption of media. Hand in hand with this operating-system development has been the movement from the web-browser world, where everything was accessible and linked in your browser of choice, to the App-centric world. The YouTube app is a great example of this change: the App is designed from the ground up for the imprecise nature of touch, adapted to the limits of writing and navigating on a tablet.
The spread of this ecosystem of devices has had an impact on the delivery of BI, as the information that business runs on has become further decoupled from the presentation method. We now gather data in our SQL Server data warehouse and deliver it to the SharePoint dashboard, to the tablet App, to the smartphone. Each can show the same information, but presented in a way that recognises the limitations of the presentation medium. The CEO can check their tablet for the high-level figures presented in nice swish graphics designed for it. The manager on the move can get bite-size chunks of info sent to the smartphone, and the power users in finance can use SharePoint, Excel, Pivot Tables and cube connections to analyse granular data.
The PC isn’t dead; we have just forgotten what PC means. PC is Personal Computer, and with the spread of devices from the desktop, to the laptop, to the tablet, to the smartphone, to….. whatever is to come…. all that has happened is that ‘Computing’ has become more ‘Personal’. It’s now with us all the time, fitting what we need, function-wise, into an accessible form. The PC has become more PC, and we suspect that it is going to be with us for some time to come.