The trouble with twins

Sadly there hasn’t been much love shown to Data Quality Services (DQS) in the last few releases of SQL Server, and I don’t think there will be in the coming SQL Server vNext release.

DQS is slow, it’s cumbersome, it’s interface is terrible, it’s does not scale well and takes a number of headache inducing workarounds to get it to a level of decent performance.

However it works, and can work well when it does. However as with all things data when it interacts with people it hurts. Some background to the project, DQS is matching people from separate data sources, each without a common identifier or reference across the systems. So we running some matching to on the data. We can add synonyms, so names like David, can be referred to as Dave, or John, Jon and Jonathan, so in the event of one system using one name it will match with the data from the other. 
So in match we can get a level of confidence, expressed in a percentage, in fact the process uses not just the name but a bunch of other factors, date of birth, address and so on. But one thing it breaks down on is a certain type of twins.
What sort? twins with identical, and nearly identical names. 
Identically named twins, surely no one does that? Sorry yes they do. I have encountered and confirmed that there (At least in one area 4 sets) twins named the same, without a middle name to differentiate them.
As for nearly identical names, again yes, for example Jane and Jade, only one letter different. So when matching them it gets a high level of success, in the high 90’s. What to do? The confidence level can’t be changed as it will ignore or create people when it should or shouldn’t.
Ahhhh people!

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s