Epic high five, we just beat an enterprise algorithm
If you’ve never contemplated an algorithm for predicting lottery numbers or picks for your fantasy football league, then this will be boring. But if you are a secret data science nerd, then please join me in an epic high five!
We reduced cluster analysis time to 66% of an enterprise product’s
We just completed coding a cluster analysis tool for data: the competition took 90 minutes; we finished in under an hour. The problem space is typically O(n!) and is like the classic travelling salesman problem.
Our Slalom client had a data problem and a cost problem.

The data problem
We needed to take names, addresses, and other personal information and consolidate records across various sources. I’m writing this in a hurry, so I won’t reveal any client-specific details or the actual implementation (that belongs to the client now), but a similar problem space might be genealogy.
Say you have death records, marriage records, birth records, and census records from different sources, across different times, with different spellings. You don’t want to duplicate data. You don’t want to lose data. So you need to form a cluster. A cluster is how you identify that 1890 census Jimmy Joehosaphat @ 10 Band St is the same person as 1880 census James Johosaphat @ 10 Bond Street.
I’m sure you can think of how to do this for one person, but how do you scale this problem to millions without killing your server?

The cost problem
Cost was a concern for the client. They were spending six figures a year on enterprise licensing. Naturally, I imagine they would use the money they save to buy a water fountain connected to a never-ending supply of Monster energy drinks.

The tools
Microsoft SQL Server 2012 and SSIS. Sometimes you gotta work with what you got. ;)

The solution
We used SQL Server Master Data Services (MDS) similarity functions and SSIS to find similarities among all the records.
Luke has a father.
Luke has a sister.
Vader’s son is Luke.
Leah’s father is Vader.
Leah’s brother’s name is Luke.
Chewbacca is a Wookiee.
We stored those similarities in a table.
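The actual rules and fuzzy matching belong to the client, but the general idea can be sketched in Python. Here, Python’s `difflib.SequenceMatcher` stands in for the MDS fuzzy-similarity functions, the sample records and the weights are made up for illustration, and the pairs that clear the threshold are the rows you’d store in the similarity table:

```python
from difflib import SequenceMatcher

# Hypothetical sample records (id, name, address); purely illustrative.
records = [
    (1, "Jimmy Joehosaphat", "10 Band St"),
    (2, "James Johosaphat", "10 Bond Street"),
    (3, "Ann Smith", "42 Oak Ave"),
]

def similarity(a, b):
    """Rough stand-in for an MDS fuzzy-similarity function (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Compare every pair of records; keep the pairs above a threshold as
# rows destined for the similarity table. Weights are arbitrary here.
THRESHOLD = 0.7
similar_pairs = []
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        id_a, name_a, addr_a = records[i]
        id_b, name_b, addr_b = records[j]
        score = 0.6 * similarity(name_a, name_b) + 0.4 * similarity(addr_a, addr_b)
        if score >= THRESHOLD:
            similar_pairs.append((id_a, id_b, round(score, 2)))

print(similar_pairs)  # Jimmy/James at Band/Bond Street should match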
We then ran a stored procedure to consolidate all associative and transitive associations down to a single parent.
The Luke Skywalker cluster: Luke, Leah, Vader
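One way to picture that consolidation step is as a connected-components problem. This is not the client’s implementation, just a minimal Python sketch using a union-find over the relations listed above, producing exactly the Luke cluster (with relation-less Chewbacca as his own cluster):

```python
# Pairwise relations from the facts above, expressed as edges.
edges = [
    ("Luke", "Vader"),
    ("Luke", "Leah"),
    ("Leah", "Vader"),
]
nodes = {"Luke", "Leah", "Vader", "Chewbacca"}

# Union-find: every node starts as its own parent.
parent = {n: n for n in nodes}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        # Deterministic choice: the lexicographically smaller root wins.
        if ra < rb:
            parent[rb] = ra
        else:
            parent[ra] = rb

for a, b in edges:
    union(a, b)

# Group every node under its single cluster parent.
clusters = {}
for n in nodes:
    clusters.setdefault(find(n), set()).add(n)
print(clusters)
```

Every associative and transitive relation collapses to one parent per cluster, which is the shape of the final output.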
We did it. We came, we saw, we kicked its ass!

Enter the Slalom solution team
The Ghostbusters were represented by Slalom’s Data Management (MDM) and Application Development teams.

Data Management (MDM) team
Our superstars Vaibhav Sathe and Soumen Chakraborty, personality-wise Winston (Ernie Hudson) and Ray (Dan Aykroyd), spent several weeks building out the rules and fuzzy analysis for comparing the data and identifying possible cluster relationships. I worked with Soumen only briefly, so I’m typecasting the Aykroyd fit, but Vaibhav and I worked together at another client on a similar problem. He’s usually quiet and collected, but I could picture him screaming, “I love this town!” after defeating a giant marshmallow man.
Our data management practice area lead, Teddy Wei, is a force to be reckoned with. His left-field wit is equally matched by his commitment to the team and the client. He cares. Personality-wise, he’s definitely the Venkman (Bill Murray).
Also helping out at the level of MS SQL Server guru was Kevin Feit. I’m saving the last Ghostbuster for myself, so I’ll have to give Kevin the miscast of Rick Moranis… Sorry Kevin, I’m writing the article. ;) Intellectually, though, he is an Egon; he knows SQL Server and T-SQL like Neo knows kung fu.

Application Development team (we’re too good for acronyms)
And by we, I mean me. I helped out on this project for a little over a week’s worth of time. My job, and the part I’m particularly proud of, was writing a stored procedure to connect all the related clusters to a single parent using set-based operations.
Did you imagine a SQL Server cursor? No, I did it with set-based operations. Gosh! ;)
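The real stored procedure is T-SQL and belongs to the client, so here is only a rough Python simulation of the set-based idea: instead of cursoring through records one at a time, each pass updates every record’s parent in one shot (the way a single `UPDATE … JOIN` statement would), repeating until nothing changes. The record IDs are hypothetical:

```python
# Hypothetical relationship table: each pair says "these records match".
relations = [(1, 2), (2, 3), (4, 5)]

# Start every record as its own parent.
ids = {i for pair in relations for i in pair}
parent = {i: i for i in ids}

changed = True
passes = 0
while changed:
    changed = False
    passes += 1
    # One "set-based" pass: for every relation, both sides adopt the
    # smaller of their current parents -- all rows at once, no cursor.
    new_parent = dict(parent)
    for a, b in relations:
        best = min(parent[a], parent[b])
        for x in (a, b):
            if new_parent[x] > best:
                new_parent[x] = best
                changed = True
    parent = new_parent

print(parent, passes)
```

Each pass is one scan of the relationship table, so the chain 1–2–3 converges in a handful of passes rather than a row-by-row walk.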
In our final tests, my set-based operation took 13% of the time of a cursor-based approach. I am the Egon (Harold Ramis). Plus my hair fro’s up like Egon’s when I go too long without a haircut.

The scoreboard, race for speed
The most exciting part of this project for me came back in June when Vaibhav (Winston) reached out to me. They were running into trouble with performance on their final clustering stored procedure, the one that looks at all the relations between data points and creates one parent for each cluster.
They were using an approach that was taking six hours to run. The client needed a solution that completed in under two hours. And this was only a small sample of test data. If we couldn’t beat two hours for the end-to-end analysis, then the project would be a failure.
If six hours seems excessive, here’s the problem space math. You have a table with 40,000 relationships.
You’re trying to figure out which piece of spaghetti is touching which piece of spaghetti and which pieces those pieces are also touching. Ahh, recursion.
All many-to-many associative relationships.
(A==B, B==A, B==C)
Only a single level of a relationship is spelled out. You have to figure out the transitive relationships on your own.
(if A==B and B==C then A==C)
Doing this the hard way, you first look at record 1 and see which records are related to it. Then, for each of those related records, you check whether they have any other related records and mark those as related to record 1 too. Keep going until you can’t go any further. Then go to record 2.
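That hard-way expansion can be sketched in a few lines of Python: start at a record, keep pulling in anything related to what you’ve already found, and only move on when the frontier is empty. The relationship pairs are made-up record IDs:

```python
from collections import deque

# Hypothetical relationship table: each pair says "these two records match".
relations = [(1, 2), (2, 3), (4, 5)]

# Adjacency lists for quick "who is related to r?" lookups.
neighbors = {}
for a, b in relations:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

seen = set()
clusters = []
for record in sorted(neighbors):   # "then go to record 2" and so on
    if record in seen:
        continue
    # Expand from this record until no new related records turn up.
    cluster = set()
    queue = deque([record])
    while queue:
        r = queue.popleft()
        if r in cluster:
            continue
        cluster.add(r)
        queue.extend(neighbors[r] - cluster)
    seen |= cluster
    clusters.append(cluster)

print(clusters)
```

Record 3 gets swept into record 1’s cluster through record 2 even though 1 and 3 were never directly related, which is the transitive part of the problem.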
Without the help of any indexes, SQL Server will have to perform something on the order of 40,000! reads. I couldn’t find a calculator to do a factorial that large.
It’s a recursive problem. This is why 40,000 relationships was taking six hours.
Then the algorithm was optimized to make the best use of indexes. What does that give you? You’re now doing factorials at a much smaller scale. If the average depth of a cluster is 6 nodes, that turns into 6!, which is 6 * 5 * 4 * 3 * 2 * 1, which equals 720. So if you end up with 1,700 final clusters, that gives you this optimal number:
6! * 1,700 = 1,224,000 records to read
So in your best scenario, SQL Server will not have to look at more than 1.2 million rows for this small sample size.