Epic high five, we just beat an enterprise algorithm


If you’ve never contemplated an algorithm for predicting lottery numbers or your fantasy football league, then this will be boring. But if you’re a secret data science nerd, please join me in an epic high five!


We reduced cluster analysis time to 66% of an enterprise product’s runtime

We just finished coding a cluster analysis tool for the data: the competition took 90 minutes, and we finished in under an hour.

The problem space is typically O(n!) and resembles your classic travelling salesman problem.

Our Slalom client had a data and cost problem.

The data problem

Take names, addresses, and other personal information, and consolidate that information across various sources. I’m writing this in a hurry, so I won’t reveal any client-specific details or the actual implementation (that belongs to the client now), but a similar problem space might be genealogy.

Given: you have death records, marriage records, birth records, and census records from different sources, across different times, with different spellings. You don’t want to duplicate data. You don’t want to lose data. So you need to form a cluster. A cluster is how you identify that 1890 census Jimmy Joehosaphat @ 10 Band St is the same person as 1880 census James Johosaphat @ 10 Bond Street.

I’m sure you can think of how to do this for one person, but how do you scale this problem to millions without killing your server?

The cost problem

Cost was a concern for the client. They were spending six figures a year on enterprise licensing. Naturally, I imagine they would use the money they save to buy a water fountain connected to a never-ending supply of Monster energy drinks.

The tools

Microsoft SQL Server 2012 and SSIS. Sometimes you gotta work with what you got. ;)

The solution

We used SQL Server Master Data Services (MDS) similarity functions and SSIS to find similarities among all the records. Think of the records as a set of stated relationships:

Luke has a father.

Luke has a sister.

Vader’s son is Luke.

Leah’s father is Vader.

Leah’s brother’s name is Luke.

Chewbacca is a Wookiee.

We stored those similarities in a table.
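To make that concrete, here’s a minimal sketch of what the matching step could look like. The mdq.Similarity function ships with the Master Data Services database; everything else here (the #Person table, the Jaro-Winkler method choice, the 0.85 threshold) is my own illustration, not the client’s actual rules.

```sql
-- Hypothetical source table, not the client's schema
CREATE TABLE #Person (RecordId INT, FullName NVARCHAR(200), Street NVARCHAR(200));

INSERT INTO #Person VALUES
    (1, N'James Johosaphat',  N'10 Bond Street'),   -- 1880 census
    (2, N'Jimmy Joehosaphat', N'10 Band St');       -- 1890 census

-- Score every candidate pair once; method 2 = Jaro-Winkler.
-- Pairs clearing the (made-up) 0.85 name threshold become relationship rows.
SELECT  a.RecordId,
        b.RecordId AS RelatedId,
        mdq.Similarity(a.FullName, b.FullName, 2, 0.0, 0.0) AS NameScore,
        mdq.Similarity(a.Street,   b.Street,   2, 0.0, 0.0) AS StreetScore
INTO    #SimilarityPair
FROM    #Person AS a
JOIN    #Person AS b
  ON    a.RecordId < b.RecordId                      -- each pair once
WHERE   mdq.Similarity(a.FullName, b.FullName, 2, 0.0, 0.0) >= 0.85;
```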

We then ran a stored procedure to consolidate all associative and transitive associations down to a single parent.

The Luke Skywalker cluster: Luke, Leah, Vader
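Here’s a minimal, self-contained sketch of the set-based idea, not the client’s actual procedure (the table names and Star Wars ids are mine): seed every record as its own parent, then let each record adopt the smallest parent id among its direct neighbors, all rows at once, repeating until a pass changes nothing.

```sql
-- Illustrative relationships: 1 = Luke, 2 = Vader, 3 = Leah
CREATE TABLE #Relation (RecordId INT NOT NULL, RelatedId INT NOT NULL);
INSERT INTO #Relation VALUES (2, 1), (3, 2), (3, 1);

-- A==B implies B==A: materialize the edge list in both directions
SELECT RecordId, RelatedId INTO #Edge FROM #Relation
UNION
SELECT RelatedId, RecordId FROM #Relation;

-- Every record starts out as its own cluster parent
SELECT DISTINCT RecordId, RecordId AS ParentId
INTO #Cluster
FROM #Edge;

DECLARE @changed INT = 1;
WHILE @changed > 0
BEGIN
    -- One set-based pass: each record adopts the smallest ParentId
    -- held by any of its direct neighbors
    UPDATE c
    SET    ParentId = m.NewParent
    FROM   #Cluster AS c
    JOIN  (SELECT e.RecordId, MIN(p.ParentId) AS NewParent
           FROM   #Edge AS e
           JOIN   #Cluster AS p ON p.RecordId = e.RelatedId
           GROUP  BY e.RecordId) AS m
      ON   m.RecordId = c.RecordId
    WHERE  m.NewParent < c.ParentId;

    SET @changed = @@ROWCOUNT;   -- converged when a full pass changes nothing
END;

-- Every member of a connected cluster now shares one ParentId
SELECT RecordId, ParentId FROM #Cluster ORDER BY ParentId, RecordId;
```

Each pass is one join and one aggregate over the whole edge table, so the number of passes is bounded by the longest chain in a cluster rather than by the row count.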

We did it. We came, we saw, we kicked its ass!

Enter the Slalom solution team
The Ghostbusters were represented by Slalom’s Data Management (MDM) and Application Development teams.

Data Management (MDM) team

Our superstars, Vaibhav Sathe and Soumen Chakraborty, would be Winston (Ernie Hudson) and Ray (Dan Aykroyd) personality-wise. These two spent several weeks building out the rules and fuzzy analysis for comparing the data and identifying possible cluster relationships. I worked with Soumen only briefly, so I’m typecasting the Aykroyd fit, but Vaibhav and I worked together at another client on a similar problem. He’s usually quiet and collected, but I can picture him screaming, “I love this town!” after defeating a giant marshmallow man.

Our data management practice area lead, Teddy Wei, is a force to be reckoned with. His left-field wit is equally matched by his commitment to the team and the client. He cares. Personality-wise, he’s definitely the Venkman (Bill Murray).

Also helping out at the MS SQL Server guru level was Kevin Feit. I’m saving the last Ghostbuster for myself, so I’ll have to give Kevin the miscast of Rick Moranis… sorry Kevin, I’m writing the article. ;) Intellectually, though, he is an Egon: he knows SQL Server and T-SQL like Neo knows kung fu.

Application Development team (we’re too good for acronyms)

And by we, I mean me. I helped out on this project for a little over a week. My job, and the part I’m particularly proud of, was writing a stored procedure to connect all the related clusters to a single parent using set-based operations.

Did you imagine a SQL Server cursor? No, I did it with set-based operations. Gosh! ;)

In our final tests, my set-based operation took 13% of the time of a cursor-based approach. I am the Egon (Harold Ramis). Plus, my hair fro’s up like Egon’s when I go too long without a haircut.

The scoreboard: race for speed

The most exciting part of this project for me came back in June, when Vaibhav (Winston) reached out to me. They were running into trouble with the performance of their final clustering stored procedure, the one that looks at all the relations between data points and creates one parent for each cluster.

They were using an approach that was taking six hours to run. The client needed a solution that completed in under two hours. And this was only a small sample of test data. If we couldn’t beat two hours for the end-to-end analysis, the project would be a failure.

If six hours seems excessive, here’s the problem space math.

You have a table with 40,000 relationships.
You’re trying to figure out which piece of spaghetti is touching which piece of spaghetti and which pieces those pieces are also touching. Ahh, recursion.

All many-to-many associative relationships.

(A==B, B==A, B==C)

Only a single level of a relationship is spelled out. You have to figure out the transitive relationships on your own.

(if A==B and B==C then A==C)

Doing this the hard way, you first look at record 1 and see who is related to record 1. Then, for each record related to record 1, you check whether it has any other related records and mark those as related to record 1 as well. You keep going until you can’t go any further. Then you move on to record 2.
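For contrast with the row-by-row crawl just described, here is what a single transitive-inference step looks like as one set-based statement; a hedged sketch with made-up table names, not the team’s original code. The trouble with the hard way is that it effectively re-derives rows like this one record at a time, over and over:

```sql
-- Hypothetical pair table: only direct relationships are spelled out
CREATE TABLE #Rel (A INT NOT NULL, B INT NOT NULL);
INSERT INTO #Rel VALUES (1, 2), (2, 3);        -- A==B, B==C

-- One inference step: if A==B and B==C then A==C (inserts the pair 1, 3).
-- Repeat until no new rows appear; without indexes, every repetition
-- rescans the whole table.
INSERT INTO #Rel (A, B)
SELECT DISTINCT r1.A, r2.B
FROM   #Rel AS r1
JOIN   #Rel AS r2 ON r2.A = r1.B
WHERE  r1.A <> r2.B
  AND NOT EXISTS (SELECT 1 FROM #Rel AS x
                  WHERE x.A = r1.A AND x.B = r2.B);
```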

Without the help of any indexes, SQL Server will have to perform 40,000! reads. I couldn’t find a calculator that could do a factorial that large.



It’s a recursive problem. This is why 40,000 relationships were taking six hours.

40,000! = forever

Then the algorithm was optimized to make the best use of indexes. What does that give you? You’re now doing factorials at a much smaller scale. If the average depth of a cluster is 6 nodes, that turns into 6!, which is 6*5*4*3*2*1 = 720. So if you end up with 1,700 final clusters, you get this optimal number:

6! * 1,700 = 1,224,000 records to read

So in your best-case scenario, SQL Server will not have to look at more than 1.2 million rows for this small sample size.

With ind
