未加星标

How to JOIN Dirty Data in SQL Server

字体大小 | |
[数据库(mssql) 所属分类 数据库(mssql) | 发布者 店小二04 | 时间 2017 | 作者 红领巾 ] 0人收藏点击收藏

In this puzzle, we’re going to learn how to join dirty data using SQL Server. Many times you’re forced to use someone else’s data. Since it is their system and data, not yours, you can’t always fix the inconsistently formatted data before it is in the database.

So how can you use it? Do this puzzle and learn how!

Solving puzzles is a great way to learn SQL. Nothing beats practicing what you’ve learned. Once you have figured out the puzzle, post you answer in the comments so we all can learn from one another. We also discuss puzzle and more in Essential SQL Learning Group on Facebook. Be sure to find us there!

How Do You JOIN Dirty Data in SQL Server?

In business, the data you get to work with isn’t always clean. For instance, I know of a major automotive manufacturer, whose part number prefix and suffix are separated with dashes in their engineering system, and spaces in their EDI system (e.g. ‘1234-4567’ versus ‘1234 4567’).

This makes it difficult to compare and join data from the two system, but in today’s puzzle, we’re going to explore this issue and learn how to do the joins.

For our puzzle we’ll use these tables:

DECLARE @PartDescription TABLE
(
PartNumber Varchar(20),
Name Varchar(20)
)
DECLARE @PartSales TABLE
(
PartNumber Varchar(20),
Year int,
Sales float
)

Your goal is to output PartNumber, Name, Year, and Sales. Sort your results by Name and Year.

Also, the PartNumber you display should have dashes separating the prefix (first four characters) from the suffix (last four characters).

Keep in mind the part number data is inconsistently formatted. You can find the sample data a here .

JOIN Dirty Data No Results!

Let’s see what happens when we try to join the data without first cleaning it. In a perfect world, we’d expect the correct results however since the part number is in consistently formatted, the join condition fails and the query returns no results.

SELECT P.PartNumber,
P.NAme,
S.Year,
S.Sales
FROM @PartDescription P
INNER JOIN @PartSales S
ON P.PartNumber = S.PartNumber
ORDER BY P.Name, S.Year

Obviously our data needs a good cleaning.

Getting Results with Dirty Data using REPLACE

One way to solve the puzzle is to replace the space and asterisk with a dash. Here is a query to do that:

SELECT REPLACE(REPLACE(P.PartNumber,'*','-'), ' ','-') as PartNumber,
P.NAme,
S.Year,
S.Sales
FROM @PartDescription P
INNER JOIN @PartSales S
ON REPLACE(REPLACE(P.PartNumber,'*','-'), ' ','-') =
REPLACE(REPLACE(S.PartNumber,'*','-'), ' ','-')
ORDER BY P.Name, S.Year

In this example we used the REPLACE function to search for incorrect characters, such as the asterisk and space, and substituted them with a dash.

You’ll notice that the REPLACE functions are nested. This allows us to get around a limitation with REPLACE.

Here is the result of the join once the data is cleansed.


How to JOIN Dirty Data in SQL Server

A flaw of this method is that you do need to the which characters you wish to cleanse. In our case we know that spaces and asterisks are unwanted, but what if we encounter an equals sign?

The way our solution is written, these would not be found and replaced with a dash. The join condition would fail.

Also, as you add more characters to search, you need to nest more REPLACE functions. After a while, this becomes cumbersome to read and error prone to write.

In a future version of SQL, we’ll be able to use TRANSLATE . Then, I’m hoping we can replace all special characters as so:

SELECT REPLACE(REPLACE(P.PartNumber,'*','-'), ' ','-') as PartNumber,
P.NAme,
S.Year,
S.Sales
FROM @PartDescription P
INNER JOIN @PartSales S
ON TRANSLATE(P.PartNumber,'!@#$%^&*()_+=','-------------') =
TRANSLATE(S.PartNumber, '!@#$%^&*()_+=','-------------')
ORDER BY P.Name, S.Year JOIN Dirty Data by Parsing the Part Number

Another way to solve is this problem is by parsing. That is we can separate the part number into the prefix and suffix. Once done, we join on these to match rows from the two tables.

Here is a diagram of the part number structure. In our example the prefix is the first four characters, and the suffix the last four.


How to JOIN Dirty Data in SQL Server

The prefix is calculated using the LEFT function. LEFT(P.PartNumber,4) returns the first four characters. Likewise, RIGHT(P.PartNumber,4) returns the last four characters.

Now all we need to do is write the join condition to compare the prefix and suffix from each table. Check out this query to see what I mean:

SELECT REPLACE(REPLACE(P.PartNumber,'*','-'), ' ','-') as PartNumber,
P.NAme,
S.Year,
S.Sales
FROM @PartDescription P
INNER JOIN @PartSales S
ON LEFT(P.PartNumber,4) = LEFT(S.PartNumber,4) AND
RIGHT(P.PartNumber,4) = RIGHT(S.PartNumber,4)
ORDER BY P.Name, S.Year Conclusion

There are certainly other ways to solve this puzzle. I hope by seeing the two solutions here, you have seen there are two drastically different approaches to the answer.

If you have a good answer, let me know. I would love to see it!

本文数据库(mssql)相关术语:熊片数据库 mssql数据库 oracle数据库 pubmed数据库 access数据库 万方数据库

主题: SQLSQL ServerFacebookEDI
分页:12
转载请注明
本文标题:How to JOIN Dirty Data in SQL Server
本站链接:http://www.codesec.net/view/530576.html
分享请点击:


1.凡CodeSecTeam转载的文章,均出自其它媒体或其他官网介绍,目的在于传递更多的信息,并不代表本站赞同其观点和其真实性负责;
2.转载的文章仅代表原创作者观点,与本站无关。其原创性以及文中陈述文字和内容未经本站证实,本站对该文以及其中全部或者部分内容、文字的真实性、完整性、及时性,不作出任何保证或承若;
3.如本站转载稿涉及版权等问题,请作者及时联系本站,我们会及时处理。
登录后可拥有收藏文章、关注作者等权限...
技术大类 技术大类 | 数据库(mssql) | 评论(0) | 阅读(53)