
Stop mocking me! Unit tests in PySpark using Python’s mock library


Testing.

Fundamental in software development, often overlooked by data scientists, but important. In this post, I'll show how to do unit testing in PySpark using Python's unittest.mock library. I'll do this from a data scientist's perspective: to me, that means I won't go into the software engineering details. I present just what you need to know.

First, a (semi) relevant clip from Family Guy: Stewie's "stop mocking me!" scene.

What is a unit test? What’s a mock?

A unit test is a way to test pieces of code to make sure things work as they should. The unittest.mock library in Python allows you to replace parts of your code with mock objects and make assertions about how they've been used. A "mock" is an object that does what the name says: it mimics the attributes and behavior of the objects/variables in your code.
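As a minimal illustration of that idea (the fake_db object here is a toy I made up, not something from the code we'll test later):

```python
from unittest import mock

# A MagicMock accepts any attribute access or call and records it.
fake_db = mock.MagicMock()

# Code under test would use fake_db as if it were a real connection:
fake_db.query("SELECT * FROM table")

# Afterwards, we can make assertions about how it was used.
fake_db.query.assert_called_once_with("SELECT * FROM table")  # passes
print(fake_db.query.call_count)  # 1
```

That record-then-assert pattern is all the mocking we'll need below.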

The end goal: testing spark.sql(query)

A simple way to create a dataframe in PySpark is to do the following:

df = spark.sql("SELECT * FROM table")

Although it’s simple, it should be tested.

The code and problem set up

Let’s say we work for an e-commerce clothing company, and our goal is to create a product similarity table that has been filtered on some conditions, and write it to HDFS.

Assume we have the following tables:

products.
Columns: "item_id", "category_id"

product_similarity (unfiltered).
Columns: "item_id_1", "item_id_2", "similarity_score"

(Assume the score in product_similarity is between 0 and 1, where the closer the score is to 1, the more similar the items. See my post on similarity metrics for more details if you’re interested).

Looking at pairs of items and their scores is as simple as:
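A sketch of what that query might look like, using the table and column names from the setup above (held in a string here; on a live SparkSession you'd pass it to spark.sql):

```python
# Pairs of items with their similarity scores, excluding self-pairs.
query = """
SELECT
    item_id_1,
    item_id_2,
    similarity_score
FROM product_similarity
WHERE item_id_1 != item_id_2
"""
# df = spark.sql(query)  # on a live SparkSession
```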

The where clause is to remove rows comparing an item to itself. The score will always be 1. How boring!

But what if we want to create a table that shows us similarity of items that are in the same category? What if we don’t care to compare shoes to scarves, but we want to compare shoes to shoes and scarves to scarves? This is a bit more complicated, and requires us to join the “products” and “product_similarity” tables.

The query would then become:
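One way to write that join (a sketch; the exact aliases are my choice): join products twice, once for each side of the pair, and require the two categories to match.

```python
# p1 looks up the category of the first item, p2 of the second.
query = """
SELECT
    s.item_id_1,
    s.item_id_2,
    s.similarity_score
FROM product_similarity s
INNER JOIN products p1
    ON s.item_id_1 = p1.item_id
INNER JOIN products p2
    ON s.item_id_2 = p2.item_id
    AND p1.category_id = p2.category_id
WHERE s.item_id_1 != s.item_id_2
"""
```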

We may also want to get the N most similar items for each product, so in that case, our query would become:
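One way to express "top N per product" (a sketch, with N = 10 as in the text) is a ROW_NUMBER window partitioned by the first item:

```python
# Rank each item's matches by similarity, then keep the top 10.
query = """
SELECT item_id_1, item_id_2, similarity_score
FROM (
    SELECT
        s.item_id_1,
        s.item_id_2,
        s.similarity_score,
        ROW_NUMBER() OVER (
            PARTITION BY s.item_id_1
            ORDER BY s.similarity_score DESC
        ) AS rank
    FROM product_similarity s
    INNER JOIN products p1
        ON s.item_id_1 = p1.item_id
    INNER JOIN products p2
        ON s.item_id_2 = p2.item_id
        AND p1.category_id = p2.category_id
    WHERE s.item_id_1 != s.item_id_2
) ranked
WHERE rank <= 10
"""
```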

(Assuming we use N = 10).

Now, what if we want the option to either compare products across categories, or only within categories? We can achieve this with a boolean variable same_category that produces a string same_category_q, which gets passed into the overall query (using .format()): equal to the inner join above if same_category is True, and empty if it's False. The query would then look like:
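Putting that together (a sketch; the table names are filled in literally here, and N = 10):

```python
same_category = True  # compare only within categories?

# The join clause is injected only when same_category is True.
same_category_q = (
    """INNER JOIN {products} p1
        ON s.item_id_1 = p1.item_id
    INNER JOIN {products} p2
        ON s.item_id_2 = p2.item_id
        AND p1.category_id = p2.category_id""".format(products="products")
    if same_category
    else ""
)

query = """
SELECT item_id_1, item_id_2, similarity_score
FROM (
    SELECT
        s.item_id_1,
        s.item_id_2,
        s.similarity_score,
        ROW_NUMBER() OVER (
            PARTITION BY s.item_id_1
            ORDER BY s.similarity_score DESC
        ) AS rank
    FROM {product_similarity} s
    {same_category_q}
    WHERE s.item_id_1 != s.item_id_2
) ranked
WHERE rank <= {n}
""".format(
    product_similarity="product_similarity",
    same_category_q=same_category_q,
    n=10,
)
```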

Let’s make it a bit more clear, and wrap this logic in a function that returns same_category_q:
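A sketch of such a function, taking the boolean and a dictionary of table paths (the exact signature and the table_paths keys are my assumptions):

```python
def make_query(same_category, table_paths):
    """Return the join clause that restricts pairs to the same category,
    or an empty string if cross-category pairs are allowed."""
    if same_category:
        same_category_q = """
        INNER JOIN {products} p1
            ON s.item_id_1 = p1.item_id
        INNER JOIN {products} p2
            ON s.item_id_2 = p2.item_id
            AND p1.category_id = p2.category_id
        """.format(products=table_paths["products"])
    else:
        same_category_q = ""
    return same_category_q
```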

So far, so good. The function returns the query string same_category_q, so we can test that it returns what we expect.

Keeping in mind our final goal, we want to write a dataframe to HDFS. We can do this with the following function:
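A sketch of that function (the signature, the shortened query template, and the parquet format / overwrite mode are assumptions on my part):

```python
# Shortened query template; the full version from the previous section
# also keeps only the top-N matches per item.
create_table_query = """
SELECT s.item_id_1, s.item_id_2, s.similarity_score
FROM {product_similarity} s
{same_category_q}
WHERE s.item_id_1 != s.item_id_2
"""

def create_new_table(spark, table_paths, same_category_q, save_path):
    """Run the query and write the result to HDFS as a single file."""
    created_table = spark.sql(create_table_query.format(
        product_similarity=table_paths["product_similarity"],
        same_category_q=same_category_q,
    ))
    # coalesce(1) merges partitions so the output is a single file
    coalesced_table = created_table.coalesce(1)
    coalesced_table.write.save(save_path, format="parquet", mode="overwrite")
```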

Adding the first part of the query and a main method to complete our script, we get:
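Assembled, the script might look like this (a sketch; the table paths and save path are hypothetical placeholders):

```python
create_table_query = """
SELECT item_id_1, item_id_2, similarity_score
FROM (
    SELECT
        s.item_id_1,
        s.item_id_2,
        s.similarity_score,
        ROW_NUMBER() OVER (
            PARTITION BY s.item_id_1
            ORDER BY s.similarity_score DESC
        ) AS rank
    FROM {product_similarity} s
    {same_category_q}
    WHERE s.item_id_1 != s.item_id_2
) ranked
WHERE rank <= {n}
"""

def make_query(same_category, table_paths):
    """Join clause restricting pairs to the same category, or ""."""
    if same_category:
        return """
        INNER JOIN {products} p1
            ON s.item_id_1 = p1.item_id
        INNER JOIN {products} p2
            ON s.item_id_2 = p2.item_id
            AND p1.category_id = p2.category_id
        """.format(products=table_paths["products"])
    return ""

def create_new_table(spark, table_paths, same_category_q, save_path):
    """Run the query, coalesce to one file, and save to HDFS."""
    created_table = spark.sql(create_table_query.format(
        product_similarity=table_paths["product_similarity"],
        same_category_q=same_category_q,
        n=10,
    ))
    coalesced_table = created_table.coalesce(1)
    coalesced_table.write.save(save_path, format="parquet", mode="overwrite")

def main():
    # Imported here so the rest of the module can be used in tests
    # without a Spark installation.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("product_similarity").getOrCreate()
    table_paths = {
        "products": "db.products",                      # hypothetical
        "product_similarity": "db.product_similarity",  # hypothetical
    }
    same_category_q = make_query(True, table_paths)
    create_new_table(spark, table_paths, same_category_q,
                     "/hdfs/path/filtered_similarity")  # hypothetical
```

Under normal use, main() would run behind an `if __name__ == "__main__":` guard.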

Time to test!

The idea here is that we want to create a function, generically named test_name_of_function() , for each function in our script. We want to test that the function behaves as it should, and we ensure this by using assert statements all over the place.

test_make_query, true and false

First, let’s test the make_query function. Recall that make_query takes in two inputs: a boolean variable and some table paths. It’ll return different values for same_category_q based on the boolean same_category. What we’re sort of doing here is like a set of if-then statements:

If same_category is True, then same_category_q = "INNER JOIN ..."
If same_category is False, then same_category_q = "" (empty)

What we do is pass in make_query's parameters and test that we get our desired outputs. Since test_paths is a dictionary, we don't need to mock it. The test script is below, with comments added for extra clarity.
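A sketch of that test script (make_query is repeated here so the example runs standalone; in practice you'd import it from your script's module, and the db.* paths are hypothetical):

```python
def make_query(same_category, table_paths):
    # Repeated from the script above so this example is self-contained.
    if same_category:
        return """
        INNER JOIN {products} p1
            ON s.item_id_1 = p1.item_id
        INNER JOIN {products} p2
            ON s.item_id_2 = p2.item_id
            AND p1.category_id = p2.category_id
        """.format(products=table_paths["products"])
    return ""

# A plain dictionary stands in for the real table paths -- no mock needed.
test_paths = {
    "products": "db.products",
    "product_similarity": "db.product_similarity",
}

def test_make_query_true():
    # same_category=True should produce the INNER JOIN clause,
    # formatted with the products table path.
    same_category_q = make_query(True, test_paths)
    assert "INNER JOIN db.products" in same_category_q

def test_make_query_false():
    # same_category=False should produce an empty string.
    same_category_q = make_query(False, test_paths)
    assert same_category_q == ""
```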

And it’s as simple as that!

Testing our table creation

Next up, we need to test that our create_new_table function behaves as it should. Stepping through the function, we see that it does several things, with several opportunities to make some assertions and mocks. Note that whenever we have something like df.write.save.something.anotherthing , we need to mock each operation and its output.

1. The function takes spark as a parameter. This needs to be mocked.
2. Create created_table by calling spark.sql(create_table_query.format(**some_args)). We want to assert that spark.sql() is called only once. We'll need to mock the output of spark.sql() as well.
3. Coalesce created_table. Ensure that coalesce() is called with the parameter 1. Mock the output.
4. Write the coalesced table. We need to mock .write, and mock the output of calling it on our table.
5. Save the coalesced table to a save path. Ensure that it's been called with the correct parameters.

As before, the test script is below, with comments for clarity.
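A sketch of that test (a condensed create_new_table is repeated here so the example runs standalone; the save path and parquet/overwrite options are assumptions carried over from earlier):

```python
from unittest import mock

# Condensed version of the function under test, repeated so this
# example is self-contained; in practice, import it from your module.
create_table_query = """
SELECT s.item_id_1, s.item_id_2, s.similarity_score
FROM {product_similarity} s
{same_category_q}
WHERE s.item_id_1 != s.item_id_2
"""

def create_new_table(spark, table_paths, same_category_q, save_path):
    created_table = spark.sql(create_table_query.format(
        product_similarity=table_paths["product_similarity"],
        same_category_q=same_category_q,
    ))
    coalesced_table = created_table.coalesce(1)
    coalesced_table.write.save(save_path, format="parquet", mode="overwrite")

test_paths = {"products": "db.products",
              "product_similarity": "db.product_similarity"}

def test_create_new_table():
    # Mock the SparkSession; MagicMock auto-creates the chained
    # attributes (.sql, .coalesce, .write, .save) as they are touched.
    mock_spark = mock.MagicMock()
    mock_created = mock_spark.sql.return_value            # output of spark.sql()
    mock_coalesced = mock_created.coalesce.return_value   # output of .coalesce(1)

    create_new_table(mock_spark, test_paths, "", "/hdfs/some/path")

    # spark.sql() ran exactly once...
    mock_spark.sql.assert_called_once()
    # ...the result was coalesced into a single partition...
    mock_created.coalesce.assert_called_once_with(1)
    # ...and saved to the right place with the right options.
    mock_coalesced.write.save.assert_called_once_with(
        "/hdfs/some/path", format="parquet", mode="overwrite")
```

Note how each step in the chain gets its own mock via .return_value, exactly mirroring the list of steps above.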

Finally, save everything in a folder together. You could import the functions from their respective modules if you want to, or just have everything in one script.

To test it, navigate in the command line to your folder ( cd folder/working_folder ) and call:

python -m pytest testing_tutorial.py

You should see something like:

serena@Comp-205:~/workspace$ python -m pytest testing_tutorial.py
============================= test session starts ==============================
platform linux -- Python 3.6.4, pytest-3.3.2, py-1.5.2, pluggy-0.6.0
rootdir: /home/serena/workspace/Personal, inifile:
plugins: mock-1.10.0
collected 3 items
testing_tutorial.py ... [100%]
=========================== 3 passed in 0.01 seconds ===========================

And that’s it!

There you have it. I hope this was somewhat helpful. When I was trying to figure out how to mock, I wish I had come across a tutorial like this. Now go ahead, and as Stewie said, (don’t) stop mocking me (functions)!
