
Crash Your Data Science Code Before It is Too Late

[Category: Development (Python) | Posted by: 店小二05 | Date: 2018 | Author: 红领巾]

How do you feel when your script crashes? If the script is in production, you have a problem. Even if it is not in production, it is not the best feeling on earth. However it may make you feel, a crashing script can be a great thing for you. A script that crashes at the right place and time can save you tremendous time and effort.

As we code, we constantly make assumptions. Those assumptions are our expectations about the data, about how the code will be used, and more. Sometimes we make assumptions deliberately. Other times, we do not even realize that we are making them. Assumptions are everywhere in code. Even dividing a number by a variable contains assumptions. Assumptions can save your code, or break it at the most unexpected place and time. Assumptions are inevitable, but they can be tested.

Let me show you how even the simplest division can be disastrous. In this example, we will divide 1 by a variable (1/var). If the variable's value is zero, you will get an error. While this is true in most cases, some libraries like numpy allow it. In those cases, the expression evaluates to infinity.


While standard Python raises an error, numpy evaluates a division by zero to infinity.
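The difference can be reproduced with a short sketch. The variable names are illustrative, and the numpy warning is silenced with np.errstate only to keep the output clean.

```python
import numpy as np

var = 0

# Standard Python refuses to divide by zero and raises an exception.
try:
    1 / var
    python_crashed = False
except ZeroDivisionError:
    python_crashed = True

# numpy follows floating-point rules instead: the zero element yields
# infinity and the computation keeps going.
with np.errstate(divide="ignore"):
    numpy_result = 1 / np.array([var, 2.0])

print(python_crashed)
print(numpy_result)
```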

It is better for code to throw an error than to keep going with erroneous data. If your code stops there, you will immediately know that there is a problem. Moreover, you will know the exact location of the issue. If it does not stop, your code will continue executing with corrupted or erroneous data. Depending on the code, the issue may spread like a disease. It may update a data store with corrupted data or start a training run that takes days. In any case, once the issue has propagated to later stages, detecting and debugging it becomes a much bigger hassle.

A program that does not crash at the right spot can cause a big hassle. If the program finishes without any error, it may be difficult even to notice that there is a problem. This is an even more concrete issue in data science, where we often work with uncertainty and complex machine learning systems. If your system is stochastic or has machine learning components, weird results become harder to explain. It is common to blame the machine learning components for unexpected results when the code is actually at fault, and vice versa.

Having discussed the importance of assumptions in code, let me present my approach. Assumptions can be dangerous, but luckily they can be tested easily. The easiest way I have learned is to put checks in your code for as many assumptions as you can find. The checks make sure the assumptions still hold at runtime, every time the code runs. In Python, we can check assumptions (among other things) with assert. The statement is very simple. The syntax is as follows:

assert <<the expression that must be always true>> , or:

assert <<the expression that must be always true>>, <<string that will be printed if not>>

It is as simple as that. Assert does one thing. At runtime, it checks whether the expression is true. If it is not, it raises an AssertionError, which crashes the program unless it is caught. If you provided a string, it is printed as part of the error when the program crashes. If the expression is true, assert does nothing at all. The program flow simply continues.
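Both outcomes can be seen in a minimal sketch. The function name check_positive and its message are hypothetical, and the failure is caught here only so that the message can be inspected.

```python
def check_positive(value):
    # Passes silently when the condition holds,
    # raises AssertionError with the message otherwise.
    assert value > 0, "Expected a positive value"
    return value

# The expression is true: nothing happens and the flow continues.
result = check_positive(5)

# The expression is false: an AssertionError carrying the message is raised.
try:
    check_positive(-1)
    message = None
except AssertionError as err:
    message = str(err)

print(result)
print(message)
```

One caveat: Python removes assert statements when the interpreter is started with the -O flag, so these checks only run in the default mode.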

For example, let us put a check on the length of a list. Let us assume that we never expect the list to be empty. Testing against this expectation (assumption) is as simple as this:

assert len(mylist) > 0 , or assert mylist

This one-liner will crash your script at that line if mylist is empty. When it crashes, you will see that the assert has failed and you will know exactly what caused the crash.



Sometimes the assert by itself is not descriptive enough. It is better practice to add a message that explains why the assert has failed.

assert mylist, "Expected non-empty list, got an empty one"
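As a sketch of how this looks inside a function, consider a hypothetical average helper. Without the assert, an empty list would still crash, but with a ZeroDivisionError that says nothing about the real cause.

```python
def average(mylist):
    # State the assumption explicitly, with a message that names the cause.
    assert mylist, "Expected non-empty list, got an empty one"
    return sum(mylist) / len(mylist)

mean = average([1, 2, 3])

try:
    average([])
    failure = None
except AssertionError as err:
    failure = str(err)

print(mean)
print(failure)
```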



As another example, let us go back to the division example. This time we will have a function that takes a numpy array and divides 1 by the array. For the sake of demonstration, let us assume that we do not expect any zeros in the array. That is quite a valid assumption, since working with infinities is not common.
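A minimal sketch of such a function, without any check, might look like the following. The name inverse and the sample values are assumptions for illustration. The warning is recorded with the standard warnings module so that it is visible in the output.

```python
import warnings

import numpy as np

def inverse(arr):
    # Hypothetical helper: element-wise 1 / arr, with no checks at all.
    return 1 / arr

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = inverse(np.array([1.0, 2.0, 0.0]))

print(result)                              # the last element is infinite
print([w.category.__name__ for w in caught])
```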



When the array contains a zero, the corresponding element of the resulting array becomes infinite. numpy does emit a warning, but it can easily be overlooked. Instead of letting the code go on like this, we can put in an assert to check our assumption.
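One way to write that check is sketched below. The function name safe_inverse and the message text are assumptions, not taken from the original post.

```python
import numpy as np

def safe_inverse(arr):
    arr = np.asarray(arr, dtype=float)
    # Our assumption: the input never contains a zero.
    assert not np.any(arr == 0), "Expected no zeros in the array"
    return 1 / arr

ok = safe_inverse([1.0, 2.0, 4.0])

try:
    safe_inverse([1.0, 0.0, 4.0])
    crashed = False
except AssertionError:
    crashed = True

print(ok)
print(crashed)
```

The function now stops at the first sign of a zero instead of passing an infinity downstream.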



As with writing tests, it is best to add asserts as you code. Do not wait until you have finished a big chunk of code. As you code, you are constantly thinking about the problem, so adding checks as you change or add code is the best approach. You may not have the same mindset if you wait.

Even though the examples in this post are simple, you can use asserts on many occasions. One example is checking the data distribution. For instance, you can use KL divergence to check at runtime whether a feature's distribution is still similar to the one seen in training. Another check that I find very useful is verifying that the feature ordering is still the same after each data processing step. Especially in data science, we cannot escape making assumptions. But we can check them to deliver valid, high-quality results and code.
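Both ideas can be sketched as follows. Everything here is an assumption made for illustration: the function names, the histogram-based estimate of KL divergence, the number of bins, and the threshold of 0.1, which would need tuning for real data.

```python
import numpy as np

def kl_divergence(p_counts, q_counts, eps=1e-9):
    # Smooth the counts so that empty bins do not produce log(0),
    # then normalize them into probability distributions.
    p = np.asarray(p_counts, dtype=float) + eps
    q = np.asarray(q_counts, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def check_feature_distribution(train_values, live_values, bins=10, threshold=0.1):
    # Use shared bin edges so that both histograms are comparable.
    edges = np.histogram_bin_edges(
        np.concatenate([train_values, live_values]), bins=bins
    )
    train_hist, _ = np.histogram(train_values, bins=edges)
    live_hist, _ = np.histogram(live_values, bins=edges)
    kl = kl_divergence(live_hist, train_hist)
    assert kl < threshold, f"Feature distribution drifted: KL={kl:.3f}"
    return kl

def check_feature_order(expected, actual):
    # The columns must keep their order after every processing step.
    assert list(expected) == list(actual), (
        f"Feature order changed: expected {list(expected)}, got {list(actual)}"
    )

train = np.linspace(0.0, 1.0, 1000)
check_feature_distribution(train, train)          # same distribution: passes
check_feature_order(["age", "income"], ["age", "income"])
```

Calling check_feature_distribution with live values from a clearly shifted range, or check_feature_order with swapped columns, raises an AssertionError at the step where the assumption broke.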
