Go Data Science with Daniel Whitenack
Data science is typically done by engineers writing code in Python, R, or another scripting language. Lots of engineers know these languages, and their ecosystems have great library support. But these languages have some issues around deployment, reproducibility, and other areas. The programming language Golang presents an appealing alternative for data scientists.
Daniel Whitenack transitioned from doing most of his data science work in Python to writing code in Golang. In this episode, Daniel explains the workflow of a data scientist and discusses why Go is useful. We also talk about the blurry line between data science and data engineering, and how Pachyderm is useful for versioning and reproducibility. Daniel works at Pachyderm, and listeners who are more curious about it can check out the episode I did with Pachyderm founder Joe Doliner.
Sponsors
Dice.com helps you manage your career in tech. Dice.com has a huge index of tech job opportunities that it has developed from 20 years in the business of connecting tech professionals with job opportunities. To check out Dice and support Software Engineering Daily, go to dice.com/sedaily.
Exaptive simplifies your data application development. Exaptive is a data application studio that is optimized for rapid development of rich applications. Go to exaptive.com/sedaily to get a free trial and start building applications today.
Couchbase is a document database with the flexibility of NoSQL and the power of SQL. With Couchbase Server, you can build a fast, powerful NoSQL database that scales. Running Couchbase in containers on Kubernetes, Mesos, or OpenShift is easy, and at developer.couchbase.com you can find tutorials on how to build out your Couchbase deployment.
Transcript
[INTRODUCTION]
[0:00:00.3] JM: Data science is typically done by engineers writing code in Python, R, or another scripting language. Lots of engineers know these languages, and their ecosystems have great library support. But these languages have some issues around deployment, reproducibility, and other areas that we will get into in this episode.
The programming language Golang presents an appealing alternative for data scientists. Daniel Whitenack transitioned from doing most of his data science work in Python to writing code in Go. In this episode, Daniel explains the workflow of a data scientist and discusses why Go is useful. We also talked about the blurry line between data science and data engineering and how Pachyderm is useful for versioning and reproducibility.
Daniel works at Pachyderm, and listeners who are more curious about it can check out the episode I did with Pachyderm founder, Joe Doliner, which is in the show notes for this episode. I really enjoyed speaking with Daniel Whitenack, and I hope you enjoy this episode too.
[SPONSOR MESSAGE]
[0:01:11.3] JM: Dice helps you easily manage your tech career by offering a wide range of job opportunities and tools to help you advance your career. Visit Dice and support Software Engineering Daily at dice.com/sedaily, and check out the new Dice Careers mobile app. This user-friendly app gives you access to new opportunities in new ways. Not only can you browse thousands of tech jobs, but you can now discover what your skills are worth with the Dice Careers Market Value Calculator.
If you’re wondering what’s next, Dice’s brand new career pathing tool helps you understand which roles you can transition to based on your job title, location, and skill set. Dice even identifies gaps in your experience and suggests the skills that you’ll need to make a switch. Don’t just look for a job, manage your tech career with Dice. Visit the app store and download the Dice Careers app on Android or iOS.
To learn more and support Software Engineering Daily, go to dice.com/sedaily. Thanks to Dice for being a sponsor of Software Engineering Daily. We really appreciate it.
[INTERVIEW]
[0:02:25.3] JM: Daniel Whitenack is a data scientist with Pachyderm. Daniel, welcome to Software Engineering Daily.
[0:02:31.3] DW: Thank you, it’s great to be here. Thanks for inviting me.
[0:02:34.5] JM: Data science is a term that means different things to different people.
[0:02:40.9] DW: Indeed.
[0:02:42.7] JM: Yeah. We should define a little bit of what we’re talking about before we go into Go, and Python, and some other languages around data science. In the modern context, in 2017, what does it mean to be a data scientist?
[0:02:56.7] DW: Yeah, you’re exactly right that this term gets applied in so many different ways and so many different places. I kind of like to frame this in a couple of different ways. There’s the hashtag data science that you hear about on Twitter and other places, like machines playing board games like Go, or maybe self-driving cars and that sort of thing. Then, there’s what I would consider practical data science or day-to-day data science, which is really what a lot of people in industry who are called data scientists are doing.
In a lot of those scenarios, they’re not attempting to play board games or make cars drive by themselves. A lot of times what they’re trying to do is just figure out how to make various processes within a business more data-driven.
For example, that could be anything from the very back end side of things, where you might be analyzing log lines to try to predict or give some insight into your back end processes to improve uptime. Or it could be all the way on the other side of the spectrum, like sales, trying to analyze the various channels that you’re pouring money into, like your social media, and your website, and blogs, or whatever, to figure out where your customers are coming from and what to pour money into.
Really, what it comes down to is gathering data and doing some sort of analysis on it that eventually ends up helping people make decisions, and decisions that have some value within a company.
[0:04:36.7] JM: How much of it involves just crunching data in a place that’s separate from the actual application, and how much of it is building models that are going into production and processing user requests on the fly?
[0:04:52.7] DW: Yeah. Actually, I would say that even though a lot of visibility is given to building models and kind of predicting things, a lot of the time that a data scientist spends day-to-day, and this is borne out in various polls by Forbes and other people, is actually spent in gathering data, organizing it, parsing it, and really preparing your data to be used in a useful way.
That might be aggregating data from a bunch of different sources into a single dataset, or it might be cleaning up your data, formatting it, or filling in missing values or that sort of thing. Really, a lot of time is spent in that organizing phase. Then, outside of that organizing, you kind of build up some of these other, maybe more sophisticated things, including models.
Sometimes it might just even be like calculating a maximum value or a count of how many users were on your website. A lot of times, the