WTF is a 'Data Scientist'?

Let me tell you a short story. I worked in NYC for a few years, first at a hedge fund, then at a small software consulting company. On both assignments I was granted an E3 visa for skilled professionals - one of the few perks of being an Australian citizen in terms of immigration. My official title for the second position was Senior Data Scientist. Every time I returned to the US, I received the standard line of questioning from DHS - they wanted to know what I did for work. I challenge anyone to try and explain the concept of data science to the lovely folk manning the customs desks at JFK airport. I had to come up with a stock answer very quickly: "I do software development and data management". Which wasn't technically a lie, but not technically the truth either. It satisfied them and meant I could sidestep explaining my fairly new and undefined field of work.

So WTF is a Data Scientist? Plenty of people have weighed in on this topic over the last few years. Ever since that article from HBR, data science has been the hot new career. So many professionals have attempted to re-brand themselves with the title, without necessarily having the experience or skillset for the label to be accurately applied. People have also been falling over themselves to add their definition to the pile. Everyone is now an expert in what a data scientist should and should not be. I have my own opinion, but feel free to ignore it (or not) along with the others.

At the highest level, a data scientist is a problem solver. The position exists at the intersection of business, statistics and computer science, with a healthy dose of scientific method. We use data to solve business problems; problems which are usually posed at a very high level and in fairly uncertain terms. A typical example: "Our revenue over the winter months is typically lower than throughout the rest of the year. Why is this happening and how do we solve it?" Now the answer may be as simple as "We sell swimsuits, so duh!" In other verticals it may not be so obvious - what if instead of revenue drops in winter, it is slowdowns in website traffic? Sometimes the business stakeholder doesn't even have a specific metric in mind - a question could be as arbitrary as "why aren't we making money?" It's actually these questions I prefer to tackle - they give us complete control over the process, from data collection through to operationalizing the results, and allow us the flexibility to tackle the problem in the way we see fit.

The classic Venn diagram is here:

I like this version, as it stresses the importance of domain expertise, which is an often-overlooked piece of the puzzle. Without domain expertise, or access to subject matter experts, a data scientist won't necessarily know if their discovery is BS or not. This segues nicely into what I feel are basic requirements of the role from a personality or behavioral perspective. Curiosity is important, as is a hunger to learn. This intrinsic value placed on learning can be focused on the technical skills or the domain, but ideally both. I work in aerospace, so the ability and willingness to acquire a fundamental knowledge of the engineering systems we are being asked to assess is critical.

From a technical perspective, most folks I've seen become successful in the role are math majors with CS experience rather than the other way around. The statistical chops are crucial - this is the key value add to most companies. Most organizations have software engineers whose expertise can be relied upon to help scale and fully flesh out solutions that are rooted in analytics and data science. A healthy respect for software frameworks, algorithms, data structures and the product development lifecycle also helps. At the end of the day, data scientists are experts in taking arbitrarily defined business problems, collecting data, analyzing it, building models and operationalizing the results, usually in an iterative fashion.

In terms of tooling, the ever-present R vs. Python debate rages on. We use both extensively, leaning on Python when it comes time to productize and operationalize our solutions. Relational and non-relational database familiarity is important, as is the ability and willingness to learn the multitudes of new "Big Data" tools that seem to be released every other week. With IoT officially becoming a thing, it helps to have a passing interest in cloud technologies - AWS and MS Azure are the big players in this space.

One of the common misconceptions about the work we do is that data science is the same thing as machine learning, or AI. Machine learning is a tool or technique that data scientists may (or, more usually in my experience, may not) use to solve a given problem. Far more regularly, we can use simple regression or even heuristics to get similar results or solve basic problems in a far less complex way. More important is knowing which tool to apply to which problem, subject to the constraints involved. For example, the infamous Netflix prize was a crowd-sourced competition to improve their recommendation algorithm, with a million-dollar prize payout. The eventual winning team increased accuracy by ~10%, but their ensemble model was so complicated and difficult to deploy at scale that it never made it into production. Sometimes simpler is better, and it all depends on the problem to be solved.
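To make the "simple regression or even heuristics" point concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the toy numbers, the variable names `ad_spend` and `revenue`, and the helper functions `fit_line` and `mean_abs_error`. It compares a heuristic baseline (always predict the historical average) with a one-variable least-squares fit, which is often the first comparison worth running before reaching for anything heavier.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (hypothetical helper)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def mean_abs_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))

# Invented toy data, not real figures: monthly ad spend vs. revenue.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
revenue = [2.1, 3.9, 6.2, 7.8, 10.0]

# Heuristic baseline: always predict the historical average revenue.
baseline = [mean(revenue)] * len(revenue)

# Simple regression: fit a straight line through the same data.
slope, intercept = fit_line(ad_spend, revenue)
fitted = [slope * x + intercept for x in ad_spend]

baseline_mae = mean_abs_error(revenue, baseline)
regression_mae = mean_abs_error(revenue, fitted)

print(f"baseline MAE:   {baseline_mae:.2f}")
print(f"regression MAE: {regression_mae:.2f}")
```

If the straight line already beats the baseline by a wide margin, as it does on this toy data, a more complex model has to justify its extra deployment cost against that bar.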

We're still not sure what a Data Scientist actually is. My opinion, for what it's worth, is that they are folks who solve business problems by applying quantitative scientific methods, with the know-how to operationalize their solutions at scale. Curious and meticulous by nature, they love learning new things, both technically and across domains. Proficient in statistics and computer science, they are multi-disciplinary talents who get things done.


©2019 by Grumpy Old Nerd
