The statistics around data science hiring are misleading at best and outright false in many cases. This causes businesses to make critical mistakes when hiring or attempting to retain data science and machine learning talent. It also translates into confusion for those trying to break into the field.
We’re being bombarded with statistics about skills, experience, education, etc. that were gathered with flawed methodologies. As a field we’ve moved way beyond analytics, but when it comes to understanding the field itself, we’re still trapped in the olden days of basic metrics. To understand how to hire and how to break into the field, we need models, not more metrics.
I’ve built a set of models that classify employees based on the work they do. I’ve tested them on over 1.6 million experience descriptions. In my first passes, accuracy varied greatly based on what I call job type complexity. Job titles cannot be trusted as accurate labels for what an employee does, especially in complex jobs like data scientist.
To improve accuracy, I had to build capability-based models. Moving away from job titles, skills, years of experience, and education (all the conventional-wisdom metrics) reveals the bad assumptions baked into those metrics.
Flaws In The Current Metrics
The biggest flaw behind these numbers is the one I described in the introduction: they are all based on job titles. If I say I’m a data scientist, does that mean I’m a data scientist? Of course not. Are all data scientists the same? Of course not. The assumption that job title alone is accurate corrupts most of the metrics used to describe our field (and many others).
In order to get an accurate picture of the field, data scientists cannot be counted or analyzed based on job title. They need to be classified as data scientists, then segmented based on their actual work. That changes the numbers significantly.
A New Understanding Of Data Science
Remove just that flawed assumption and we get a new understanding of complex fields. In this post I’m using data science and machine learning as a specific example of what is a problem across job types.
Data analysts and researchers are the most frequently mislabeled workers. This is indicative of a field that is stratified with blurred edges, and that’s a good definition for all complex fields.
There’s a lot of analyst type work being done by data scientists and data science work being done by analysts. Data visualization, data cleaning/wrangling, and working to understand raw datasets are part of the data science and analyst skill set.
When does a data analyst become a data scientist, and when is a data scientist over-titled? The overlap at this end of the field is so great that there’s no good distinction. Both use basic models to gain value from data. Both use advanced analytics and tools.
This type of role is more accurately described as a Data Science Analyst. It is the most common segment of data scientists.
At the other end of the spectrum are researchers. There are several overlaps between senior data scientists focused on advanced projects and researchers (both in and out of academia). Based on job activities, these roles are more accurately described as Data Science Researcher or Machine Learning Researcher.
When Bad Data Is Used To Inform Actions
There are more segments. The important takeaway is that breaking the field into activity-based roles provides a more accurate understanding of what we should be teaching and hiring for.
This is common sense. Different types of work lead to different requirements. The field can’t be lumped together around an arbitrary job title. We also can’t measure the field based on data that doesn’t take activities into account.
Using basic metrics leads to bad assumptions, and those lead to more bad metrics. If most job descriptions require 2-3 years of experience, then most companies and employees will accept that as standard. If most job descriptions require a master’s or PhD, then most companies and employees will accept that as standard. When the survey takers analyze job descriptions and data science hires, they come up with metrics that are self-reinforcing.
A wag-the-dog effect is to blame for a lot of the confusion around requirements. Few companies are measuring the impact of requirements on job performance. Facebook and Google, among others, have dropped poorly supported requirements in favor of inclusive job descriptions. Their job descriptions talk about what the person will be doing and who they’ll be working with. The descriptions are designed to attract candidates based on capabilities rather than exclude as many candidates as possible based on unsupported metrics.
Better Data Leads To Hard Questions And More Complete Answers
The field has grown beyond just generalists and averages. If a job can’t be described in simple terms (years of experience, education, skills, etc.), creating a job description gets a lot harder. Screening resumes currently relies on these metrics to score candidates, and that’s the basis for most automated resume screening software. Even in machine learning based software, average metrics are the dominant features.
Aspiring data scientists want concrete objectives. Talking about capabilities over skills moves the objective towards business value as the most important measure of capability and readiness to enter the field. Without an average target, building a career has a larger element of uncertainty.
The same is true for hiring. More advanced systems are required to automate resume screening once job descriptions move to capabilities. Businesses need to connect capabilities to value. Neither is simple.
However, capabilities can tie directly to a project by specifying which capabilities are needed to complete each piece of the project. That is the bridge between a job description and ROI. It’s also a guide for those entering the field. The best capabilities to master have the most business value.
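As a rough illustration of tying capabilities to project pieces, here is a minimal Python sketch. The project pieces, capability names, and helper functions are all hypothetical and chosen only for the example; they are not taken from my models or from any real job description.

```python
# Hypothetical project broken into pieces, each listing the capabilities
# needed to complete it. All names here are illustrative assumptions.
PROJECT_PIECES = {
    "ingest customer data": {"build data pipelines"},
    "forecast demand": {
        "build forecasting models",
        "validate models with stakeholders",
    },
    "ship to production": {"deploy models as services"},
}


def coverage(candidate_caps: set) -> dict:
    """Report, per project piece, whether the candidate covers every needed capability."""
    return {
        piece: needed <= candidate_caps  # subset test
        for piece, needed in PROJECT_PIECES.items()
    }


def missing_capabilities(candidate_caps: set) -> list:
    """List the capabilities the project needs that the candidate lacks."""
    needed = set().union(*PROJECT_PIECES.values())
    return sorted(needed - candidate_caps)


candidate = {"build data pipelines", "build forecasting models"}
print(coverage(candidate))
print(missing_capabilities(candidate))
```

The same mapping reads in both directions: a hiring manager sees which pieces of the project a candidate can carry, and someone entering the field sees which capabilities are still missing for a given kind of project.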
Complex Jobs Are The New Normal
Like it or not, we’ve entered the age of complex jobs. That’s what my research has been aimed at over the last year, and job titles like Research Data Scientist and Data Science Analyst point to the solution.
The model I’ve found success with classifies roles based on the multiple job types required to be successful. A secondary model then pulls the specific capabilities from the role. I’ve described capabilities in a previous post, so I won’t rehash that here. Suffice it to say that capabilities are activities associated with what the employee does to build value for the business: how they apply their skills/experience/education/etc. to produce the desired business outcomes.
Requiring Python, going deeper to TensorFlow, or asking for 2 years of experience: even combined, these requirements are too vague. Neither key terms nor years of experience correlate with employee performance. There’s research on personnel selection methodologies going back to the 1990s to back that up.
The most successful candidate screening methods involve capabilities assessment. I’ve found evidence that capabilities have a strong correlation with how employees working in complex fields describe their role. I can see companies that hire large numbers of people for complex jobs using capabilities in their job descriptions. This is the new normal.
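To make the two-stage idea concrete, here is a toy Python sketch. It is not the model described above: simple keyword matching stands in for both trained models, and the keyword lists, verb list, and function names are assumptions made up for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword lists standing in for a trained multi-label classifier.
JOB_TYPE_KEYWORDS = {
    "analyst": {"dashboard", "report", "visualization", "cleaning"},
    "data_scientist": {"model", "prediction", "feature", "experiment"},
    "researcher": {"publication", "novel", "paper", "algorithm"},
}

# Hypothetical verbs that mark a sentence as describing a value-building activity.
CAPABILITY_VERBS = {"built", "reduced", "improved", "automated", "designed"}


@dataclass
class RoleProfile:
    job_types: list
    capabilities: list


def classify_job_types(description: str) -> list:
    """Stage 1: assign every job type whose keywords appear (a role can have several)."""
    words = set(re.findall(r"[a-z]+", description.lower()))
    return sorted(jt for jt, kws in JOB_TYPE_KEYWORDS.items() if words & kws)


def extract_capabilities(description: str) -> list:
    """Stage 2: keep sentences that open with a capability verb."""
    sentences = [s.strip() for s in description.split(".") if s.strip()]
    return [s for s in sentences if s.split()[0].lower() in CAPABILITY_VERBS]


def profile_role(description: str) -> RoleProfile:
    """Run both stages on one experience description; the job title is never used."""
    return RoleProfile(classify_job_types(description), extract_capabilities(description))


text = (
    "Built a churn prediction model that cut retention costs. "
    "Maintained a weekly dashboard for sales. Attended meetings."
)
print(profile_role(text))
```

The point of the sketch is the shape of the pipeline: the first stage allows more than one job type per role, and the second stage pulls out what the person did rather than which tools they listed.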