3 Ways to Think Like a (Great) Data Scientist


If you've gone through the process of learning how to code, you understand that it isn't just about memorizing syntax. It's about learning a new way of thinking.

First you learn the tools (syntax, data structures, algorithms, etc). Then you're given a problem, and you have to solve it in a way that efficiently uses those tools.

Data Science is the same. Working in this field means you encounter problems on a daily basis, and I don't just mean code bugs.

Examples of problems that data scientists need to solve:

  • How can I detect outliers in this dataset?
  • How can I forecast tomorrow's energy consumption?
  • How can I classify this image as a face or object?


Data scientists use a variety of tools to tackle these problems: Machine Learning, statistics, visualization, and more. But if you want to find optimal solutions, you need an approach that keeps certain principles in mind.

1. Prioritize data

Understand that data is the most important thing.

I know, that sounds really obvious. Let me explain.

One of the biggest mistakes made by people who are new to data science, as well as by non-technical people working with data scientists, is focusing too much on the wrong things, such as:

  • Choosing the most complex models
  • Tuning hyperparameters to excess
  • Trying to solve every data problem with machine learning

The field of data science and ML develops rapidly. There's always a new library, a faster technology, or a better model. But the most complicated, cutting-edge choice is not always the best choice. There are many considerations that go into selecting a model, including whether machine learning is even required.

I work in energy, and a big chunk of my work is outlier detection – whether that's removing outliers so I can train a model, or flagging them for further human inspection.

Some common reasons for outliers are:

  • Meter read errors
  • Unexpected operational change (power outage, equipment failure, new equipment install)
  • Unexpected external event (COVID, office party, office/building events, extreme weather)

One of the specific tasks I was faced with at work recently was to generate a threshold outside of which an electricity meter read value would be flagged as a potential outlier. This would cause an alert to be sent to the building managers.

My team and I experimented with a few different potential outlier detection methods, including isolation forest, linear regression, z-score and IQR.

In the end, we chose z-score over some of the fancier ML solutions. Why?

  • z-score was able to capture the majority of relevant anomalies (some domain knowledge also goes into this, which I'll get into in the next section)
  • z-score is highly interpretable (it literally translates to a number of standard deviations away from the mean) and is therefore easy to explain to stakeholders who may not have a tech/data science background
  • It's a simple calculation that barely requires any memory or computing resources (see the sketch after this list)
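To give a sense of just how simple it is, here's a minimal sketch of the calculation. The function name and the "kwh" column are illustrative, not our production code:

```python
import pandas as pd

def zscore_outliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Boolean mask: True where a value lies more than `threshold`
    standard deviations from the series mean."""
    z = (series - series.mean()) / series.std()
    return z.abs() > threshold

# Illustrative usage on a hypothetical meter-read column:
# df["is_outlier"] = zscore_outliers(df["kwh"])
```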

If you do decide to go the machine learning route, prioritize maximizing data quality through data cleaning, feature selection, and feature engineering. Do not over-focus on hyperparameter tuning.

I've worked with less experienced people who wanted to jump right to hyperparameter tuning to improve a poorly performing model. In reality, the data just wasn't good, and no amount of tuning would have made it so.

It doesn't matter how large your search space is or how many hours you've spent searching it with Bayesian optimization – if your data still looks like this, it's not going to cut it.

Horrific-looking time series data. Image by author

Hyperparameter tuning should be the last (and least important) step. It will not make a bad model good, it will only make a good model better.

Here are some specific things you should make sure you try before hyperparameter tuning:

  • Remove outliers using z-score, modified z-score, or IQR.
  • Impute (fill in) missing data, for example by carrying the previous value forward, interpolating, or filling with the median. This will only work if you don't have too many missing values.
  • Add new features to the model. For time series data, you can experiment with adding different combinations of hour of day, day of week, week of month, weekend, hour of week, week of year, month, season, quarter, and so on. You can also try out cyclical time series features. If an unexpected event caused the data to look very different for a certain period of time, you can add a binary indicator variable as a feature. (A sketch of these steps follows this list.)
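Here's a rough sketch of what those three steps can look like in pandas. The "kwh" column name, the IQR rule, and the specific features are assumptions for illustration, not a prescription:

```python
import numpy as np
import pandas as pd

def clean_and_featurize(df: pd.DataFrame, target: str = "kwh") -> pd.DataFrame:
    """Illustrative cleaning + feature engineering for a time series
    DataFrame with a DatetimeIndex and an energy-consumption column."""
    df = df.copy()

    # 1. Remove outliers with the IQR rule (1.5 * IQR beyond the quartiles).
    q1, q3 = df[target].quantile([0.25, 0.75])
    iqr = q3 - q1
    in_range = df[target].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    df.loc[~in_range, target] = np.nan

    # 2. Impute the gaps (only sensible when few values are missing).
    df[target] = df[target].interpolate(limit_direction="both")

    # 3. Add calendar and cyclical time features.
    df["hour"] = df.index.hour
    df["day_of_week"] = df.index.dayofweek
    df["is_weekend"] = (df.index.dayofweek >= 5).astype(int)
    df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
    df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
    return df
```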

2. Master domain knowledge

Data science exists in pretty much any field you can think of. Energy, finance, marketing, social media, food, you name it. This is cool because it means your skills are impactful in countless domains.

Domain knowledge refers to expertise or understanding within a specific field or subject matter.

Let's take energy (again) as an example. Say you are a data scientist in the industry and want to build a model that forecasts electric consumption for a building. How do you know which features you'll use to build your model? You'll need some basic knowledge of what kinds of variables typically affect electric usage, such as:

  • Temperature
  • Hour of day
  • Day of week
  • Month of year

This is a good place to start, but you'll need to dig deeper.

What kind of building is it? Is it commercial, industrial/manufacturing, or residential? This will affect how it responds to the variables above.

Commercial buildings will probably be busiest during typical work hours (9–5 Monday through Friday), give or take 1–2 hours. This means you might include a binary variable isWorkHours or isWeekend. You'll also want to take holidays into account. If it's a manufacturing plant, the hours and days will probably be different. Residential buildings will also have different schedules and will be affected by holidays differently.
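To make that concrete, here's a rough sketch of those schedule flags in pandas. The 9–5 window, the feature names, and the use of pandas' US federal holiday calendar are illustrative assumptions; the right schedule depends on the building:

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

def add_schedule_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add simple schedule flags, assuming df has a DatetimeIndex."""
    df = df.copy()
    df["is_weekend"] = (df.index.dayofweek >= 5).astype(int)
    df["is_work_hours"] = (
        (df.index.dayofweek < 5) & df.index.hour.isin(range(9, 17))
    ).astype(int)

    # Flag holidays so the model doesn't treat them like normal workdays.
    holidays = USFederalHolidayCalendar().holidays(
        start=df.index.min(), end=df.index.max()
    )
    df["is_holiday"] = df.index.normalize().isin(holidays).astype(int)
    return df
```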

Domain knowledge doesn't just help you build a model for the first time, it also informs the kinds of results you end up delivering. Let's revisit the outlier detection example. Typically, when flagging outliers using z-score, an outlier is considered anything with a z-score > 3 or < -3. In other words, a value can be an outlier because it is either too small or too large.

But domain knowledge and regular interaction with customers informed me that we didn't really want to worry about whether a meter's usage was too low, only if it was too high. We set the initial z-score threshold to 3, and only values with a score above 3 would be considered outliers. Without this knowledge we might have flooded the managers' inboxes with unnecessary email alerts.
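In code terms, that domain decision amounts to dropping the absolute value from the usual two-sided rule. A minimal sketch (the function name is mine, not production code):

```python
import pandas as pd

def high_usage_outliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """One-sided rule: only flag unusually *high* values.
    A two-sided rule would use z.abs() > threshold instead."""
    z = (series - series.mean()) / series.std()
    return z > threshold
```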

You won't be able to build and train a truly effective model without understanding the problem's background. Increasing your domain knowledge means developing a sense of intuition for how to go about solving problems in your industry.

3. Think in terms of iterative processes

Data scientists understand that most of their work is iterative and cyclical. There is a good reason why it's called the machine learning life cycle and not the machine learning straight arrow. Oftentimes, you'll work on a model, test it, and maybe even deploy it – only to end up right back in the development environment after a few weeks in production.

A very basic machine learning life cycle diagram. Image by author

That's not a failure, that's just how it goes with data science projects.

To be successful, you need to have a system in place that not only allows for iterative development, but encourages it. One of the ways data scientists can do this is by running experiments. An experiment is essentially a run of a model under a particular set of circumstances (e.g., different features, different preprocessing, scaled vs. unscaled data), with performance compared across runs. The goal is to find the best-performing model for deployment.

Running experiments can be tedious if you're just running a notebook over and over, copying and pasting your results into a Google Sheet and comparing them one by one. It takes a lot of time and effort to record every single piece of important information about a model this way. And I'm not just talking about metrics like MSE, R², and MAPE. You probably also want to store:

  • The size of your train and test sets (and for time series, the start and end dates)
  • The date/time your model was trained
  • Feature importances or coefficients
  • The model as a pickle file
  • Your train and test files
  • Other visual charts such as residual charts, line/scatter plots of the data

Of course, you could store all of this manually in different locations, but who wants to do that? And what if you do all of that, only to change the model 5 more times anyway? It turns into a big mess.

There are a variety of great platforms built specifically for data science experiment tracking, with MLflow being one of the most popular. Recently, I also discovered Neptune.ai, which has an intuitive interface and a great user experience. It is free to use, integrates with Jupyter notebooks and Python, and allows you to track images, files, charts, and lots of other forms of model metadata and artifacts. You can see each experiment in the order it was run, as well as compare experiments.
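For a sense of what that integration looks like, here's a minimal MLflow sketch. The run name, parameter values, metric numbers, and file paths are all placeholders; Neptune.ai's API is different but covers the same ground:

```python
import mlflow

with mlflow.start_run(run_name="baseline-zscore-features"):
    # Parameters describing the experiment setup (placeholder values).
    mlflow.log_param("train_start", "2024-01-01")
    mlflow.log_param("train_end", "2024-06-30")
    mlflow.log_param("features", "hour, day_of_week, is_work_hours")

    # Metrics from your evaluation (placeholder numbers).
    mlflow.log_metric("mse", 12.3)
    mlflow.log_metric("r2", 0.87)
    mlflow.log_metric("mape", 0.06)

    # Artifacts: serialized model, train/test files, residual plots, etc.
    # (assumes these files exist locally)
    mlflow.log_artifact("model.pkl")
    mlflow.log_artifact("residuals.png")
```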

I recommend putting the time and effort into learning an experiment tracking tool and integrating it with your codebase. It will save you time and headaches.

Overall

To be a (great) data scientist, one who can solve data problems successfully, you need to look at things in a certain way. That means prioritizing data quality over model complexity, striving to increase your domain-specific knowledge, trusting the process, and losing the expectation that you'll always be able to solve a problem in one go. Stick to these principles and you'll thrive in your career.
