Mathematics I Look for in Data Scientist Interviews
As someone who has been involved in hiring data scientists and applied scientists at Amazon for the past few years, whether as the hiring manager, an interviewer or an interviewee, I have come to realize that while most candidates are familiar with the latest machine learning models, they often lack a solid grounding in the fundamentals of mathematics, statistics, and probability. This gap in foundational science knowledge can limit their ability to define the right hypothesis to solve a problem, collect the right type of sample, or interpret the results of their scientific work. That is why I decided to write this post explaining the most common foundational topics for data scientists. The good news is that even a working familiarity with these topics can set us apart from other candidates. These mathematical skills are powerful tools for better understanding the problems that we encounter daily as data scientists. By building these muscles and exercising them over time, we will reap the benefits throughout our careers.
I have broken down these foundational mathematical topics into the following four buckets (which are not always mutually exclusive):
- Statistics
- Calculus
- Linear Algebra
- Probability
I will go over each topic separately in this post and then we will walk through a few Python examples for each to develop a deeper understanding. Given the ambitious breadth of topics covered in this post, I will not be able to go very deep into some of them; instead, I will refer to other in-depth posts where relevant.
Let's get started!
1. Statistics
Statistics is the area where I have seen the most improvement opportunities for our new hires. Statistics enables us to make inferences about data, identify and quantify relationships among variables and validate hypotheses. I will cover three main areas in the realm of statistics, namely descriptive statistics, inferential statistics and sampling, with some examples. For instance, in data science interviews we encounter situations where we want to find existing patterns in the data, which requires a foundation in descriptive statistics. Or, in questions where we are looking for a property of a larger population, we can collect and rely on a representative sample, which is part of inferential statistics.
Let's look at each one separately.
1.1. Descriptive Statistics
Descriptive statistics helps us summarize and describe (as the name suggests) the main features of a data set. Some common measurements are mean, median and standard deviation. Let's look at a simple example together:
# import libraries
import numpy as np
import pandas as pd
# create a dataset
data = {'Values': [10, 12, 15, 20, 22, 25, 30, 35, 40, 50]}
# create a dataframe of the dataset
df = pd.DataFrame(data)
# calculate descriptive statistics
mean = df['Values'].mean()
median = df['Values'].median()
std_dev = df['Values'].std()
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Standard Deviation: {std_dev}")
Results:

As explained before, the code snippet calculates the descriptive statistics of mean, median and standard deviation, which are used to summarize the data and better understand it.
One of the popular methodologies of descriptive analysis is called "Univariate Analysis", which involves analyzing a single variable to summarize its main features, such as mean, median, mode, range, standard deviation, etc. I have an in-depth tutorial on that topic so I will not cover it here, but feel free to review it in the link below:
1.2. Inferential Statistics
Inferential statistics allows us to make inferences about a population (i.e. all of the data) by using smaller samples of that population. In most data science interviews, this is the area where you will directly face statistical questions (although you still need to indirectly rely on your knowledge of descriptive statistics). For example, questions asking how "confident" you are about a specific "hypothesis" relate to this part. Inference is usually drawn through concepts such as confidence intervals and hypothesis testing. This area can be quite large, so I will include an example and then link to other posts that go deeper into some of the topics related to inferential statistics.
In the example below, we will calculate the mean and the 95% confidence interval around that mean.
# import libraries
import numpy as np
import scipy.stats as stats
# generate sample data
np.random.seed(0)
data = np.random.normal(loc=20, scale=5, size=100) # mean=20, std_dev=5
# calculate confidence interval for the mean
confidence_level = 0.95
mean = np.mean(data)
sem = stats.sem(data) # standard error of the mean
confidence_interval = stats.t.interval(confidence_level, len(data)-1, loc=mean, scale=sem)
# print results
print(f"Mean: {mean}")
print(f"95% Confidence Interval: {confidence_interval}")
Results:

The results indicate that the average value of the data set we generated is around 20.30. The 95% confidence interval implies that we are 95% confident that the true population mean lies between 19.29 and 21.30. We are assuming that our data set is a "sample" from a larger "population". Since we are unable to calculate the mean of the actual population, we rely on inferential statistics: based on this small sample, the confidence interval provides a range that likely contains the actual mean of the population.
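The other inferential tool mentioned above is hypothesis testing. As a minimal sketch (reusing the simulated data and assuming, purely for illustration, a null hypothesis that the population mean equals 20), a one-sample t-test could look like this:
# import libraries
import numpy as np
import scipy.stats as stats
# generate the same sample data as before
np.random.seed(0)
data = np.random.normal(loc=20, scale=5, size=100) # mean=20, std_dev=5
# test the (assumed) null hypothesis that the population mean equals 20
t_stat, p_value = stats.ttest_1samp(data, popmean=20)
# print results
print(f"t-statistic: {t_stat}")
print(f"p-value: {p_value}")
At a significance level of 0.05, a p-value above 0.05 means we fail to reject the null hypothesis that the population mean is 20, while a p-value below it would lead us to reject it.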
One powerful and popular methodology used in inferential statistics is called "Multivariate Analysis", which includes techniques for analyzing multiple variables simultaneously to understand the relationship among them. I have a separate in-depth tutorial on that topic so I won't cover it in this post but I will include it below for reference:
As we saw in the example above, sampling is a key area of focus in statistics so we will discuss that next.
1.3. Sampling
Sampling is a foundational technique in statistics and data science – I cannot emphasize enough how important this topic is. In my experience, a number of less successful candidates just brush over the sampling portion of their interview exercises, and even when we as interviewers attempt to guide them towards it, they say "we just collect a random sample". Although a random sample may be the right approach at times, it is not always the right solution. Collecting the right sample can be the difference between a successful and an unsuccessful project, which is why science teams care about it.
Sampling is used to select a representative subset of data, called the "sample", from a larger "population". The main idea is that since we usually do not have the resources to evaluate the entire population, we collect representative samples and make observations about the characteristics of the population based on those samples – and it works very well!
There are many sampling methodologies available, but let's cover a few of the most common ones here:
- Simple Random Sample or SRS: This is probably the first method everyone thinks of when they hear about sampling. It means that we collect a random sample from a population and consider it to be representative, which implies that every member of the population has an equal chance of being selected. There are cases where this is not an ideal approach. For example, let's say we have a population of individuals where 90% of them are above 50 years old and only 10% are younger than 50. If we collect a random sample of 100 and happen to draw 95 of them from the ones older than 50, our sample now implies that 95% of the population is older than 50, which we know is not true. There are sampling methodologies that account for these types of populations, which we will look at next.
- Stratified Sample: This means that the population is divided into distinct subgroups based on specific characteristics – similar to our example above. Samples are then drawn from each subgroup separately to ensure each subgroup is adequately represented. This of course requires us to know about the existing subgroups (see the short sketch after this list).
- Systematic Sampling: This is when a random starting point is selected and then every nth member of the population is chosen. This method is efficient for large populations with minimal variation.
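To make the stratified idea concrete, here is a minimal sketch, assuming a hypothetical population where 90% of individuals are over 50 years old (the age_group labels and proportions below are made up for illustration):
# import libraries
import numpy as np
import pandas as pd
# hypothetical population: 90% over 50 years old, 10% under 50 (illustrative numbers)
np.random.seed(0)
population = pd.DataFrame({
    'age_group': ['over_50'] * 900 + ['under_50'] * 100,
    'value': np.random.normal(loc=50, scale=10, size=1000)
})
# draw 10% from each subgroup so that both groups remain proportionally represented
stratified_sample = population.groupby('age_group').sample(frac=0.1, random_state=0)
# print how many rows came from each subgroup
print(stratified_sample['age_group'].value_counts())
Each subgroup contributes in proportion to its size (90 and 10 rows here), instead of leaving representation to chance.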
If you are interested in a more in-depth review of sampling, I have a separate post dedicated to it, which you can refer to for further reading:
Let's look at a simple random sampling example together here:
# import libraries
import numpy as np
# creating a population of 1000 values
population = np.arange(1, 1001)
# select a random sample of 10 values
sample = np.random.choice(population, size=10, replace=False)
print("Random Sample:")
print(sample)
Results:

As expected from the code, a random set of 10 values was selected, so there is no rhyme or reason to which values were chosen from the population, which makes it a good random sample.
We will cover calculus next!
2. Calculus
Concepts from calculus underpin many of the optimization strategies used to train machine learning models. Specifically, it is through calculus that we can understand how changes in the input variables affect the output variable – and with this knowledge we develop various training and optimization strategies.
Calculus might not seem like an area of interest in data science interviews until we start thinking about machine learning. Machine learning models are essentially based on minimizing or maximizing a given objective function, which is an optimization task. Optimization relies on foundational algorithms such as gradient descent, which itself is based on derivatives. In today's data science interviews, we are guaranteed to be asked about machine learning, which brings us back to the knowledge of calculus.
Let's look at two of the most important topics: derivatives and gradient descent.
2.1. Derivatives
I know that not everyone has the most pleasant memories of studying derivatives, but they are crucial for developing a better understanding of machine learning training and optimization. Derivatives represent the rate of change of a function with respect to its variables – one could argue that if you understand this very basic concept, you can build a neural network's backpropagation from scratch, but we are getting ahead of ourselves. In machine learning, derivatives are used to determine the slope of the cost function, which helps us find the optimal model. Let's look at an implementation in Python.
# import libraries
import sympy as sp
# define the variable x and function f(x)
x = sp.Symbol('x')
f = x**2 + 2*x + 1
print(f"f(x)= {f}n")
# calculate the derivative
derivative = sp.diff(f, x)
print("Derivative of f(x):")
print(derivative)
Results:

Results look correct, based on what I remember from derivatives: the derivative of f(x) = x^2 + 2x + 1 is 2x + 2.
2.2. Gradient Descent
Gradient descent, which you may have heard of in the context of neural networks, is an iterative optimization algorithm used to minimize an objective function (which is the goal of model training). The idea is quite simple (and I have simplified it further here): by calculating the derivative (or gradient), we can take steps in the direction that reduces the error. If we take enough of these steps, eventually we converge to the optimal solution.
In order to better understand it, we can visualize it in Python. Let's go back to the previous example of f(x) = x^2 + 2x + 1, where the goal is to find the minimum value using gradient descent.
# import libraries
import numpy as np
import matplotlib.pyplot as plt
# define function and its derivative
def f(x):
    return x**2 + 2*x + 1
def df(x):
    return 2*x + 2
# gradient descent parameters
x = 5 # initial guess (or sometimes called a seed)
learning_rate = 0.05
n_iterations = 50
# perform gradient descent
x_values = [x]
for _ in range(n_iterations):
    gradient = df(x)
    x = x - learning_rate * gradient
    x_values.append(x)
# plot the function and descent path
x_range = np.linspace(-6, 4, 100)
y_range = f(x_range)
plt.plot(x_range, y_range, label='f(x) = x^2 + 2x + 1')
plt.scatter(x_values, [f(i) for i in x_values], color='red')
plt.title('Gradient Descent Visualization')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.legend()
plt.show()
Results:

This is a simple function, so we knew that the minimum would be at x = -1, but we assumed that we didn't have that information and started at x = 5. Then, through the process of gradient descent with a learning rate of 0.05 and 50 iterations, the process moved away from x = 5 and got closer and closer to the minimum at x = -1. Impressive!
Let's move on to linear algebra next.
3. Linear Algebra
Linear algebra is the mathematical branch dealing with vectors, matrices and the operations among them. These topics may seem basic, but they are the true foundation of many machine learning models, such as neural networks, which are among the most in-demand areas in data science roles these days. Many of the machine learning models and approaches that we encounter during data science interviews, such as reinforcement learning, are implemented using vectors, matrices and the operations among them, which we will cover in this section.
Let's briefly define these concepts and then look at a few examples.
3.1. Vectors and Matrices
These are used to represent data sets or even embeddings in more advanced contexts. When it comes to matrices and implementation of machine learning models, rows often correspond to individual data points and columns usually represent features. Let's look at an example.
# import libraries
import numpy as np
# define a vector
vector = np.array([1, 2, 3])
print("Vector:")
print(f'{vector}\n')
# define a matrix
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("Matrix:")
print(matrix)
Results:

You can see how the vector is a single row, while the matrix has three rows and three columns.
3.2. Matrix Operations
Now that we know how to define matrices, it is time to start manipulating them with operations such as multiplication (used to combine matrices or to transform them), transposition (flipping a matrix over its diagonal to interchange rows and columns) and inversion (finding a matrix that when multiplied by the original matrix, yields the identity matrix, with 1s in the main diagonal and 0s elsewhere). We will see these in the example below:
# import libraries
import numpy as np
# create matrices
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print("Matrix A:")
print(A)
print()
print("Matrix B:")
print(B)
print()
# multiplication
C = np.dot(A, B)
print("Matrix C (Multiplication Result):")
print(C)
print()
# transposition
A_transpose = np.transpose(A)
print("Transpose of Matrix A:")
print(A_transpose)
print()
# inversion
A_inverse = np.linalg.inv(A)
print("Inverse of Matrix A:")
print(A_inverse)
print()
# verify inversion
D = np.dot(A, A_inverse)
print("Multiplication of A and its inverse:")
print(D)
Results:

As we can see, Matrix C is the result of multiplying matrices A and B. Then we see how transposing matrix A flipped it over its diagonal (note how 2 and 3 changed positions), and finally we see that when we multiply matrix A by its inverse, we get the identity matrix, with some rounding error.
3.3. Eigenvalues and Eigenvectors
Eigenvalues are numbers (or scalars) that indicate how much a vector is scaled (or stretched) during a linear transformation, without changing its direction. Eigenvectors, on the other hand, are vectors whose direction remains unchanged when a linear transformation is applied. Do not worry if you did not fully grasp this, since their application is more important to understand than the pure concept. They are quite helpful in dimensionality reduction exercises, such as principal component analysis or PCA, which is one of many techniques used to identify the most important features in a data set or a machine learning model. I have a full tutorial on this topic so I won't go into details here, but I will provide an example for reference. Here's the PCA tutorial:
# import libraries
import numpy as np
# define a matrix
A = np.array([[3, 1], [1, 3]])
# calculate eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:")
print(eigenvalues)
print()
print("Eigenvectors:")
print(eigenvectors)
Results:

For this matrix, the eigenvalues come out to 4 and 2, and the corresponding eigenvectors point along the [1, 1] and [1, -1] directions. Lastly, we will cover probability.
4. Probability
Let's face it – businesses deal with uncertainty and hire us as scientists to help them be better prepared for an uncertain future by making predictions and forecasts. So, logically, there is a large emphasis in interviews on questions dealing with uncertainty and forecasting – probability provides the foundation for understanding uncertainty.
This holds whether we work in the traditional sense, such as making time series forecasts for the stock market or inventory management, or in more recent contexts such as large language models (LLMs), where each token is predicted based on the past tokens in a path-dependent way. In short, probability helps data scientists understand uncertainty, model randomness and ultimately make predictions. For this section, I have selected two main topics to talk about – let's jump in.
4.1. Probability Distributions
Probability distributions, such as the normal, binomial and uniform distributions, describe how data points are distributed. If we know the distribution of the data, we can make much better predictions about how it behaves. If there is one distribution you learn, make it the "normal" distribution. Let's first generate a normal distribution and then talk about its characteristics a bit more.
# import libraries
import numpy as np
import matplotlib.pyplot as plt
# generate a normal distribution
mean = 0
std_dev = 1
samples = np.random.normal(mean, std_dev, 1000)
# plot histogram
plt.hist(samples, bins=30, density=True)
plt.title('Normal Distribution')
plt.xlabel('Value')
plt.ylabel('Density')
plt.show()
Results:

The generic look above is what we consider a normal distribution, which has a few important characteristics:
- Symmetry: The distribution is symmetric around the mean (which is 0 in the example above), meaning the left side of the mean mirrors the right side.
- Mean, Median, Mode Equality: In a normal distribution, the mean, median and mode are all equal and located at the center of the distribution. The mean is the average, the median is the middle value when the data are sorted (ascending or descending), and the mode is simply the value in the data set with the highest frequency.
- 68, 95, 99.7 Rule (aka empirical rule): This is an important and useful one. It means that approximately 68% of the data falls within one standard deviation of the mean, 95% of the data falls within two and 99.7% within three! I'll show this in the graph below.
# import libraries
import numpy as np
import matplotlib.pyplot as plt
# define the parameters of the normal distribution
mean = 0
std_dev = 1
# compute the x range and the corresponding normal density curve
x = np.linspace(mean - 4*std_dev, mean + 4*std_dev, 1000)
y = (1 / (std_dev * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mean) / std_dev) ** 2)
plt.plot(x, y, color='black')
# adding vertical lines for the 68-95-99.7 rule
plt.axvline(mean - std_dev, color='blue', linestyle='--', linewidth=1, label='68% of data (±1 SD)')
plt.axvline(mean + std_dev, color='blue', linestyle='--', linewidth=1)
plt.axvline(mean - 2*std_dev, color='green', linestyle='--', linewidth=1, label='95% of data (±2 SD)')
plt.axvline(mean + 2*std_dev, color='green', linestyle='--', linewidth=1)
plt.axvline(mean - 3*std_dev, color='red', linestyle='--', linewidth=1, label='99.7% of data (±3 SD)')
plt.axvline(mean + 3*std_dev, color='red', linestyle='--', linewidth=1)
# plot
plt.title('68-95-99.7 Rule')
plt.xlabel('Value')
plt.ylabel('Density')
plt.legend(loc='upper right')
plt.show()
Results:

The plot above shows how, as we move away from the mean, more and more of the data is covered: roughly 68% within one standard deviation, 95% within two and 99.7% within three.
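As a quick numerical sanity check of the rule (a minimal sketch; the exact fractions will vary slightly from run to run since the samples are random), we can count how many generated values fall within each band:
# import libraries
import numpy as np
# generate samples from a standard normal distribution
np.random.seed(0)
samples = np.random.normal(loc=0, scale=1, size=100000)
# fraction of samples within 1, 2 and 3 standard deviations of the mean
for k in [1, 2, 3]:
    fraction = np.mean(np.abs(samples) <= k)
    print(f"Within {k} standard deviation(s): {fraction:.3f}")
The printed fractions land very close to 0.68, 0.95 and 0.997, matching the rule.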
Next we will talk about conditional probability and Bayes' Theorem.
4.2. Bayes' Theorem
This theorem is one of the most important and yet under-appreciated results in probability. The idea is very simple – it provides a way to update the probability of a hypothesis based on new evidence. Let's think about an example. Say I am meeting my cousin John today at 6 PM. What do you think is the probability of John driving there in a Tesla? You can start making assumptions and then come up with a probability. Now let's say I give you a hint, which we will call "evidence" in the language of Bayes' Theorem: John is 75 years old. What if I told you instead that my cousin John is 15 years old and in high school? How does each statement change your answer? As you can see, we adjust the probability based on the new "evidence" provided to us. Bayes' Theorem formalizes this way of thinking in probability as follows:
P(A|B) = (P(B|A) * P(A)) / P(B)
- P(A|B): Probability of event A occurring given that B is true (posterior).
- P(B|A): Probability of event B occurring given that A is true.
- P(A): Probability of event A occurring (prior).
- P(B): Probability of event B occurring (evidence).
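To make the formula concrete, here is a minimal sketch of plugging numbers into Bayes' Theorem for the Tesla example; all of the probabilities below are made-up assumptions for illustration, not real statistics:
# prior: probability that John drives a Tesla before seeing any evidence (assumed)
p_tesla = 0.05
# likelihood: probability that John is 75 years old given that he drives a Tesla (assumed)
p_age_given_tesla = 0.10
# evidence: overall probability that John is 75 years old (assumed)
p_age = 0.20
# Bayes' Theorem: P(Tesla | 75 years old) = P(75 years old | Tesla) * P(Tesla) / P(75 years old)
p_tesla_given_age = (p_age_given_tesla * p_tesla) / p_age
print(f"Posterior probability: {p_tesla_given_age}")
With these assumed numbers, the posterior drops from 5% to 2.5%, which is exactly the kind of update the evidence about John's age is meant to trigger.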
I have covered the use of Bayesian Optimization in hyperparameter optimization of machine learning problems in depth in the post below so I won't get into it here.
Hyperparameter Optimization with Bayesian Optimization – Intro and Step-by-Step Implementation…
Conclusion
In this post we talked about the importance of statistics, calculus, linear algebra and probability in the day-to-day work of a data scientist. These are the four areas within mathematics where a data scientist can build an excellent foundation, which will pay dividends throughout one's career. In each of the four, we walked through some of the most important and frequently used concepts and then looked at examples together.
Thanks For Reading!
If you found this post helpful, please follow me on Medium and subscribe to receive my latest posts!
(All images, unless otherwise noted, are by the author.)