SQL Knowledge You Need For Data Science
According to an article by 365DataScience, which surveyed 1,000 LinkedIn Data Science job postings, 60% required SQL.
What does this tell us?
Well, SQL is an essential skill for a data scientist to have if they want to increase their chance of getting a job.
In this article, I will discuss the SQL knowledge you should have to land an entry-level data science role and provide some resources and advice that helped me when I was learning SQL.
Knowledge
For many entry-level data science jobs, from my experience, you don't need to be an expert in SQL; you just need to know how to query and get the data you need to analyse and build your machine-learning models.
For example, creating ETL (extract transform load) pipelines and managing databases are typically outside a data scientist's remit and are more the work of data engineers. However, that shouldn't stop you from learning these types of things if they interest you!
Overview
To start with, you should gain an understanding of relational databases and the different types of SQL flavours. I am not saying you need to learn the flavours; just understand the differences between them and why certain ones exists. Treat this as some preliminary reading before diving into the main content!
Based on my experience, I recommend learning MySQL or PostgreSQL as these are the ones used most throughout industry. However, most syntax and functions across all SQL flavours are the same, so I wouldn't worry too much about this. If you learn one, the others are pretty easy to pick up, so don't overthink it.
See this article if you are interested in learning the differences between SQL flavours.
Basic Functions
To be honest, I use the basic SQL functions 95% of the time when querying in my day job. As I said before, as a data scientist, I mainly use SQL to get data and do basic transformations.
I actually do more sophisticated transformations and data manipulation in Pandas, as it's easier to test along the way through unit and integration tests.
Anyway, the primary and most important SQL commands you should know are:
For all these commands, I have provided examples of how you perform them using an imaginary
Employees
table.
*SELECT FROM** (standard query)
SELECT * FROM Employees
This command selects columns from a table. The *
command gets all columns from the table Employees
.
ALTER, INSERT, CREATE (modify tables)
-- CREATE table
CREATE TABLE Employees (
EmployeeID INT PRIMARY KEY,
FirstName NVARCHAR(50),
LastName NVARCHAR(50),
BirthDate DATE,
Salary DECIMAL(10, 2)
);
-- INSERT a new record into the Employees table
INSERT INTO Employees (EmployeeID, FirstName, LastName, BirthDate, Salary)
VALUES (1, 'John', 'Doe', '1980-01-01', 50000.00);
-- ALTER the Employees table to add a new column
ALTER TABLE Employees
ADD Department VARCHAR(50);
Here, we create a table called Employees
__ and define the columns along with their primary key. Then, we add a new row to this data frame using INSERT INTO
. Finally, we ALTER
the table to add a new column called Department.
GROUP BY, ORDER BY
-- GROUP BY
SELECT Department, COUNT(*) AS NumberOfEmployees
FROM Employees
GROUP BY Department;
-- ORDER BY
SELECT * FROM Employees
ORDER BY LastName ASC, FirstName ASC;
These commands are pretty self explanatory. The ORDER BY
command just sorts the table a column and GROUP BY
just aggregates by a column. Here, we do an employee count by department.
WHERE, AND, OR, BETWEEN, IN, HAVING (filter tables)
-- WHERE, AND, and OR
SELECT * FROM Employees
WHERE Salary > 40000 AND Department = 'Sales';
-- BETWEEN
SELECT * FROM Employees
WHERE BirthDate BETWEEN '1970-01-01' AND '1990-12-31';
-- IN
SELECT * FROM Employees
WHERE Department IN ('Sales', 'HR');
-- GROUP BY and HAVING
SELECT Department, AVG(Salary) AS AverageSalary
FROM Employees
GROUP BY Department
HAVING AVG(Salary) > 45000;
The WHERE
and HAVING
commands filter the dataframe by a column. In this case we are filtering by salary, birth day and department.
AVG, COUNT, MIN, MAX, SUM (aggregate functions)
-- AVG, COUNT, MIN, MAX, SUM
SELECT
AVG(Salary) AS AverageSalary,
COUNT(EmployeeID) AS NumberOfEmployees,
MIN(Salary) AS MinimumSalary,
MAX(Salary) AS MaximumSalary,
SUM(Salary) AS TotalSalary
FROM Employees;s
These create statistics from the dataset, the return of this is just one row with the average salary, total employees, minimum salary, maximum salary and total salary.
DISTINCT
SELECT DISTINCT Department FROM Employees;
If there are duplicate entries in a column, the DISTINCT
command will get the unique values only.
DATEADD, DATEDIFF, DATEPART (date and time functions)
-- DATEADD: Add one year to the birthdate
SELECT EmployeeID, FirstName, LastName, DATEADD(year, 1, BirthDate) AS BirthDatePlusOneYear
FROM Employees;
-- DATEDIFF: Calculate age of employees
SELECT EmployeeID, FirstName, LastName, DATEDIFF(year, BirthDate, GETDATE()) AS Age
FROM Employees;
-- DATEPART: Extract the year of birth
SELECT EmployeeID, FirstName, LastName, DATEPART(year, BirthDate) AS BirthYear
FROM Employees;
Here the DATEADD
is adding one year to the birth day, DATEDIFF
is getting the year difference between an employees birth day and current date, and DATEPART
gets the year of birth from the birthday.
CASE (basically if-else statement in other languages)
SELECT EmployeeID, FirstName, LastName,
CASE
WHEN Salary > 60000 THEN 'High'
WHEN Salary BETWEEN 40000 AND 60000 THEN 'Medium'
ELSE 'Low'
END AS SalaryCategory
FROM Employees;
The CASE
acts like a conditional if-else. In this case (no pun intended), if salary is over 60,000 then its "high", between 40,000 and 60,000 its "medium" and else its "low."
To be clear, these completely arbitrary filters I made shouldn't be taken seriously.
FULL JOIN, LEFT JOIN, RIGHT JOIN, INNER JOIN, UNION(all the types of joins)
-- INNER JOIN
SELECT e.EmployeeID, e.FirstName, e.LastName, d.DepartmentName
FROM Employees e
INNER JOIN Departments d ON e.DepartmentID = d.DepartmentID;
-- LEFT JOIN
SELECT e.EmployeeID, e.FirstName, e.LastName, d.DepartmentName
FROM Employees e
LEFT JOIN Departments d ON e.DepartmentID = d.DepartmentID;
-- RIGHT JOIN
SELECT e.EmployeeID, e.FirstName, e.LastName, d.DepartmentName
FROM Employees e
RIGHT JOIN Departments d ON e.DepartmentID = d.DepartmentID;
-- FULL JOIN
SELECT e.EmployeeID, e.FirstName, e.LastName, d.DepartmentName
FROM Employees e
FULL JOIN Departments d ON e.DepartmentID = d.DepartmentID;
-- UNION
SELECT FirstName, LastName FROM Employees
UNION
SELECT FirstName, LastName FROM Managers;
The JOIN
commands simply joins two tables together on certain columns and id's that are common between the two tables. See here for a visual explanation of the different types and what they are doing under the hood.
The JOIN operations are probably the most tricky, so ensure you understand them! I still make mistakes and had a bad experience with them in my first job, but I have gotten a lot better at catching them. So, don't be like me and learn them thoroughly!
Knowing all these basic function will probably enable you to pass any entry-level data science interview that may contain an SQL exercise.
Advanced Functions
Once you master the basics, investing more time in learning some of the more advanced functions is worthwhile. I use some of these regularly in my day job, so they come in handy, particularly when you build big datasets because the compute time in a SQL warehouse is lightyears quicker than in Python.
Common Expression Tables (CTE) and SubQueries
-- Common Table Expression (CTE)
WITH SalesCTE AS (
SELECT EmployeeID, SUM(SalesAmount) AS TotalSales
FROM Sales
GROUP BY EmployeeID
)
SELECT e.EmployeeID, e.FirstName, e.LastName, s.TotalSales
FROM Employees e
JOIN SalesCTE s ON e.EmployeeID = s.EmployeeID;
-- Subquery
SELECT EmployeeID, FirstName, LastName, Salary
FROM Employees
WHERE Salary > (SELECT AVG(Salary) FROM Employees);
What's going on here is that we are creating a new table called SalesCTE
and then we are joining this new table onto Employees
. This is an example of a CTE.
The subquery is not creating a new table but instead pulling a value from a table to filter on. In this case, it is the average salary.
User Defined Functions (UDFs)
CREATE FUNCTION dbo.GetEmployeeFullName(@EmployeeID INT)
RETURNS NVARCHAR(100)
AS
BEGIN
DECLARE @FullName NVARCHAR(100);
SELECT @FullName = FirstName + ' ' + LastName
FROM Employees
WHERE EmployeeID = @EmployeeID;
RETURN @FullName;
END;
-- Using the UDF
SELECT dbo.GetEmployeeFullName(EmployeeID) AS FullName
FROM Employees;
UDFs work very similar to regular functions in other languages. Here, we create a function that gets the employee's full name and returns it in a single column.
Window Functions (RANK, ROW_NUMBER, DENSE_RANK)
-- ROW_NUMBER
SELECT EmployeeID, FirstName, LastName, Salary,
ROW_NUMBER() OVER (PARTITION BY Department ORDER BY Salary DESC) AS RowNum
FROM Employees;
-- RANK
SELECT EmployeeID, FirstName, LastName, Salary,
RANK() OVER (PARTITION BY Department ORDER BY Salary DESC) AS Rank
FROM Employees;
-- DENSE_RANK
SELECT EmployeeID, FirstName, LastName, Salary,
DENSE_RANK() OVER (PARTITION BY Department ORDER BY Salary DESC) AS DenseRank
FROM Employees;
Window functions produce calculations across multiple rows that are related to the current row we are analysing.
So in this case, we use ROW_NUMBER()
to number the employees by their salary in their corresponding department.
RANK()
does a similar thing and ranks employees by their salary in each department.
DENSE_RANK()
is similar again, but if two employees share the same salary, their rank is the same.
String Operations and Regex
-- String operations: CONCAT, SUBSTRING, REPLACE, etc.
SELECT
CONCAT(FirstName, ' ', LastName) AS FullName,
SUBSTRING(FirstName, 1, 1) AS FirstInitial,
REPLACE(LastName, 'a', 'o') AS ModifiedLastName
FROM Employees;
-- Regex using LIKE
SELECT EmployeeID, FirstName, LastName
FROM Employees
WHERE FirstName LIKE 'J%'; -- First name starts with 'J'
-- Regex using PATINDEX (SQL Server)
SELECT EmployeeID, FirstName, LastName
FROM Employees
WHERE PATINDEX('%[0-9]%', LastName) > 0; -- Last name contains a digit
CONCAT
combines two strings together, SUBSTRING
filters a string and REPLACE
replaces letters in a string.
Through LIKE 'J%'
gets an value that starts with J in the corresponding column.
And finally, PATINDEX('%[0-9]%', LastName) > 0
This is a pattern index search to find the first occurrence of any number between between 0 and 9 in LastName
.
There are plenty of other things to learn, but these advanced techniques are the ones I most often use as a data scientist. I am sure other things come up depending on your industry and organisation.
SQL is not a massive language, but there are still loads of functions. If you are interested, see here for the complete list.
Extra Things
If you have more time on your hands, you can try to learn the infrastructure around SQL, such as SQL servers, database management, data warehousing, scaling data platforms, query optimisation and building ETL pipelines.
Most of the list above is part of the data engineering role. If that interests you, this blog provides an excellent roadmap for how you can become a data engineer.
Honestly, I don't have an in-depth understanding of these topics because they are things data engineers look after more than data scientists. But if they interest you, feel free to learn them! I also plan to cover them sometime in the future, as it will make me a more well-rounded data scientist and increase my value. But, as I always say, you can't learn everything!
Resources
Regarding resources, I used W3Schools and Tutorialspoint when I was learning SQL. They are completely free and really cover the basics well. I like to think of them more as reference texts than a complete course, but they are a great entry point if you are starting out and contain exercises for you to practise.
If you want to practise more outside the courses, then websites like Hacker Rank and Leetcode are great. I did about ~50 SQL questions on Hacker Rank before I felt comfortable answering entry-level interview questions.
I wouldn't worry too much about this number; it may take you longer or shorter than 50 questions; focus on understanding, and you will know when you feel comfortable.
Advice
The three main bits of advice I would give to someone starting to learn SQL is:
- Don't worry or waste too much time picking the "right" course; choose one you like and start. As a beginner, any introductory course will be beneficial and will most likely cover the same topics.
- Practise daily and make sure you do exercise problems to understand SQL; the 10,000-hour rule (10,000 iteration rule) applies here as it does to everything else.
- Have patience, and I promise everything will eventually sink in. You won't learn everything or master SQL in a single day, but you can learn enough SQL knowledge in 2 weeks to land an entry-level position. This is precisely what I did, and you can read my entire SQL journey in the post below.
I hope this article has given you the knowledge you need in SQL to land that entry-level data science role! The SQL skills you need are less extensive than those in Python, and they should take you less time to learn as SQL is slightly more manageable and a smaller language, but it is still essential nonetheless.
Another Thing!
I have a free newsletter, Dishing the Data, where I share weekly tips for becoming a better Data Scientist, my general experience in the field, and some thoughts I have had over the past week.
Connect With Me!
- LinkedIn, X (Twitter), or Instagram.
- My YouTube Channel to learn technical data science and Machine Learning concepts!