3 Subtle Mistakes That Can Undermine Your SQL Query Performance

Author:Murphy | View: 28245 | Time: 2025-03-23 18:38:27

Sql is a universal tool for anyone working with data-centric products. Whether you are a data analyst, data scientist, or software developer, you may need to write SQL queries to extract data from a relational database.

When writing queries, we tend to focus on getting the data we need. The query performance is usually the second concern. This is usually fine except for when the data size is quite large.

Besides, the database management systems create an execution plan to optimize the queries before actually executing them, which, in some cases, can be considered as improving a query.

However, we still need to keep in mind the query performance and should aim for writing more optimal and performant queries. Our application may not be suffering from query performance for now but the data we deal with can suddenly increase.

In this article, we will learn about 3 subtle mistakes that can degrade SQL query performance especially when working with large datasets. While these mistakes may not lead to incorrect results, and may indeed yield the desired data, there is room for substantial improvement in terms of both query efficiency and maintainability.

You can become a Medium member to unlock full access to my writing, plus the rest of Medium. If you already are, don't forget to subscribe if you'd like to get an email whenever I publish a new article.

Inefficient use of JOINs

SQL JOINs are used when we need to combine data from multiple tables. The query below demonstrates an inefficient use of JOINs.

SELECT * FROM employees e 
LEFT JOIN departments d ON e.department_id = d.department_id 
WHERE d.department_name = 'IT';

When we include a condition in the WHERE clause that references a column in the table on the "right" side of a LEFT JOIN (the table that may have unmatched rows), the effect is essentially the same as using an INNER JOIN, because the condition filters out the NULL values that result from unmatched rows.

We can update this query by preventing it from processing unmatched rows as follows:

SELECT * FROM employees e 
INNER JOIN departments d ON e.department_id = d.department_id 
WHERE d.department_name = 'IT';

Scan vs Seek

Scan and seek refer to two different ways of accessing data when processing a query. It helps you to optimize queries for better performance to know the difference between these two concepts.

A scan refers to reading the entire table or index to find the desired data, which can be slow especially for large tables. Where there are no suitable indexes, a query is more likely to perform a scan.

Consider we have an orders table with columns order id, customer id, and order date and we often use it to extract orders for a specific customer as follows:

SELECT * FROM orders WHERE customer_id = 123;

If there is no index on the customer id column, the database has to do a full table scan, which is inefficient.

A seek refers to using an index to quickly locate specific rows in the table, which is generally faster than a scan. Seeks are more likely to occur when the query can take advantage of an index to filter, join, or sort the data.

Here is how we can optimize the query above by creating an index on the customer id column:

CREATE INDEX idx_orders_customer_id ON orders (customer_id);

SELECT * FROM orders WHERE customer_id = 123;

Although indexes could be beneficial, keep in mind that unnecessary or improperly designed indexes can also negatively affect performance.

We can also create a multi-column (i.e. composite) index, which comes in handy when joining tables and filtering based on multiple columns.

Here is an example for creating a composite index:

-- Create a multi-column index on the 'hire_date' and 'salary' columns
CREATE INDEX idx_employees_hire_date_salary ON employees (hire_date, salary);

SELECT first_name, last_name, hire_date, salary
FROM employees
WHERE hire_date >= '2020-01-01' AND salary > 50000;

The use of subqueries

The use of a subquery, especially a correlated subquery, can often be replaced with a JOIN operation for better performance.

For instance, the following query includes a subquery.

SELECT * FROM orders 
WHERE order_id IN (SELECT order_id FROM order_items WHERE product_id = 5);

The WHERE clause is used for finding all orders in the order items table with a product id of 5. There is nothing wrong with this query in terms of extracting the data you need. However, since subqueries can often be slower than other constructs, particularly for large datasets, we can rewrite this query in a more efficient way using a JOIN:

SELECT o.* 
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
WHERE oi.product_id = 5;

The JOIN clause combines rows from orders and order items table based on the shared order id column. Then, the results are filtered to include only those rows where the product id in the order items table is 5.

The examples covered in this article demonstrate potential improvement areas for SQL queries. There is no guarantee that they apply to each and every query you write. It's always best to try out different versions of the query to find the most performant one.

The more you practice writing and testing SQL queries, the better you'll get at spotting and avoiding these subtle mistakes. By making small improvements, you can achieve significant gains in query performance and maintainability.

Thank you for reading. Please let me know if you have any feedback.

Tags: Artificial Intelligence Data Science Machine Learning Programming Sql