Ink to Insights: Comparing SQL and Python Queries using Bookshop Analytics

Author: Murphy

SQL is the bread and butter of any data scientist's toolbox – the ability to quickly pull data from a source for analysis is an essential skill for anyone working with large amounts of data. In this post I want to walk through a few basic queries I typically use in SQL over the course of an EDA process, and compare each one against a Python script that produces the same output.

For this analysis I'll be using some synthetic data on last year's highest-rated books from a hypothetical chain of bookshops (the Total Fiction Bookstore). A link to the GitHub folder for this project can be found here, where I go into the details of running the analysis.


As a side note – while I mainly focus on the SQL queries in this article, it's worth noting that these queries can be integrated pretty seamlessly with Python using the pandasql library (as I've done for this project). This can be seen in detail in the Jupyter notebook on this project's GitHub link, but the structure of such a query generally goes as follows:

query = """
SELECT * FROM DATA
"""

output = sqldf(query,locals())
output

pandasql is a very practical library for those who are more familiar with SQL querying than with typical Pandas DataFrame manipulation – and the result is often much easier to read, as I'll show here.
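For instance, a minimal end-to-end call might look like the sketch below – the DataFrame and its values here are invented purely for illustration:

import pandas as pd
from pandasql import sqldf

# toy DataFrame for illustration – the values are made up
books = pd.DataFrame({
    'Title': ['Book A', 'Book B', 'Book C'],
    'Revenue': [1200, 3400, 2100]
})

# pandasql exposes local DataFrames as SQL tables under their variable names
print(sqldf("SELECT Title, Revenue FROM books ORDER BY Revenue DESC", locals()))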

The Dataset

A snippet of the dataset can be seen below – there are columns for the book title, the year it was published, the number of pages, the genres, the book's average rating, the author, the number of units sold, and the book's revenue.

Synthetic data to be analysed (data by author)
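Since the snippet itself is an image, here's a hypothetical pair of rows matching that schema for anyone following along in text – the titles and figures below are invented placeholders, not values from the actual dataset:

import pandas as pd

df = pd.DataFrame({
    'Title': ['The Quiet Shore', 'Paper Cities'],   # hypothetical titles
    'YearPublished': [1963, 1987],
    'NumPages': [342, 415],
    'Genres': ['Literary Fiction', 'Science Fiction'],
    'Rating': [4.3, 4.1],
    'Author': ['Ava Mitchell', 'Liam Parker'],
    'UnitsSold': [1200, 950],
    'Revenue': [18000.0, 14250.0]                    # invented figures
})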

Revenue Analysis by Decade

Let's say I want to know which decade has published the most profitable books for the bookshop. The original dataset doesn't have a column for the decade a book was published in – however, this is straightforward to derive from the data. I use a CTE to floor-divide the year by 10 and multiply by 10 to get the decade, before summing and averaging the revenue by decade. I then order the results by total revenue to find the most profitable decades of published books in the bookshop.

WITH bookshop AS
(
SELECT TITLE, YEARPUBLISHED,
(YEARPUBLISHED/10) * 10 AS DECADE,
NUMPAGES, GENRES, RATING, AUTHOR, UNITSSOLD,
REVENUE
FROM df
)

SELECT DECADE, SUM(REVENUE) AS TOTAL_REVENUE,
ROUND(AVG(REVENUE),0) AS AVG_REVENUE
FROM bookshop
GROUP BY DECADE
ORDER BY TOTAL_REVENUE DESC
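Plugged into the pandasql pattern from earlier, running this query produces the decade table used in the rest of this section – I'm assuming here that the notebook stores the result under the agg_decade name that the plotting script further down expects:

# running the decade query through pandasql;
# agg_decade is the name the plotting script below refers to
agg_decade = sqldf(query, locals())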

By comparison, the equivalent output in Python would look something like the code snippet below. I use vectorised floor division to compute the decade column, and from there I aggregate the revenue by decade and sort the result by total revenue.

# creating df bookshop
bookshop = df.copy()
bookshop['Decade'] = (bookshop['YearPublished'] // 10) * 10

# group by decade, aggregating revenue by sum and mean
result = (bookshop.groupby('Decade')
                  .agg({'Revenue': ['sum', 'mean']})
                  .reset_index())

result.columns = ['Decade', 'Total_Revenue', 'Avg_Revenue']

# sorting by total revenue, descending, to match the SQL output
result = result.sort_values('Total_Revenue', ascending=False)

Note how many more separate steps the Python script needs to achieve the same result – the chained functions are awkward and difficult to understand at first glance. By comparison, the SQL script is much clearer in its presentation and much easier to read.

I can now take this query output and visualise it to get a sense of the book revenue trends across the decades, setting up a matplotlib graph with the following script – bars show the total revenue by decade, with a scatter plot on the secondary axis showing average book revenue.

import matplotlib.pyplot as plt

# Creating primary y-axis (total revenue)
fig, ax1 = plt.subplots(figsize=(15, 9))
ax1.bar(agg_decade['DECADE'], agg_decade['TOTAL_REVENUE'],
        width=0.4, align='center', label='Total Revenue (Dollars)')
ax1.set_xlabel('Decade')
ax1.set_ylabel('Total Revenue (Dollars)', color='blue')

# Adjusting gridlines on the primary y-axis
ax1.grid(color='blue', linestyle='--', linewidth=0.5, alpha=0.5)

# Creating secondary y-axis (avg revenue)
ax2 = ax1.twinx()
ax2.scatter(agg_decade['DECADE'], agg_decade['AVG_REVENUE'],
            marker='o', color='red', label='Avg Revenue (Dollars)')
ax2.set_ylabel('Avg Revenue (Dollars)', color='red')

# Adjusting gridlines on the secondary y-axis
ax2.grid(color='red', linestyle='--', linewidth=0.5, alpha=0.5)

# Scaling both y-axes to 110% of their respective maxima
ax1.set_ylim(0, 1.1*max(agg_decade['TOTAL_REVENUE']))
ax2.set_ylim(0, 1.1*max(agg_decade['AVG_REVENUE']))

# Combining legends for both axes
lines, labels = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax2.legend(lines + lines2, labels + labels2, loc='upper left')

# Set title
plt.title('Total and Avg Revenue by Decade')

# Show the plot
plt.show()

The visualisation can be seen below – books published in the 1960s are apparently the most profitable for the bookshop, generating over $192,000 in revenue for Total Fiction Bookstore. By comparison, books on the list from the 1900s are more profitable on average, though they didn't sell as well as books from the 1960s.

Total and average revenue by published decade (image by author)

Average book revenue follows a similar trend to total revenue across all decades of published books – with the exception of books from the 1900s and 1980s, which are more profitable on average but not overall.

Author Analysis

Now, suppose I want data on the top 10 authors in the list, ordered by their total generated revenue. For this query I want to know the number of books each author has on the list, the total revenue those books generated, their average revenue per book, and the average rating of those books in the bookshop. This is simple enough to answer in SQL – I can use a COUNT statement to get the total number of books per author, and AVG statements to get the mean revenue and rating. Following that, I group these statements by author.

SELECT AUTHOR,
COUNT(TITLE) AS NUM_BOOKS,
SUM(REVENUE) AS TOTAL_REVENUE,
ROUND(AVG(REVENUE),0) AS AVG_REVENUE,
ROUND(AVG(RATING),2) AS AVG_RATING_PER_BOOK
FROM bookshop
GROUP BY AUTHOR
ORDER BY TOTAL_REVENUE DESC
LIMIT 10
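As before, this runs through pandasql in the same way – assuming the output is stored as the agg_author DataFrame that the plotting script further down refers to:

# running the author query through pandasql to get the top 10 table
agg_author = sqldf(query, locals())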

The equivalent Python script would look like this – roughly the same length, but much more complex for the same output. I group the values by author, specify how to aggregate each column in the agg function, then sort the values by total revenue. Again, the SQL script is much clearer by comparison.

result = (bookshop.groupby('Author')
                  .agg({
                      'Title': 'count',
                      'Revenue': ['sum', 'mean'],
                      'Rating': 'mean'
                  })
                  .reset_index())

result.columns = ['Author', 'Num_Books', 'Total_Revenue', 
                  'Avg_Revenue', 'Avg_Rating_per_Book']

# Sorting for total revenue
result = result.sort_values('Total_Revenue', ascending=False)

# top 10
result_top10 = result.head(10)

The output from this query can be seen below – Ava Mitchell leads the field, with a total revenue of over $152,000 from her book sales. Emma Hayes takes 2nd place at over $85,000, with Liam Parker close behind at over $83,000.

Output of the book author query

Visualising this in matplotlib using the following script, we can generate bar plots of total revenue with data points showing the average book revenue per author. The average rating per author is also plotted on a secondary axis.

# Creating figure and axis
fig1, ax1 = plt.subplots(figsize=(15, 9))

# Plotting bar chart of total revenue
ax1.bar(agg_author['Author'], agg_author['TOTAL_REVENUE'],
        width=0.4, align='center', color='silver', label='Total Revenue (Dollars)')
ax1.set_xlabel('Author')
ax1.set_xticks(range(len(agg_author['Author'])))
ax1.set_xticklabels(agg_author['Author'], rotation=-45, ha='left')
ax1.set_ylabel('Total Revenue (Dollars)', color='blue')

# Adjusting gridlines on the primary y-axis
ax1.grid(color='blue', linestyle='--', linewidth=0.5, alpha=0.5)

# Creating scatter plot of avg revenue
ax1.scatter(agg_author['Author'], agg_author['AVG_REVENUE'],
            marker="D", color='blue', label='Avg Revenue per Book (Dollars)')

# Creating scatter plot of avg rating on secondary axis
ax2 = ax1.twinx()
ax2.scatter(agg_author['Author'], agg_author['AVG_RATING_PER_BOOK'],
            marker='^', color='red', label='Avg Rating per Book')
ax2.set_ylabel('Avg Rating', color='red')

# Adjusting gridlines on the secondary y-axis
ax2.grid(color='red', linestyle='--', linewidth=0.5, alpha=0.5)

# Combining legends for both axes
lines, labels = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(lines + lines2, labels + labels2, loc='upper right')

# Set title
plt.title('Top 10 Authors by Revenue, Rating')

# Show the plot
plt.show()

Running this, we get the following graph:

Top 10 authors by revenue and rating (image by author)

This graph points to a pretty clear conclusion – revenue doesn't correlate with average rating for each author. Ava Mitchell has the highest revenue but sits at the median rating among the authors listed above. Olivia Hudson is highest by average rating while placing 8th in total revenue; there is no observable trend between an author's revenue and their average rating.
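As a quick sanity check on that visual reading, one could compute the correlation directly over the aggregated query output (assumed here to be in the agg_author DataFrame used for the plot):

# Pearson correlation between total revenue and average rating;
# a coefficient near zero would support the 'no trend' observation
print(agg_author['TOTAL_REVENUE'].corr(agg_author['AVG_RATING_PER_BOOK']))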

Comparing Book Length vs Revenue

Finally, let's assume I want to show how book revenue differs with book length. To answer this question, I first divide the books into 4 equal categories based on the book length quartiles, which gives a better idea of the overall revenue vs book length trend.

First, I define the quartiles in SQL, using a CTE to generate these values, before sorting the books into buckets with a CASE WHEN statement.

WITH PERCENTILES AS (
    SELECT 
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY NUMPAGES) 
        AS PERCENTILE_25,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY NUMPAGES) 
        AS MEDIAN,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY NUMPAGES) 
        AS PERCENTILE_75
    FROM bookshop
)
SELECT 
    TITLE, REVENUE, NUMPAGES,
    CASE
        WHEN NUMPAGES < (SELECT PERCENTILE_25 FROM PERCENTILES) 
        THEN 'Quartile 1'
        WHEN NUMPAGES BETWEEN (SELECT PERCENTILE_25 FROM PERCENTILES) 
        AND (SELECT MEDIAN FROM PERCENTILES) THEN 'Quartile 2'
        WHEN NUMPAGES BETWEEN (SELECT MEDIAN FROM PERCENTILES) 
        AND (SELECT PERCENTILE_75 FROM PERCENTILES) THEN 'Quartile 3'
        WHEN NUMPAGES > (SELECT PERCENTILE_75 FROM PERCENTILES) 
        THEN 'Quartile 4'
    END AS PAGELENGTH_QUARTILE
FROM bookshop
ORDER BY REVENUE DESC
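As an aside, dialects with window function support (PostgreSQL, for instance) can skip the percentile arithmetic altogether using NTILE, which deals the rows into four equal-sized buckets directly. The sketch below shows that alternative – it is not the query used in this project, and note that NTILE splits by row count, so its bucket edges can differ slightly from percentile cut-offs when page counts tie:

SELECT TITLE, REVENUE, NUMPAGES,
NTILE(4) OVER (ORDER BY NUMPAGES) AS PAGELENGTH_QUARTILE
FROM bookshop
ORDER BY REVENUE DESC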

Alternatively (for SQL dialects that don't support percentile functions, like SQLite), the quartiles can be calculated separately and entered manually into the CASE WHEN statement.

-- For SQLite dialect
SELECT TITLE, REVENUE, NUMPAGES,
CASE
WHEN NUMPAGES < 318 THEN 'Quartile 1'
WHEN NUMPAGES BETWEEN 318 AND 375 THEN 'Quartile 2'
WHEN NUMPAGES BETWEEN 375 AND 438 THEN 'Quartile 3'
WHEN NUMPAGES > 438 THEN 'Quartile 4'
END AS PAGELENGTH_QUARTILE
FROM bookshop
ORDER BY REVENUE DESC
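This SQLite-friendly version is the one that actually runs through pandasql (which executes its queries against SQLite under the hood) – assuming its output is stored as the pagelength_output DataFrame used by the boxplot script below:

# running the quartile query through pandasql;
# pagelength_output is the name the boxplot script expects
pagelength_output = sqldf(query, locals())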

Running this same query in Python, I define the percentiles using numpy, use the cut function to sort the books into their buckets, then sort the values by revenue to match the SQL output. As before, this process is noticeably more complex than the equivalent SQL script.

import numpy as np

# Define the percentiles using numpy
percentiles = np.percentile(bookshop['NumPages'], [25, 50, 75])

# Define the bin edges using the calculated percentiles
bin_edges = [-float('inf'), *percentiles, float('inf')]

# Define the labels for the buckets
bucket_labels = ['Quartile 1', 'Quartile 2', 'Quartile 3', 'Quartile 4']

# Create the 'PAGELENGTH_QUARTILE' column based on bin edges and labels
bookshop['PAGELENGTH_QUARTILE'] = pd.cut(bookshop['NumPages'], bins=bin_edges,
                                         labels=bucket_labels)

# sorting by revenue, descending, to match the SQL output
result = bookshop[['Title', 'Revenue',
                   'NumPages', 'PAGELENGTH_QUARTILE']].sort_values(by='Revenue',
                                                                   ascending=False)

The output of this query can be visualised as boxplots using seaborn – a snippet of the script used to generate the boxplots can be seen below. Note that the page length buckets are manually ordered so they're presented correctly.

import seaborn as sns

# Set the style for the plots
sns.set(style="whitegrid")

# Setting order of page length buckets
pagelength_bucket_order = ['Quartile 1', 'Quartile 2',
                           'Quartile 3', 'Quartile 4']

# Create the boxplot
plt.figure(figsize=(16, 10))
sns.boxplot(x='PAGELENGTH_QUARTILE', y='Revenue', 
            data=pagelength_output, order = pagelength_bucket_order, 
            showfliers=True)

# Add labels and title
plt.xlabel('PageLength Quartile')
plt.ylabel('Revenue (Dollars)')
plt.title('Boxplot of Revenue by PageLength Bucket')

# Show the plot
plt.show()

The boxplots can be seen below – note that the median revenue for each book length quartile trends upwards as the books get longer. This suggests that longer books are more profitable for the bookshop.

Boxplot of revenue by book length quartile (image by author)

Additionally, the range of the 4th quartile is much wider compared to the other quartiles, indicating that there's more variation in revenue among the longest books.
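To put a number on that spread, one could compare the interquartile range of revenue within each bucket – a small sketch, assuming the boxplot's input DataFrame pagelength_output is still in scope:

# IQR of revenue per page length quartile; a larger value for
# Quartile 4 would confirm the wider spread visible in the boxplot
q = (pagelength_output.groupby('PAGELENGTH_QUARTILE')['Revenue']
     .quantile([0.25, 0.75]).unstack())
print(q[0.75] - q[0.25])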

Final Thoughts and Further Applications

To conclude, SQL queries for data analysis are typically much more straightforward to write than equivalent operations in Python, while being broadly capable of producing the same results. I wouldn't argue that either is better than the other – rather, I believe that using the two languages in combination, as I've done in this analysis, can produce more efficient and effective data analysis.

Therefore, given the greater clarity of SQL queries over their Python counterparts, I think SQL is the more natural choice when performing the initial EDA for a project. As I've shown in this article, it is far easier to read and write, making it especially advantageous for these early exploratory tasks. I often use it when starting on a project, and I'd recommend this approach to anyone who already has a decent grasp of SQL querying.

Tags: Data Analysis, Data Visualization, Python, SQL, Tips And Tricks
