When Are Songwriters Most Successful?

Author:Murphy  |  View: 28106  |  Time: 2025-03-23 12:59:45
A singer receiving a Grammy Award amid a flurry of confetti (created with the assistance of DALL-E2)

At what age are singer-songwriters most successful? I wondered this the other day when I heard an old Stevie Wonder song. My impression was that, like mathematicians, singer-songwriters peak in their mid-to-late 20s. But what does the data say?

In this Quick Success Data Science project, we'll use Python, pandas, and the seaborn plotting library to investigate this question. We'll look at the careers of 16 prominent singer-songwriters with over 500 hits among them. We'll also incorporate an attractive graphic known as the kernel density estimate plot into the analysis.

Methodology

To determine when songwriters are most successful, we'll need some guidelines. The plan is to examine:

  • Singer-songwriters including those who work with co-writers.
  • Singer-songwriters with decades-long careers.
  • A diverse selection of singer-songwriters and musical genres.
  • Singer-songwriters on the Billboard Hot 100 chart.

The Hot 100 is a weekly chart, published by Billboard magazine, that ranks the best-performing songs in the United States. The rankings are based on physical and digital sales, radio play, and online streaming. We'll use it as a consistent and objective way to judge success.

The Data

We'll use songs written by the following highly successful artists:

List of singer-songwriters used in this project (all remaining images by the author).

I've recorded the age of each artist at the time of each of their hits and saved it as a CSV file stored on this Gist. If they had multiple hits in the same year, their age entry was repeated for each hit. Here's a glimpse at the top of the file:

The first few rows of the CSV file.

Cross-referencing this information is tedious (ChatGPT refused to do it!). Consequently, a few hits written by these artists but performed by others may have been inadvertently excluded.

Kernel Density Estimate Plots

A kernel density estimate plot is a method – similar to a histogram – for visualizing the distribution of data points. While a histogram bins and counts observations, a KDE plot smooths the observations using a Gaussian kernel. This produces a continuous density estimate where the y-values are normalized so that the total area under the curve equals one.

The following figure compares the two methods. How well they capture the underlying data depends on how the histogram is binned and how the KDE plot is smoothed.

A KDE plot (curve) vs. a histogram (bars) for a series of observations (dots).

Unlike histograms, which don't differentiate where a sample falls within a bin, a KDE plot draws a small Gaussian bell curve over each individual sample. These bell curves are then summed together to form the final curve. This makes KDE plots wider than histograms, with an underlying assumption that the data extends smoothly toward the extremes. Thus, KDE plots won't stop abruptly at zero, even if that's a hard limit to the data.

The final KDE curve is built from bell curves over each individual data point.
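
To make the "one bell curve per sample" idea concrete, here's a minimal NumPy sketch. The sample ages and the fixed bandwidth of 3 years are made up for illustration; seaborn chooses its own bandwidth from the data. Note that the bells are averaged (summed, then divided by the number of samples), which is what keeps the total area at one.

```python
import numpy as np

# Hypothetical observations (not taken from the article's dataset).
samples = np.array([24.0, 27.0, 29.0, 30.0, 31.0, 35.0, 42.0])
bandwidth = 3.0  # assumed kernel standard deviation, in years

# Evaluation grid wide enough to contain nearly all of every bell curve.
grid = np.linspace(0.0, 70.0, 7001)
dx = grid[1] - grid[0]

# One Gaussian bell curve per sample (rows), evaluated on the grid (columns).
z = (grid[None, :] - samples[:, None]) / bandwidth
bells = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))

# Average the bells to get the final curve, then approximate its area.
density = bells.mean(axis=0)
area = density.sum() * dx

# The curve is still above zero at age 15, below the youngest sample (24).
density_at_15 = np.interp(15.0, grid, density)

print(f"Area under the curve: {area:.4f}")
print(f"Density at age 15: {density_at_15:.6f}")
```

The nonzero density at age 15 is the "tail" behavior described above: the curve extends past the limits of the actual data.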

KDE plots use bandwidth for the kernel smoothing process. Selecting the proper bandwidth is both important and something of an art. The smaller the bandwidth, the more closely the KDE plot honors the underlying data. The wider the bandwidth, the more the data is averaged and smoothed.

Example of bandwidth smoothing in a KDE plot.

Narrow bandwidths can produce jagged curves that defeat the purpose of using a KDE plot in the first place. They can also introduce random noise artifacts. On the other hand, wide bandwidths (such as a bw_adjust value of 2) can smooth too much, causing important features of the data distribution, like bimodality, to be lost.

The impact of applying different bandwidth adjustments to the final curve.
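
The effect of bandwidth on features like bimodality can also be checked numerically. The sketch below counts the peaks in a hand-built KDE at three bandwidth adjustments. It assumes Scott's rule (sample standard deviation times n^(-1/5)) as the base bandwidth, which is my understanding of seaborn's default bw_method, with bw_adjust acting as a multiplier. The helper functions gaussian_kde_curve() and count_peaks() and the ages are hypothetical.

```python
import numpy as np

def gaussian_kde_curve(samples, grid, bw_adjust=1.0):
    """Evaluate a Gaussian KDE on grid using Scott's rule times bw_adjust."""
    samples = np.asarray(samples, dtype=float)
    # Scott's rule for 1-D data: sample std (ddof=1) times n ** (-1/5).
    bandwidth = samples.std(ddof=1) * len(samples) ** (-1 / 5) * bw_adjust
    z = (grid[None, :] - samples[:, None]) / bandwidth
    bells = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return bells.mean(axis=0)

def count_peaks(density):
    """Count strict local maxima in the interior of the curve."""
    interior = density[1:-1]
    return int(np.sum((interior > density[:-2]) & (interior > density[2:])))

# Hypothetical bimodal ages: two clusters of four hits each.
ages = [20, 24, 28, 32, 52, 56, 60, 64]
grid = np.linspace(0.0, 90.0, 9001)

peaks = {bw: count_peaks(gaussian_kde_curve(ages, grid, bw))
         for bw in (0.1, 0.5, 2.0)}
for bw, n_peaks in peaks.items():
    print(f"bw_adjust={bw}: {n_peaks} peak(s)")
```

With these made-up ages, the narrowest setting gives every sample its own spike, the middle setting recovers the two clusters, and the widest setting merges them into a single hump.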

While the seaborn library's kdeplot() method uses good defaults for generating plots, you'll probably want to play with the bw_adjust parameter to tune the KDE plot to the story you're trying to tell.

So why use a KDE plot when there's a perfectly good histogram sitting right there? Here are some reasons:

  • KDE plots are less cluttered than histograms and much more readable when overlaying multiple distributions in a single figure.
  • KDE plots can let you see patterns in data (such as central tendency, bimodality, and skew) that may be obscured in a histogram view.
  • Similar to sparklines, KDE plots are good for "quick looks" and quality control.
  • KDE plots are arguably more attractive than histograms and make a better choice for infographics aimed at the general public.
  • KDE plots facilitate easy comparisons between subsets of data.

For more on KDE plots, see the kdeplot() docs and the Kernel Density Estimation section of seaborn's visualization guide.

Installing Libraries

For this project, we'll need to install seaborn for plotting and pandas for data analysis. You can install these libraries as follows:

With conda: conda install pandas seaborn

With pip: pip install pandas seaborn


The Code

The following code was written in JupyterLab and is described by cell.

Importing Libraries and Loading the Data

After importing the libraries, we'll select the seaborn "whitegrid" style so that our plots have a consistent look. We'll then use pandas to read the CSV file.

If you're using a virtual environment, note that NumPy and Matplotlib are installed automatically as dependencies of seaborn (NumPy is also a dependency of pandas), so there's no need to install them independently.

Because some artists had more hits than others and DataFrames need to be rectangular, some columns will be assigned missing values. We'll replace these NaN values with zeros using the fillna() method.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

sns.set(style='whitegrid')

# Load the data and set NaN values to zero:
df = pd.read_csv('https://bit.ly/3E8Q1BO').fillna(0)
df.head(3)
The head of the initial DataFrame.
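
To see why the missing values arise, here's a small sketch that reads a made-up, two-artist version of the wide CSV from a string. The artist names and ages are placeholders, not values from the real file.

```python
import io
import pandas as pd

# Hypothetical miniature of the wide CSV: one column per artist, one row per hit.
# Artist B has fewer hits, so the trailing cells in that column are empty.
csv_text = """Artist A,Artist B
23,31
25,34
25,
40,
"""

wide = pd.read_csv(io.StringIO(csv_text))
missing_per_artist = wide.isna().sum()  # empty cells are read as NaN
print(missing_per_artist)

# Pad the short column with zeros, as in the loading step above.
filled = wide.fillna(0)
print(filled)
```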

Melting the DataFrame

Our DataFrame is currently in "wide" format. Every artist's name is a column header, and their ages are the column values. Python's plotting libraries tend to prefer "long" formats. This means that the artists' names should be row values and should be repeated for each corresponding age.

For convenience, pandas includes a method called melt() that converts from wide to long format. The var_name argument is used to set a new column, called "Name," to hold our previous column names. The value_name argument indicates that the previous column values should now be under a column named "Age."

In addition, we'll add a new column named "Color," which we'll set to "red." This is a handy way to group the data for later plotting. We'll also filter out Age values equal to zero so that we can get an accurate count of hit songs per artist.

# Melt the DataFrame to a long format for plotting by artist name:
melted_df = pd.melt(df, var_name='Name', value_name='Age')

# Make a column for the plotting color:
melted_df['Color'] = 'red'

# Filter out zero values:
melted_df = melted_df[melted_df['Age'] != 0]

melted_df.tail(3)
The tail of the long format melted DataFrame.
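
If the wide-to-long conversion is hard to picture, this toy example applies the same melt-and-filter steps to a made-up, zero-padded table with two placeholder artists.

```python
import pandas as pd

# Hypothetical wide table, padded with zeros for the artist with fewer hits.
wide = pd.DataFrame({'Artist A': [23, 25, 25, 40],
                     'Artist B': [31, 34, 0, 0]})

# One row per (artist, age) pair; the old column names become "Name" values.
long_df = pd.melt(wide, var_name='Name', value_name='Age')

# Drop the zero padding so that only real hits remain.
long_df = long_df[long_df['Age'] != 0].reset_index(drop=True)
print(long_df)
```

The eight cells of the wide table become six rows in the long table: four for Artist A and two for Artist B.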

Plotting the Number of Hits per Artist

Next, we'll use pandas' value_counts() method to count the number of Hot 100 hits per artist. Because value_counts() returns its results sorted in descending order by default, we can use its index to order the bars.

# Calculate the order of the bars by counts:
order = melted_df['Name'].value_counts().index

# Plot a bar chart of the number of hits by each artist:
ax = sns.countplot(data=melted_df, 
                   x='Name', 
                   color='red', 
                   order=order)
ax.set_xticklabels(ax.get_xticklabels(), 
                   rotation=55, 
                   fontsize=10, 
                   ha='right')
ax.set_title('Number of Billboard Hot 100 Hits');
The number of Hot 100 hits by each artist over their career.
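
As a quick check on how the bar order is derived, this sketch runs value_counts() on a made-up "Name" column. The artist labels are placeholders.

```python
import pandas as pd

# Hypothetical "Name" column from a long-format table, in no particular order.
names = pd.Series(['Artist B', 'Artist A', 'Artist A',
                   'Artist C', 'Artist A', 'Artist C'], name='Name')

# Counts come back sorted from most to fewest hits.
counts = names.value_counts()
order = counts.index.tolist()

print(counts)
print(order)
```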

Plotting Age at the Time of Last Hit and Total Time Between Hits

Next, we'll plot how old each artist was at the time of their last Hot 100 hit, along with the time between the first and last appearance on the chart.

Since we'll need to find each artist's minimum age, we'll first set 0 values in the df DataFrame to NaN, so they will be ignored. Otherwise, 0 would be picked as the minimum age. We won't sort the results as we want to compare charts side-by-side, and thus want the artist names to remain in the same order.

# Replace 0 values with NaN in order to find minimum age statistic:
df = df.replace(0, np.nan)

# Calculate maximum age for each column
max_age = df.max()

# Calculate age span (maximum age - minimum age) for each column
age_span = df.max() - df.min()

# Create subplots for two bar charts
fig, axes = plt.subplots(nrows=1, ncols=2, 
                         figsize=(8, 6))

# Plot artist's age at time of last hit:
sns.barplot(x=max_age.values, 
            y=max_age.index, 
            ax=axes[0], 
            color='red', 
            orient='h')
axes[0].set_title('Artist Age at Time of Last Hit')
axes[0].set_xlabel('Age at Last Hit')
axes[0].set_ylabel('Artist')

# Plot age span between hits:
sns.barplot(x=age_span.values, 
            y=age_span.index, 
            ax=axes[1], 
            color='red', 
            orient='h')
axes[1].set_title("Years Between Artist's First & Last Hits")
axes[1].set_xlabel('Years Between First & Last Hit')
axes[1].set_ylabel('')

plt.tight_layout();
Comparison of each artist's age at the time of their last hit with the timespan between hits.
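
The reason for swapping the zeros back to NaN is easiest to see on a toy table. In this sketch (made-up ages, placeholder artist names), the zero padding is picked up as the minimum age until it's masked.

```python
import numpy as np
import pandas as pd

# Hypothetical zero-padded wide table.
wide = pd.DataFrame({'Artist A': [23.0, 25.0, 25.0, 40.0],
                     'Artist B': [31.0, 34.0, 0.0, 0.0]})

# With the padding in place, Artist B's "youngest hit" is reported as age 0.
min_with_padding = wide.min()

# pandas skips NaN by default, so masking the zeros gives the true minimum.
masked = wide.replace(0, np.nan)
age_span = masked.max() - masked.min()

print(min_with_padding)
print(age_span)
```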

Takeaways from these charts are that more than half of the artists charted all their hits before age 50 and charted over timespans of 30 years or less.

Calculating Age Statistics

Now, let's find the age when the artists were most successful. Since a single statistic can't necessarily capture this, we'll use pandas' mean(), median(), and mode() methods.

# Calculate the statistics:
ages = melted_df['Age']
mean_age = round(ages.mean(), 1)
median_age = round(ages.median(), 1)
# mode() returns a Series because ties are possible, so list every modal age:
mode_age = ages.mode().round(1).tolist()

print(f"Mean age = {mean_age}")
print(f"Median age = {median_age}")
print(f"Mode age = {mode_age}")

Of the 16 singer-songwriters under investigation, the successful "sweet spot" appears to be around 29–33 years old.
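
To illustrate why it helps to look at all three statistics, here's a sketch on a handful of made-up ages that includes one late-career hit. These are not the article's numbers.

```python
import pandas as pd

# Hypothetical ages at the time of each hit, with one late-career outlier (58).
ages = pd.Series([24, 27, 29, 29, 31, 31, 35, 58], name='Age')

mean_age = round(ages.mean(), 1)   # pulled upward by the outlier
median_age = ages.median()         # middle of the sorted values
modes = ages.mode().tolist()       # every most-common age (ties are possible)

print(f"Mean age = {mean_age}")
print(f"Median age = {median_age}")
print(f"Mode age(s) = {modes}")
```

In this toy case the single hit at 58 lifts the mean above the median, and two ages tie for the mode, which is why pandas returns the mode as a Series rather than a single number.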

Note: An assumption in this analysis is that songs are written in the same year that they appear on the Hot 100 chart. Because there'll be a lag time between when the song is written and when it appears on the chart, there may be a slight bias to older ages in our statistics.

Finding the Success Sweet Spot Using a KDE Plot

A visual way to find the sweet spot is with a KDE plot. We'll use seaborn's kdeplot() method and pass it the melted DataFrame. We'll also set the hue argument to the "Color" column, which means it will ignore the artist's name and plot all the age values as a single group.

# Create a KDE plot for complete dataset:
ax = sns.kdeplot(data=melted_df, 
                 x='Age', 
                 hue='Color', 
                 fill=True, 
                 palette='dark:red_r', 
                 legend=False)

# Set x-axis tick positions:
ax.set_xticks(range(0, 85, 5))  

ax.set_title('KDE Plot of Billboard Hot 100 Hits by Age For All Artists',
             fontsize=14);
A KDE plot of the "Age" column for all the artists.

This plot confirms what we learned earlier, that the greatest success peaks at just over 30 years. The blip after 70 years represents Paul McCartney's collaboration with Kanye West in 2015 when McCartney was 73 years old.

Comparing Careers with a Facet Grid

Facet grids are ways to create multiple plots with shared axes that display different subsets of a dataset. We're going to make a facet grid that uses KDE plots to compare each artist's hit distribution.

We'll start by calling seaborn's FacetGrid class and passing it the melted DataFrame and its "Name" column. We'll want to compare all the artists with no distractions, so we'll use the same color, designated by the hue argument, for each curve. By setting the col_wrap argument to 2, we split the display into 2 columns with 8 curves in each.

With the facet grid defined, we'll call the kdeplot() method and map it to the facet grid, designated as the g variable. We'll set the bandwidth adjustment (bw_adjust) to 0.4 so that we don't smooth out all the variability in the data.

# Plot a Facet Grid of each artist's hits vs. age as a KDE chart:
g = sns.FacetGrid(data=melted_df, 
                  col='Name', 
                  hue='Color', 
                  aspect=3, 
                  height=1.00, 
                  col_wrap=2, 
                  palette='dark:red_r')

g.map(sns.kdeplot, 'Age',
      fill=True, 
      alpha=0.7, 
      lw=2, 
      bw_adjust=0.4)

g.set_axis_labels('Age', 'Density')
g.set_titles(col_template='{col_name}')
g.tight_layout()

# Loop through each subplot to set custom x-axis tick labels:
for ax in g.axes.flat:
    ax.set_xticks(range(0, 80, 10))
    ax.set_xticklabels(range(0, 80, 10))

# Add a title to the FacetGrid
g.fig.suptitle('Billboard Hot 100 Hits vs. Age', 
               y=1.03,
               fontsize=16);
A facet grid of each artist's KDE plot.

What a wonderful chart! Sleek and packed with information. This is where KDE plots come into their own.

With just a glance you can see Sting's early success with The Police followed by a successful solo career. Johnny Cash's bimodal distribution mirrors his struggle with drug addiction. Paul Simon's later career success with the Graceland album appears as a blip at age 45. And, as we saw in our earlier analysis, most of the peaks tend to cluster around ages 29–34.
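
If you'd like numbers to go with the visual impressions, a per-artist summary of the long-format table is one option. The sketch below uses made-up data and placeholder artist names. The same groupby() call should work on any table with "Name" and "Age" columns.

```python
import pandas as pd

# Hypothetical long-format table: one row per hit.
long_df = pd.DataFrame({
    'Name': ['Artist A'] * 4 + ['Artist B'] * 3,
    'Age': [23, 25, 25, 40, 31, 34, 36],
})

# Hit count plus youngest, median, and oldest age at a hit, per artist.
summary = long_df.groupby('Name')['Age'].agg(['count', 'min', 'median', 'max'])
print(summary)
```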

Plotting a Stacked KDE Plot

Another way to tell a story with KDE plots is to stack them in the same panel. This usually works better than stacking histograms, because overlapping bars quickly become hard to read.

For our current project, there are too many artists for this to work well at an individual level. But since our goal is to highlight the "success sweet spot" for all artists, it does an adequate job.

To stack the KDE plots, we just need to call the kdeplot() method without using the facet grid. With the default multiple='layer' setting, the curves are overlaid in the same panel rather than added on top of one another. An important parameter here is common_norm, which stands for "common normalization."

According to seaborn's documentation, when common_norm is True (the default), each density is scaled by its number of observations so that the total area under all the curves sums to one. When it's False, each density is normalized independently, so every curve has an area of one regardless of how many hits the artist had.

We have different sample sizes per artist and want artists with more hits to carry proportionally more weight, so we'll set common_norm to True.

# Create a stacked KDE plot:
fig, ax = plt.subplots(figsize=(10, 6))
ax = sns.kdeplot(data=melted_df, 
                 x='Age', 
                 hue='Name', 
                 fill=True, 
                 palette='dark:red_r', 
                 common_norm=True)
ax.set_xticks(range(0, 85, 5))
ax.set_title('Stacked KDE Plot by Artist')
ax.set_xlabel('Age')
ax.set_ylabel('Density');
Stacked KDE plots for each artist with common normalization set to "True".

While it's difficult if not impossible to identify the curve for specific artists (even if you use a different color per artist), it's pretty clear that the optimum age is around 30 years.

If you're curious, here's what the plot looks like with common_norm set to False:

Stacked KDE plots per artist with common normalization set to "False".
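
The difference between the two normalizations can be sketched numerically. This example builds the curves by hand with NumPy for two made-up artists and a fixed, assumed bandwidth of 3 years, then compares the area under each curve. It reflects my reading of what common_norm does and is not seaborn's actual implementation, which also picks a bandwidth for each group.

```python
import numpy as np

def kde(samples, grid, bandwidth):
    """Average of Gaussian bells centered on each sample (area of one)."""
    samples = np.asarray(samples, dtype=float)
    z = (grid[None, :] - samples[:, None]) / bandwidth
    bells = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return bells.mean(axis=0)

# Hypothetical artists with very different hit counts (9 vs. 3).
groups = {'Artist A': [24, 26, 27, 29, 30, 31, 33, 35, 38],
          'Artist B': [28, 31, 45]}
grid = np.linspace(-20.0, 100.0, 12001)
dx = grid[1] - grid[0]
bandwidth = 3.0  # assumed fixed kernel width, in years
total_hits = sum(len(ages) for ages in groups.values())

areas_independent = {}  # like common_norm=False: each curve has unit area
areas_common = {}       # like common_norm=True: scaled by share of all hits
for name, ages in groups.items():
    density = kde(ages, grid, bandwidth)
    areas_independent[name] = density.sum() * dx
    areas_common[name] = (density * len(ages) / total_hits).sum() * dx

for name in groups:
    print(f"{name}: independent area = {areas_independent[name]:.2f}, "
          f"common area = {areas_common[name]:.2f}")
```

Under the common normalization, the artist with nine hits accounts for three-quarters of the total area, so prolific artists dominate the figure. Under independent normalization, both curves have the same area.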

Plotting a Distribution Plot

Finally, let's visualize the data as a distribution plot. Seaborn's displot() method provides a figure-level interface for drawing distribution plots onto a seaborn FacetGrid. It lets you choose multiple plot types, such as KDEs and histograms, with the kind parameter.

Another nice feature is the inclusion of a "rug plot," added using the rug parameter. A rug plot marks the location of data points along an axis. This lets you see the limits of the actual data, which may be obscured by the "tails" of the KDE plot.

# Plot the distribution of hits vs age:
sns.displot(data=melted_df,
            x='Age', hue='Name',
            kind='kde', height=6,
            multiple='fill', clip=(0, None),
            palette='dark:red_r', 
            common_norm=True, rug=True);
A Distribution Plot including both KDE and rug plots (image by the author)

I personally find this plot difficult to parse, but it does highlight the peak success years around age 30.


Conclusion

KDE plots, with their smooth, pleasing shapes, are a great way to visualize univariate data. While functionally similar to histograms, they make it easier to see patterns in the data and to stack and compare multiple distributions in the same figure.

With the aid of KDE plots, we were able to show that singer-songwriters are most successful around the age of 30. So, if you want to start a career as a singer-songwriter, don't put it off!

Thanks!

Thanks for reading and please follow me for more Quick Success Data Science projects in the future.

Tags: Billboard Hot 100 Data Visualization Kernel Density Estimation Python Programming Songwriting
