Exploratory Data Analysis: What Do We Know About YouTube Channels (Part 2)

In the first part of the story, I collected statistical data from about 3000 YouTube channels and got some interesting insights. In this part, I will go a bit deeper, from the generic "channel" to the individual "video" level. I will show how to collect data about YouTube videos and what kind of insights we can get.
Methodology
To collect data about YouTube videos, we need to perform several steps:
- Get credentials for the YouTube Data API. It's free, and the default quota of 10,000 units per day is enough for our task.
- Find several YouTube channels that we want to analyze.
- Write some Python code to get the latest videos and their stats for a selected channel. YouTube analytics is available only to channel owners, and via the API we can only get a snapshot of the data at the current moment. But we can run the code periodically for some time. In my case, I collected data for three weeks using Apache Airflow and a Raspberry Pi.
- Perform the data analysis. I will be using Pandas, Matplotlib, and Seaborn for that.
Getting the YouTube API credentials and configuring Apache Airflow were described in my previous article, and I recommend that readers pause this one and read that part first:
Exploratory Data Analysis: What Do We Know About YouTube Channels
And now, let's get started.
1. Getting the Data
To get information about YouTube videos, I will use the python-youtube library. Surprisingly, there is no ready-to-use method to get the list of videos from a specific channel, and we need to implement it on our own.
First, we need to call the get_channel_info method, which, as its name suggests, returns basic information about the channel.
import logging
from typing import List, Tuple

from pyyoutube import Api


def get_channel_info(api: Api, channel_id: str) -> Tuple[str, str, str]:
    """ Get info about the channel. Return values: title, uploads, subscribers """
    channel_info = api.get_channel_info(channel_id=channel_id,
                                        parts=["snippet", "statistics", "contentDetails"],
                                        return_json=True)
    if len(channel_info["items"]) > 0:
        item = channel_info["items"][0]
        title = item["snippet"]["title"]
        uploads = item["contentDetails"]["relatedPlaylists"]["uploads"]
        subscribers = item["statistics"]["subscriberCount"]
        return title, uploads, subscribers
    logging.warning(f"get_channel_info::warning cannot get data for the channel {channel_id}")
    return None, None, None


api = Api(api_key="...")
get_channel_info(api, channel_id="...")
The output looks like this:
"items": [
{
"id": "UCBJycsmd...",
"snippet": {
"title": "Mar...",
"description": "MKBH...",
"publishedAt": "2008-03-21T15:25:54Z",
"contentDetails": {
"relatedPlaylists": {
"likes": "",
"uploads": "UUBJy..."
}
},
"statistics": {
"viewCount": "3845139099",
"subscriberCount": "17800000",
"hiddenSubscriberCount": false,
"videoCount": "1602"
}
}
]
Here, we have a statistics part, containing the number of videos, views, and subscribers for the channel. The second section is contentDetails; it is what we need because it contains the ID of the "uploads" playlist. As we can see, channel uploads are stored as a "virtual" playlist, which was a bit surprising to me.
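As a side note, the uploads playlist ID appears to be derived directly from the channel ID: in the JSON above, the channel "UCBJycsmd..." maps to the uploads playlist "UUBJy...". As far as I can tell, replacing the "UC" prefix with "UU" is an observed convention rather than a documented guarantee, so I still prefer asking the API for it:

def uploads_playlist_id(channel_id: str) -> str:
    """ Derive the uploads playlist ID from a channel ID (an observed
        convention, not a documented guarantee) """
    assert channel_id.startswith("UC")
    return "UU" + channel_id[2:]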
After that, we need to call the get_playlist_items method, which returns a list of videos from the required playlist.
def get_playlist_items(api: Api, playlist_id: str, limit: int) -> List[Tuple[str, str]]:
    """ Get video IDs and publication dates for a playlist """
    videos = []
    playlist_items = api.get_playlist_items(playlist_id=playlist_id, count=10, limit=10,
                                            parts=["contentDetails"], return_json=True)
    while True:
        for video in playlist_items["items"]:
            video_id = video["contentDetails"]["videoId"]
            video_published_at = video["contentDetails"]["videoPublishedAt"]
            videos.append((video_id, video_published_at))
        # The "nextPageToken" key is absent on the last page
        next_page_token = playlist_items.get("nextPageToken")
        if next_page_token is None or len(videos) >= limit:
            break
        playlist_items = api.get_playlist_items(playlist_id=playlist_id, count=10, limit=10,
                                                parts=["contentDetails"], return_json=True,
                                                page_token=next_page_token)
    return videos
The output looks like this:
"items": [
{
"kind": "youtube#playlistItem",
"etag": "tmSJMm9_KwkNTPkpdspUkQiQtuA",
"id": "VVVCSnljc21kdXZZRU...",
"contentDetails": {
"videoId": "Ks_7TmG...",
"videoPublishedAt": "2023-10-28T13:09:50Z"
}
},
...
]
Here, we will need the videoId and videoPublishedAt fields.
Only at this step, once we have the list of video IDs, can we get the number of views, likes, and comments for each video:
def get_video_by_id(api: Api, video_id: str) -> Tuple[str, str, str]:
    """ Get video details by ID """
    video_info = api.get_video_by_id(video_id=video_id, parts=["statistics"], return_json=True)
    if len(video_info["items"]) > 0:
        item = video_info["items"][0]
        views = item["statistics"]["viewCount"]
        likes = item["statistics"]["likeCount"]
        comments = item["statistics"]["commentCount"]
        return views, likes, comments
    return None, None, None
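One caveat worth mentioning: likeCount and commentCount can be absent from the statistics object, for example, when a channel hides likes or disables comments. If you run into KeyError exceptions, a more defensive variant (my addition, not part of the original code) could look like this:

def get_video_by_id_safe(api: Api, video_id: str) -> Tuple[str, str, str]:
    """ Same as get_video_by_id, but tolerates hidden likes and disabled comments """
    video_info = api.get_video_by_id(video_id=video_id, parts=["statistics"], return_json=True)
    if len(video_info["items"]) > 0:
        stats = video_info["items"][0]["statistics"]
        # dict.get returns "0" instead of raising a KeyError for missing fields
        return stats.get("viewCount", "0"), stats.get("likeCount", "0"), stats.get("commentCount", "0")
    return None, None, None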
As a final step, I created a method that combines all these parts together:
def get_channel_videos(api: Api, channel_id: str, limit: int) -> List:
    """ Get videos for the channel """
    videos_data = []
    title, uploads, subscribers = get_channel_info(api, channel_id)
    if title is not None and uploads is not None:
        title_ = title.replace(";", ",")  # The title may contain ";", which is our CSV delimiter
        videos = get_playlist_items(api, uploads, limit)
        for video_id, video_published_at in videos:
            views, likes, comments = get_video_by_id(api, video_id)
            videos_data.append((channel_id, title_, subscribers, video_id,
                                video_published_at, views, likes, comments))
    return videos_data
The limit variable is useful for debugging; it allows us to minimize the number of requests per query and not exceed the API quota.
As mentioned earlier, only channel owners can get historical and analytics data; we can only get the data available at the current moment. But we can easily request the data (the number of videos and their views, likes, and comments) periodically. Using Apache Airflow running on a Raspberry Pi, I kept this code running for three weeks. The requests were executed every 3 hours, and the output of each request was saved as a CSV file (more details and a DAG example are available in the first part). Let's now see what kind of insights we can get.
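For reference, the periodic collection task itself can be as simple as writing the output of get_channel_videos into a timestamped CSV file. Here is a minimal sketch of such a task; the file naming and the column order are my assumptions (they only need to match the ETL step below):

import csv
import datetime

def collect_channels(api: Api, channel_ids: List[str], limit: int = 50):
    """ Save the current stats of all tracked channels into a timestamped CSV file """
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
    with open(f"data/video-{timestamp}.csv", "w", newline="") as f_out:
        writer = csv.writer(f_out, delimiter=";")
        writer.writerow(["timestamp", "channelId", "title", "subscribers", "videoId",
                         "videoPublishedAt", "viewCount", "likeCount", "commentCount"])
        for channel_id in channel_ids:
            for row in get_channel_videos(api, channel_id, limit):
                writer.writerow([timestamp] + list(row))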
2. ETL (Extract, Transform, Load)
As usual, before using the data for analytics, we need to transform it into a convenient form. Our ETL process is pretty straightforward. As an output from the Apache Airflow task, I got a lot of CSV files. Let's load these files and combine them into one dataset:
import glob

import pandas as pd

channel_files = glob.glob("data/video*.csv")
channels_data = []
for file_in in channel_files:
    channels_data.append(pd.read_csv(file_in, delimiter=";",
                                     parse_dates=["timestamp"],
                                     date_format="%Y-%m-%d-%H-%M-%S"))
df_channels = pd.concat(channels_data)
Let's check the result for one video:
display(df_channels.query('videoId == "8J...4"').sort_values(by=["timestamp"], ascending=True))
The output looks like this:

Each row has a timestamp, video ID, video publication time, and the number of views, likes, and comments at the moment when the data was collected. As we can see, the video 8J...4 was published on October 27, 2023, at 19:00. At the beginning of my observation, it already had 514,948 views, and by the end of the dataframe, the number of views had increased to 978,573.
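As a quick sanity check, we can also aggregate the first and last observed view counts per video (a small helper of my own, not part of the original pipeline):

# First and last observed view counts per video, plus the gain in between
df_growth = (df_channels.sort_values("timestamp")
             .groupby("videoId")["viewCount"]
             .agg(["first", "last"]))
df_growth["gain"] = df_growth["last"] - df_growth["first"]
display(df_growth.sort_values("gain", ascending=False).head(10))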
Now, we're ready to have some fun.
3. Data Analysis
3.1 Number of Views
As a warm-up, let's display the number of views per video. I will use only videos made within the last two months.
channel_id = "UCu..."
df_channel = df_channels[df_channels["channelId"] == channel_id]
df_channel = df_channel.sort_values(by=["timestamp"], ascending=True)

# Videos published within the interval
days_display = 2*31
start_date = df_channel["timestamp"].max() - pd.Timedelta(days=days_display)
end_date = df_channel["timestamp"].max()
df_channel = df_channel[(df_channel["videoPublishedAt"] >= start_date) &
                        (df_channel["videoPublishedAt"] < end_date)]
I collected the channel data every 3 hours, so I need only the last timestamps:
step_size = 3
interval_start = df_channel["timestamp"].max() - pd.Timedelta(hours=step_size)
interval_end = df_channel["timestamp"].max()
df_interval = df_channel[(df_channel["timestamp"] >= interval_start) &
                         (df_channel["timestamp"] < interval_end)]
df_interval = df_interval.drop_duplicates(subset=["videoId"])
v_days = df_interval["videoPublishedAt"].values
v_views = df_interval["viewCount"].values
Let's draw the bar chart using Matplotlib:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib.ticker import FuncFormatter

fig, ax = plt.subplots(figsize=(16, 4))
cmap = plt.get_cmap("Purples")
views_max = 3_000_000
views_avg = df_channel.drop_duplicates(subset=["videoId"], keep="last")["viewCount"].median()  # Median value
rescale = lambda y: 0.5 + 0.5 * y / views_max

# Bar chart
ax.bar(v_days, v_views,
       color=cmap(rescale(v_views)),
       width=pd.Timedelta(hours=12))

# Add a horizontal median line
ax.axhline(y=views_avg, alpha=0.2, linestyle="dotted")
trans = ax.get_yaxis_transform()
ax.text(0, views_avg, " Median ", color="gray", alpha=0.5, transform=trans, ha="left", va="bottom")

# Title
subscribers = df_channel.iloc[[0]]["subscribers"].values[0]
title_str = f"YouTube Channel, {subscribers/1_000_000:.1f}M subscribers"

# Adjust the axes
ax.xaxis.set_major_formatter(mdates.DateFormatter("%d/%m"))
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, p: format(int(x), ",")))
ax.xaxis.set_major_locator(mdates.WeekdayLocator(byweekday=mdates.SU))
ax.set(title=title_str,
       xlabel="Video Publication Date",
       ylabel="Views",
       xlim=(start_date, end_date),
       ylim=(0, views_max))
plt.tight_layout()
plt.show()
Here, I use ax.bar to draw the bar chart and the rescale function to adjust the bar colors. The horizontal line helps to see whether a video's views are above or below the channel's median.
First, let's see the result for a channel with 23.9M subscribers, publishing videos in the "makeup" category:

The result is interesting. We can see that the number of views per video is more or less consistent, and the median value is about 1M views. Is that a lot? The value itself is obviously large. But the channel has almost 24M subscribers, who are supposed to be interested in this content and to get notifications when new videos are published. A 1/24 ratio does not look big to me; maybe people subscribe but lose interest pretty fast?
Another interesting "anomaly" with this channel caught my attention. At some point, I set the display interval to one year, and a large spike in the number of views became visible:

Apparently, the author published a lot of "short" videos, which got many (3-10 million) views in that period. What happened later? Maybe the channel's editor changed? Maybe creating "shorts" was not profitable? I don't know. It might be possible to watch all the videos in a web browser and try to find the reason, but that was definitely out of scope for this test, and I'm not an expert in makeup anyway.
As another example, let's see the views for a channel with 17.8M subscribers, making gadget reviews:

I don't know if the results are content-specific ("gadget reviews" and "makeup" are naturally intended for different audiences), but this channel has a much higher median number of views per video compared to the first one.
Let's now see how many views channels with a smaller audience can get. This gadget-related channel has 1.3M subscribers:

The difference is significant. A channel with 17.8M subscribers gets about 3 million views per video, and the channel with 1.3M subscribers gets "only" 300K. For comparison, the next photography-related channel has a 115K audience:

In this case, the channel has a median of 25K views per video.
Obviously, videos are not only shown to subscribers but are also displayed to anyone via the YouTube recommendation system. What is the real ratio? We don't know. From the bar charts, I can guess that only about 20% of subscribers are "active". The others probably subscribed a long time ago and are not interested in the content anymore. That makes sense; for example, if I am going to buy a laptop, I may subscribe to a hardware reviews channel, but I will not be interested anymore after making the purchase.
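As a rough back-of-the-envelope check (my own addition, reusing the df_interval and subscribers variables from the code above), we can relate the median views of recent videos to the subscriber count. This is only a crude proxy for the share of "active" subscribers, since views also come from non-subscribers:

# Median views of recent videos as a fraction of the subscriber count
views_to_subs = df_interval["viewCount"].median() / float(subscribers)
print(f"Median views per video: {views_to_subs:.1%} of subscribers")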
3.2 Views Dynamics
We were able to see the number of views per video, but how fast do videos get these views? We already have a Matplotlib bar chart; let's animate it! Only channel owners have access to the historical data, but I made requests over three weeks, and we can easily see how the values changed within this interval.
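The animation callback below uses a small helper, get_views_per_interval, whose listing is not shown in the article; a minimal reconstruction, based on the filtering code from section 3.1, might look like this (day_vline and bar_width are likewise assumed to be the vertical line and the bar width created together with the original figure):

def get_views_per_interval(df_channel: pd.DataFrame,
                           interval_start: pd.Timestamp,
                           interval_end: pd.Timestamp):
    """ Publication dates and view counts observed within a time interval """
    df_interval = df_channel[(df_channel["timestamp"] >= interval_start) &
                             (df_channel["timestamp"] < interval_end)]
    df_interval = df_interval.drop_duplicates(subset=["videoId"])
    return df_interval["videoPublishedAt"].values, df_interval["viewCount"].values

With that helper in place, updating the graph for each frame looks like this: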
import matplotlib.animation as animation

def animate_bar(frame_num: int):
    """ Update graph values according to the frame number """
    interval_start = df_channel["timestamp"].min() + pd.Timedelta(hours=step_size*frame_num)
    interval_end = df_channel["timestamp"].min() + pd.Timedelta(hours=step_size*(frame_num + 1))
    day_str = interval_start.strftime("%d/%m/%Y %H:00")
    days, views = get_views_per_interval(df_channel, interval_start, interval_end)
    print(f"Processing {day_str}: {views.shape[0]} items")
    # bar_width and day_vline were created together with the original figure
    bar = ax.bar(days, views,
                 color=cmap(rescale(views)),
                 width=pd.Timedelta(hours=bar_width))
    day_vline.set_xdata([interval_start])
    ax.set(title=f"{title_str}: {day_str}")
    return bar,

step_size = 3
num_frames = (df_channel["timestamp"].max() - df_channel["timestamp"].min())//pd.Timedelta(hours=step_size)
anim = animation.FuncAnimation(fig, animate_bar, repeat=True, frames=num_frames)
writer = animation.PillowWriter(fps=5)
anim.save("output.gif", writer=writer)
Here, I created a FuncAnimation object, which takes the animate_bar function as a parameter. This function is called automatically with different frame numbers; inside it, I create a new bar chart and update the title. I also added a vertical line representing the current date.
The output looks like this:

From this animation, we can see that a new video apparently gets at least 70% of its views within the first week. Older videos also get some views, but this process is much slower.
But there can be exceptions. In the next example, the channel has a median of 90K views per video, but one of its videos probably went viral, was shared a lot, and got about a million views within 2-3 weeks:

3.3 Views Distribution
After watching the bar chart, I asked myself a question: is the distribution of views per video normal? Obviously, some videos get more views than others, but how consistent is that? It is easy to find an answer using Seaborn's histplot method.
import seaborn as sns

channel_id = "UCu..."
df_channel = df_channels[df_channels["channelId"] == channel_id]
display(df_channel.drop_duplicates(subset=["videoId"]))

step_size = 3
interval_start = df_channel["timestamp"].max() - pd.Timedelta(hours=step_size)
interval_end = df_channel["timestamp"].max()
df_interval = df_channel[(df_channel["timestamp"] >= interval_start) &
                         (df_channel["timestamp"] < interval_end)].drop_duplicates(subset=["videoId"])

# Title
subscribers = df_channel.iloc[[0]]["subscribers"].values[0]
title_str = f"YouTube Channel, {subscribers/1_000_000:.1f}M subscribers"
# Median
views_avg = df_channel["viewCount"].median()

# Draw
fig, ax = plt.subplots(figsize=(12, 5))
sns.set_style("white")
sns.histplot(data=df_interval, x="viewCount", stat="percent", bins=50)
ax.set(title=title_str,
       xlabel="Views Per Video",
       ylabel="Percentage",
       xlim=(0, None),
       ylim=(0, 18))
ax.axvline(x=views_avg, alpha=0.2, linestyle="dotted")
ax.xaxis.set_major_formatter(FuncFormatter(lambda x, p: format(int(x), ",")))
plt.tight_layout()
plt.show()
For this test, I used 500 as a video limit for the API request. The result for a channel in the "gadget reviews" category looks like this:

17.8 million subscribers is a large number; this channel is definitely one of the top ones, and as we can see, it produces more or less consistent results. The distribution looks roughly normal, but it is slightly skewed: the median value for this chart is 3.8M views per video, but some videos got more than 10M views, and only 3 videos out of 500 got more than 20 million.
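Instead of only eyeballing the histogram, we can also quantify the asymmetry. Here is a quick check of my own using the sample skewness (around 0 for a symmetric distribution, positive for a long right tail):

# Positive skew means a long right tail: a few videos with very many views
print(f"Skewness: {df_interval['viewCount'].skew():.2f}")
print(f"Median: {df_interval['viewCount'].median():,.0f}, mean: {df_interval['viewCount'].mean():,.0f}")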
A similar pattern can be seen for channels with fewer subscribers, but in their case, the distribution is even more skewed:

This data may require a more detailed analysis. For example, it turned out that "normal" videos and "Shorts" can have drastically different numbers of views, and ideally, they should be analyzed separately; one possible way to tell them apart is sketched below.
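The YouTube Data API does not return an explicit "this is a Short" flag, so separating the two groups requires a heuristic. One possible approach, sketched here as my assumption rather than the article's method, is to check the video duration from the contentDetails part (Shorts were limited to 60 seconds at the time of writing):

import re

def is_probably_short(api: Api, video_id: str) -> bool:
    """ Heuristic: treat videos of 60 seconds or less as Shorts """
    video_info = api.get_video_by_id(video_id=video_id, parts=["contentDetails"], return_json=True)
    duration = video_info["items"][0]["contentDetails"]["duration"]  # ISO 8601, e.g. "PT1M30S"
    match = re.match(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", duration)
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds <= 60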
3.4 Bonus: Views Per Single Video
This article is already long, so I will give a bonus to readers who were patient enough to read to this point. In 3.2, I made an animation showing that most videos get the majority of their views soon after publication (by the way, this is also true for TDS and Medium articles). Can we see this process in more detail? Actually, we can. I collected the data over several weeks, and enough videos were published during that interval. Finding the latest videos is straightforward because we have the videoPublishedAt parameter:
# Find the newest videos for a specific channel
df_channel = df_channels[df_channels["channelId"] == "UCB..."]
num_videos = 5
df_videos = df_channel.drop_duplicates(subset=["videoId"]).sort_values(by=["videoPublishedAt"], ascending=False)
As a reminder, the data for a specific video looks like this:

Then, I "normalized" this data: my goal is to display the number of views as a function of time since publication, with the publication time treated as "0":
def get_normalized_views(df_channel: pd.DataFrame, video_id: str) -> pd.DataFrame:
    """ Get relative views for a specific video """
    df_video = df_channel[df_channel["videoId"] == video_id].sort_values(by=["timestamp"], ascending=True)
    # Insert an empty row with zero values at the beginning
    video_pub_time = df_video.iloc[[0]]["videoPublishedAt"].values[0]
    start_row = {"videoPublishedAt": video_pub_time,
                 "timestamp": video_pub_time,
                 "viewCount": 0, "likeCount": 0, "commentCount": 0}
    df_first_row = pd.DataFrame(start_row, index=[0])
    df_video_data = df_video[df_first_row.columns]
    df_video_data = pd.concat([df_first_row, df_video_data], ignore_index=True)
    # Make the data relative, starting from the publication time
    df_first_row = df_video_data.iloc[[0]].values[0]
    df_video_data = df_video_data.apply(lambda row: row - df_first_row, axis=1)
    # Convert timedeltas to fractional days for a more readable X-axis
    df_video_data["daysDiff"] = df_video_data["timestamp"].map(lambda x: x.total_seconds()/(24*60*60), na_action=None)
    return df_video_data
Here, I also converted the timestamps to the number of days since the publication date to make the graph more convenient to read.
Now, we can draw the graph using Matplotlib:
fig, ax = plt.subplots(figsize=(10, 6))

# Title
subscribers = df_channel.iloc[[0]]["subscribers"].values[0]
title_str = f"YouTube Channel with {subscribers/1_000_000:.1f}M Subscribers, Video Views"

# Videos data
for p in range(num_videos):
    video_id = df_videos.iloc[[p]]["videoId"].values[0]
    df_video_data = get_normalized_views(df_channel, video_id)
    ax.plot(df_video_data["daysDiff"], df_video_data["viewCount"])

# Params
ax.set(title=title_str,
       xlabel="Days Since Publication",
       ylabel="Views",
       xlim=(0, None),
       ylim=(0, None))
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, p: format(int(x), ",")))
ax.tick_params(axis="x", rotation=0)
plt.tight_layout()
plt.show()
The result looks like this:

Here, the line lengths are different because the videos were published at different times. The oldest video was published almost two weeks earlier, and the latest one only two days before this data was collected.
From this graph, I have two interesting observations.
First, at least for this channel, my assumption was correct, and these videos got most of their views immediately after publication. Even more interestingly, some of the curves (for example, the red and green ones) are almost identical.
Second, attentive readers may notice two distinct groups: the first two videos are getting about 3M views, while the other three will apparently end up with about 0.5M. Indeed, the videos are different: the lines at the top represent "normal" videos, and the lines at the bottom represent YouTube "Shorts". Apparently, at least for this channel, the audience's interest in "Shorts" is lower.
But obviously, the results may vary. Firstly, some videos may become more popular or even go viral and get a much larger number of views:

Secondly, the content itself also matters. For example, gadget reviews are mostly interesting while they are "fresh", but videos about health, relationships, sports, makeup, or similar topics can stay useful to viewers for a much longer time. Last but not least, these particular channels have a lot of subscribers, and their videos get many views soon after publication. For "newbies", the results may be different, and most viewers of a new channel may come from the YouTube recommendation system or search results. Thus, I can only recommend that readers do their own research and select YouTube channels that are closest to what they want to know.
Conclusion
In this article, I showed how to collect and analyze data about different YouTube channels and videos. In the first part, I focused on general properties, like the number of views per channel. In this part, I focused on individual videos. We were able to see how often new videos are published on different channels, how many views they can get, and how fast this process goes. These analytics are usually available only to channel owners, but with the help of the YouTube Data API, we can collect the data for free and with good precision. This can be interesting not only for those who want to start a new channel but also from cultural and statistical perspectives.
Obviously, YouTube is a gigantic streaming platform with millions of channels and billions of videos. Videos about cats, mathematical problems, or laptop reviews can get wildly different numbers of views, likes, and comments. So, I encourage readers to do their own tests with the channels they are interested in. In this article, I focused only on views, but the number of comments or likes can be analyzed in the same way (by the way, we can still get likes via the API, but YouTube removed the number of dislikes from public access in 2021).
In the next and last part, I will focus on YouTube "Shorts". These videos are displayed on a separate YouTube page with a different UI, and apparently, their numbers of views and likes can be drastically different. Stay tuned.
Those who are interested in social data analysis are also welcome to read other articles:
- Exploratory Data Analysis: What Do We Know About YouTube Channels (Part 1)
- Housing Rental Market in Germany: Exploratory Data Analysis with Python
- What People Write about Climate: Twitter Data Clustering in Python
- Finding Temporal Patterns in Twitter Posts: Exploratory Data Analysis with Python
- Python Data Analysis: What Do We Know About Pop Songs?
If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors. The full source code and a Jupyter notebook for this article are also available on my Patreon page.
Thanks for reading.