Python Data Analysis: What Do We Know About Modern Artists?


Opinions about popular and contemporary culture may differ, but it is not only an important part of everyday life; it is also a billion-dollar business. Thousands of artists create new work in different genres. Can we find interesting patterns in all of that? Indeed, we can, and in this article, I will show a way to extract, analyze, and visualize Wikipedia data.

Why Wikipedia? There are several reasons for that. First, it is an open-source encyclopedia, maintained by a lot of contributors, and the more influential an artist is, the higher the chance that there will be a detailed article about him or her. Second, almost every Wikipedia page has hyperlinks to other pages; this allows us to track patterns that are not easily visible to the naked eye. For example, we can see a group of artists performing in a specific genre or even making songs about a specific subject.

Methodology

To perform the analysis, I will implement several steps:

  • Collecting the data. I will use an open-source Wikipedia library for that. The data will be saved as a NetworkX graph, where each node represents a Wikipedia page and each edge represents a link from one page to another.
  • Preprocessing. Wikipedia pages contain a lot of information, but not all of it is relevant to our analysis. For example, a page about video game musicians contains links not only to musicians but also to games like "Duke Nukem" or "Call of Duty." I will use a Large Language Model (LLM) to group all nodes into different categories.
  • Data analysis. With the help of the NetworkX library, we will be able to find the most popular or important nodes in each category.
  • Data Visualization. This is a large topic on its own, and I will publish it in the next part. I will show how to draw a graph with NetworkX and D3.js.

Let's get started!

1. Collecting the data

This article is focused on information about artists, so as a starting point, I used a List of Musicians Wikipedia page. This page has a lot of extra information, so I saved only the needed part in a text file:

# A
List of acid rock artists
List of adult alternative artists
...

# W
List of West Coast blues musicians

I also created a helper method to load this file into the Python list:

from typing import List

def load_links_file(filename: str) -> List:
    """ Load a list of Wikipedia links from a text file """
    with open(filename, "r", encoding="utf-8") as f_in:
        root_list = filter(lambda name: len(name.strip()) > 1 and name[0] != "#", f_in.readlines())
        root_list = list(map(lambda name: name.strip(), root_list))
    return root_list

To read Wikipedia pages, I will use an open-source Python library with the (obvious) name Wikipedia. The same links can appear on many different pages, so it makes sense to use Python's lru_cache to make loading faster:

from dataclasses import dataclass
from functools import lru_cache
from typing import List, Optional

import wikipedia

@dataclass
class PageData:
    """ Wikipedia page data """
    url: str
    title: str
    content: str
    links: List

@lru_cache(maxsize=1_000_000)
def load_wiki_page(link_name: str) -> Optional[PageData]:
    """ Load Wikipedia page """
    try:
        page = wikipedia.page(link_name)
        return PageData(page.url, page.title, page.content, page.links)
    except Exception as exc:
        print(f"load_wiki_page error: {exc}")
    return None
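A quick check that the loader works (the page title here is just an example):

page_data = load_wiki_page("List of ambient music artists")
if page_data:
    print(page_data.title, len(page_data.links))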

With the help of the Wikipedia library, the process is straightforward. As a first step, I will load the list pages from the text file. Each such page (for example, the list of ambient music artists) contains 10–100 artist names; as a result, I will get a total list of about 2,000 Wikipedia pages (I saved this list into a pop_artists.txt file). As a second step, I will call the load_wiki_page function again for every artist page, and the nodes and edges will be saved in a graph:

import networkx as nx

def load_sub_pages(graph: nx.Graph, page_data: PageData):
    """ Add all links of a page as graph nodes and edges """
    for link_name in page_data.links:
        sub_title = link_name
        graph.add_node(sub_title)
        graph.add_edge(page_data.title, sub_title)

# Create graph
G = nx.Graph()

root_list = load_links_file("pop_artists.txt")
# Retrieve
for artist_name in root_list:
    page_data = load_wiki_page(artist_name)
    if page_data:
        G.add_node(page_data.title)
        load_sub_pages(G, page_data)

# Save to file
nx.write_gml(G, "graph.gml")

Despite the code's simplicity, the process takes a long time. The number of pages is large, and the Wikipedia API is also not the fastest in the world. Practically, the data collection took about a day (I used a Raspberry Pi for that), and the output GML file was about 200 MB in size.

When the graph is saved, we can see its nodes in the Jupyter Notebook:

import pandas as pd

nodes = [(name, len(G.edges(name))) for name in G.nodes()]

df_nodes = pd.DataFrame(nodes, columns=['Name', 'Edges'])
display(df_nodes.sort_values(by=["Edges"], ascending=False))

Here, I also sorted all graph nodes by the number of edges. If everything was done correctly, the output should look like this:

2.1 Preprocessing

In the first step, we collected data about various artists and made a graph from it, but in its raw form, the graph is not very useful yet. For example, in the screenshot, we can see the nodes "Hip Hop Music," "AllMusic," and "United States." These terms obviously belong to different topics, and it would be nice to analyze them separately. This article is focused on artists, and to make a more detailed analysis, I decided to split all nodes into seven categories:

categories = [
    "person", "place", "country",
    "music style", "music instrument", "music band",
    "other"
]

Now, we need to find a proper category for each graph node. However, that is easier said than done. For every Wikipedia artist page, the "links" object is a flat Python list that can contain anything, from musical instruments to LGBT rights, and there is no easy way to distinguish one subgroup of links from another. Finding a category for terms like "Hip hop" or "London" is an NLP (Natural Language Processing) problem, so I decided to use a Large Language Model (LLM) to assign a category to each term.

I decided to use a free Llama-3 8B model for the task. After some testing, I created this prompt:

categories_str = ', '.join(categories)

prompt_template = f"""
You are a music expert.
You have {len(categories)} categories: {categories_str}.
I will give you a list of words. Write the category for each word.
Write the output in the JSON format [{{word: category}}, ...].
Here is the example.

Words:
Rap; John; Billboard Top 100

Your Answer:
[
  {{"Rap": "music style"}},
  {{"John": "person"}},
  {{"Billboard Top 100": "other"}},
]

Now begin! Here is the list of words:
---
_ITEMS_
---

Now write the answer.
"""

I already wrote an article about using an LLM to process Pandas dataframes; readers can find more details here:

Process Pandas DataFrames with a Large Language Model

Using the code from that article, I can process the dataframe in several lines of code:

# Load graph
G = nx.read_gml("graph.gml")

# Create dataframe 
nodes = [(name, len(G.edges(name))) for name in G.nodes()]
df_nodes = pd.DataFrame(nodes, columns=['Name', 'Edges'])

# Filter
edge_threshold = 400
df_top_nodes = df_nodes[df_nodes["Edges"] > edge_threshold].copy()

# Process: llm_map (from the article linked above) asks the LLM
# to assign a category to each name, in batches of 10
df_top_nodes["Type"] = llm_map(df_top_nodes["Name"], batch_size=10)

Here, I used two tricks:

  • First, I process only the graph nodes that have many edges. The reason is straightforward: I want to find the most important nodes in the graph, and nodes with a small number of edges will not make it into the top rankings anyway. We could process all nodes, and there is nothing wrong with that, but processing 900,783 nodes with an LLM would simply take too much time and/or money.
  • Second, I use a small batch size of 10 for processing. As we can see from the prompt, I asked the model to return JSON, and it turned out that an 8B Llama-3 model is not that good at it. Llama-3 has a context length of 8,192 tokens, so in theory it can process pretty long prompts. Alas, this does not work well in practice: as the text size increases, the model starts to make mistakes. For example, it can produce this kind of output, which almost looks like normal JSON, but a colon in the middle is missing, and a Python parser cannot decode it anymore (a minimal defensive parser is sketched after the example):
[
  {"Rap": "music style"},
  {"John"  "person"},
  {"Billboard Top 100": "other"},
  ...
]
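A simple way to survive such answers is to decode each batch defensively and resend (or skip) batches whose JSON cannot be parsed. Here is a minimal sketch, assuming the model's reply is available as a plain string:

import json
from typing import Dict, Optional

def parse_llm_reply(reply: str) -> Optional[Dict[str, str]]:
    """ Decode the model's JSON answer; return None if it is malformed """
    try:
        items = json.loads(reply)
        # Merge the list of single-key dicts into one {word: category} mapping
        return {key: val for item in items for key, val in item.items()}
    except (json.JSONDecodeError, TypeError, AttributeError):
        return None

With a batch size of 10, resending a failed batch to the model is cheap.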

In general, the processing took 30–60 minutes in Google Colab, depending on the selected GPU type and the number of items. The RAM requirement for the 8B model is small, and the task can be done with a free Colab account. Readers who have an OpenAI key can use it as well; the result would be even more accurate, but the processing will not be free.

When the processing is done, we can write the new Type column from the dataframe back into the graph as a node attribute:

for index, row in df_top_nodes.iterrows():
    node_name, node_type = row['Name'], row['Type']
    if node_type is not None:
        nx.set_node_attributes(G, {node_name: node_type}, "type")

nx.write_gml(G, "graph_updated.gml")

2.2 Evaluation

We know that language models are not perfect, even the large ones. Let's load the data back into a dataframe and verify the results:

G = nx.read_gml("graph_updated.gml")

# Nodes below the edge threshold were never categorized, hence .get()
nodes = [(name, len(G.edges(name)), data.get("type")) for name, data in G.nodes(data=True)]

df_nodes = pd.DataFrame(nodes, columns=['Name', 'Edges', 'Type'])
display(df_nodes.sort_values(by=["Edges"], ascending=False)[:20])

Here, the column Type contains the result we got from the LLM. I also sorted the nodes by the number of edges and displayed the first 20 items. The result looks like this:

As we can see, it is good enough, especially considering that I used a free model to find the categories. However, the result is not always accurate, and we can make some optional fixes. For example, the Llama-3 model decided that "MTV" is a place. Generally speaking, that may be true, but in the context of the music industry, it is not the best choice. We can fix it manually:

fix_nodes = {
    "MTV": "other",
    "Latin Church": "other",
    ...
}
nx.set_node_attributes(G, fix_nodes, "type")

Anyway, the 8B Llama-3 model did most of the job, and only a small number of nodes (about 30 in my case) had to be fixed manually. As mentioned before, readers can also use OpenAI or any other public API; the result would be more accurate, but the processing would no longer be free.

3.1 Analysis

Finally, we are approaching the fun part: let's see what kind of data we can get from our Wikipedia artists' dump. As a reminder, I collected all Wikipedia pages from the lists of modern musicians and put the pages and their internal links into a graph. With the help of an LLM, I split the graph nodes into seven categories: "person," "place," "country," "music style," "music instrument," "music band," and "other."

The first metric I will check is called degree centrality. The degree centrality of a node is the fraction of all other nodes it is connected to. With NetworkX, we can get it in one line of code:

degree_centrality = nx.degree_centrality(G)
display(degree_centrality)

#> {'Rapping': 0.00138, '106 and Park': 3.3304e-05, ... 
#   '2013 Billboard Music Awards': 1.22116e-05}
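For reference, NetworkX computes degree centrality as the node degree divided by n - 1, where n is the number of nodes, so the values can be cross-checked against the edge counts we already used (assuming the graph has no self-loops):

n = G.number_of_nodes()
# Should match degree_centrality["Rapping"]
print(len(G.edges("Rapping")) / (n - 1))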

With this data, let's find the top 20 most popular music styles:

nodes = [(name, data) for name, data in G.nodes(data=True) if data.get("type") == "music style"]
node_names = [name for name, _ in nodes]
node_centrality = [degree_centrality[name] for name, _ in nodes]
node_edges = [len(G.edges(name)) for name, _ in nodes]

df_centrality = pd.DataFrame({
    "Name": node_names,
    "Centrality": node_centrality,
    "Edges": node_edges

})
display(df_centrality.sort_values(by=["Centrality"], ascending=False)[:20])

The output looks like this:

We can also see it in chart form:

Music styles bar chart, Image by author

Obviously, this "Wikipedia top" represents not the number of active listeners of the particular genre but its popularity on Wikipedia. It may be interesting to see if these values are correlated, but it is not so easy to do – the popularity of different genres varies per country and media type (streaming, sales, etc.). Anyway, readers are welcome to compare this result with the charts in their own country.

Let's now display the Wikipedia top-20 artists. The code is the same; I only changed the category filter to "person."
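For completeness, here is a sketch of that change; everything except the filter is identical to the music-style snippet above:

nodes = [(name, data) for name, data in G.nodes(data=True) if data.get("type") == "person"]
node_names = [name for name, _ in nodes]
node_centrality = [degree_centrality[name] for name, _ in nodes]
node_edges = [len(G.edges(name)) for name, _ in nodes]

df_centrality = pd.DataFrame({
    "Name": node_names,
    "Centrality": node_centrality,
    "Edges": node_edges
})
display(df_centrality.sort_values(by=["Centrality"], ascending=False)[:20])

The output looks like this: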

Here, we can see an amusing artifact of the data processing. Names like "Jesus" or "Zeus" are obviously not artists, and neither is the American professor Noam Chomsky. The LLM correctly identified these nodes as "persons"; those links were probably present on some musicians' pages. As for real artists, Beyoncé is at the top of our list, closely followed by the Indian composer A. R. Rahman; readers can verify the other names on their own.

How is our "Wikipedia rating" correlated to modern audience listening preferences? I compared this table with the 2024 Most Streamed Artists Wikipedia page. Interestingly, the correlation is not that big. For example, Beyoncé is in the 1st place in our rating and at the 22nd position on Spotify; many other names like Frank Sinatra, Elvis Presley, or Celine Dion are not on Spotify's top at all. Why does it happen? Well, I have two ideas:

  • Listening to music on streaming services is mostly "passive." Listeners are not actively engaged in getting to know the artists or reading and adding information about them on Wikipedia. It is like a radio playing in the background: people don't really care which track is playing at the moment. Moreover, the current playlist may be generated by Spotify's recommender system rather than chosen by the listener at all.
  • A "Wikipedia rating" of the artists is more relevant not to the number of listeners but to that artist's contribution to music and society in general. For example, I don't think that many people are actively listening to Frank Sinatra nowadays, but his contribution to the culture, music industry, and other artists' work is obviously big.

Anyway, the results are interesting, and more detailed research can be done.

The next interesting metric is betweenness centrality. It measures how often a node acts as a bridge along the shortest paths between other nodes; in the context of the music industry, this could be an artist who performs in several different genres.

We can measure the betweenness centrality with NetworkX:

node_types = nx.get_node_attributes(G, "type")

def filter_node(node: str):
    """ Keep only nodes of a particular type """
    # Nodes that were never categorized have no "type" attribute, hence .get()
    return node_types.get(node) == "person"

Gf = nx.subgraph_view(G, filter_node=filter_node)

betweenness_centrality = nx.betweenness_centrality(Gf)

As for the result, alas, it did not work, at least on my PC. Finding all the shortest paths requires a lot of computation; even for several thousand nodes, it takes too long, and I was not patient enough to wait for the final result. I don't know whether there are more optimized libraries, maybe with CUDA support; if someone knows the answer, feel free to write in the comments.
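One possible workaround (not tried here) is NetworkX's built-in approximation: betweenness_centrality accepts a k parameter that estimates the metric from a random sample of k source nodes instead of all of them, trading accuracy for speed:

# k must not exceed the number of nodes in Gf; 500 is only an illustration
betweenness_centrality = nx.betweenness_centrality(Gf, k=500, seed=42)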

3.2 Community Detection

The last thing I want to try in this article is community detection. A graph is said to have a community structure if its nodes can be grouped into sets that are densely connected internally and only sparsely connected to the rest of the graph. For example, let's find groups of artists performing in different genres.

First, let's filter the graph:

node_types = nx.get_node_attributes(G, "type")

def filter_node(node: str):
    """ Keep nodes of particular types """
    # Filter the full graph, not the person-only subgraph from the previous step
    return node_types.get(node) in ("person", "music style")

Gf = nx.subgraph_view(G, filter_node=filter_node)

NetworkX has a lot of different algorithms to find communities; let's try some of them.

The Girvan–Newman algorithm detects communities by progressively removing edges from the original graph:

import itertools

communities_generator = nx.community.girvan_newman(Gf)
for com in itertools.islice(communities_generator, 7):
    communities_list = com

print(communities_list)
#> ({'A. D. King', 'A. R. Rahman', 'Adam', 'Admiral', ... },
#   {'Cheyenne', 'Jesse James'}, ...)
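Before drawing anything, the community sizes alone give a rough feel for how balanced the split is:

# Print the sizes of the detected communities
print(sorted(len(community) for community in communities_list))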

Naturally, it's hard to estimate the separation quality of hundreds of items manually, but we can do it in a visual form. In my case, the result was not perfect:

Girvan–Newman groups visualization, Image by author

As we can see, the algorithm detected only the biggest "blob" and several short "strings" at the periphery of the graph.

Let's try another approach. The greedy modularity algorithm uses Clauset-Newman-Moore modularity maximization to find communities:

communities_list = nx.community.greedy_modularity_communities(Gf, cutoff=10)

The result looks like this:

Greedy modularity groups visualization, Image by author

A manual verification with D3.js shows that this result is much more interesting. For example, it is easy to see a separate group of musicians performing in the K-pop (Korean pop) style on the left side of the graph:

Graph visualization, Image by author

Another group of nodes contains not artists but philosophers, which was also interesting to see:

Graph visualization, Image by author

Naturally, other types of communities can also be found; for example, it is possible to group artists by the city or country they come from or by the musical instruments they play. The list of possible categories can also be extended.

Conclusion

In this article, I described the process of building a graph from Wikipedia pages. For the analysis, I collected a graph with about 900K nodes, containing information (pages and links) about modern artists. I also processed those nodes with an LLM to split them into categories. With the help of the open-source NetworkX library, it was possible to find interesting patterns in this graph and to identify its most important nodes.

Nowadays, almost every influential person has a Wikipedia page. Different pages are connected with each other, and the analysis of this data can be valuable for social science, musicology, and cultural anthropology. As for the results, they are interesting. For example, the "Wikipedia top rating" of the most popular artists does not match the Spotify top. This was surprising to me, and it is interesting to wonder how often fans visit (or update) the Wikipedia pages of the artists they listen to.

Obviously, the same analysis can be done not only for artists but also for politics, sports, and other domains.

Those who are interested in social data analysis are also welcome to read my other articles.

If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors. You are also welcome to connect via LinkedIn. If you want to get the full source code for this and other posts, feel free to visit my Patreon page.

Thanks for reading.

Tags: Data Science Data Visualization Hands On Tutorials Music Wikipedia
