Social Network Analysis with NetworkX: A Gentle Introduction

Author:Murphy  |  View: 28866  |  Time: 2025-03-23 12:35:31
Image source: pixabay.com

How do streaming platforms like Netflix recommend the perfect movie for your evening? How can we optimize player positioning in sports like soccer? How can we identify fraudulent activities in financial networks? How can we predict a match in dating apps like Tinder?

All these challenges have the same solution: Social Network Analysis.

Social Network Analysis (SNA) is a powerful Data Science field that allows to explore the intricate relationships within a network of entities, discovering hidden patterns and capturing insights. Network entities can be everything: they can be people on a social network like Instagram, websites on the internet, or even sport teams in a tournament network.

In the following sections, I will help you understand what are networks in data science, I will describe the most common types, and I will implement several algorithms to extract valuable information from the networks.

All the theoretical parts will be combined with practical examples. You will find several Python code snippets to familiarize yourself with the NetworkX Python library.

Understanding Networks

A network, also called a graph, is a particular representation of information consisting of two essential components: nodes and edges.

Nodes are the building blocks of a network. They represent the individual entities within a network. They can impersonate a wide variety of things such as people in a social network but also genes in a biological network.

Edges, instead, represent the relationships between the nodes. They can define interactions between two people, associations, or dependencies. Edges may be directed, meaning that the relationship they represent goes from one node to another. Think of a directed edge representing the payment made by a person to another, it is indeed beneficial to know the direction of the money exchange. This asymmetry is represented with arrows.

In NetworkX, nodes and edges are easily represented as follows:

import networkx as nx

# Create an instance of a graph
G = nx.Graph()

# Create an edge between two nodes. The nodes will be also created.
G.add_edge('A', 'B', weight=6)

The result of the above "Hello world" code is this little and naive network:

Image by the author.

However, in a similar fashion, the NetwrokX library can easily create more complex networks such as the one displayed below:

Image by the author. Data source.

Types of Networks

In modeling a real-world network, we must consider that there exist several structures of them. I will briefly introduce the most common networks and show how easy it is to model them with NetworkX.

Directed and Undirected Networks

Directed networks have edges with a defined direction. This indicates an asymetry in the relation between two nodes. For example, in an email network, the direction of the edge can indicate which node sends the email and which one receives it.

On the other hand, undirected networks neglect any direction of the arcs, implying symmetric relations between the nodes. An example could be a friendship network, where usually the relationship is mutual, or a sport tournament network.

In NetworkX the code used to instantiate respectively directed and undirected networks is DiGraph() or Graph(). I randomly generate synthetic networks with the Numpy library:

# Synthetic Directed Graph
Gd = nx.DiGraph()
for i in range(30):
    node_1 = np.random.randint(10)
    node_2 = np.random.randint(10)
    if node_1 != node_2:
        Gd.add_edge(node_1, node_2)   

# Synthetic Undirected Graph
G = nx.Graph()
for i in range(30):
    node_1 = np.random.randint(10)
    node_2 = np.random.randint(10)
    if node_1 != node_2:
        G.add_edge(node_1, node_2)    

You can explore the graphs with some beautiful interactive visualizations:

Synthetic Directed Graph. Image by the author.
Synthetic Undirected Graph. Image by the author.

Weighted Networks

In some cases, the relationships between entities are not equal. We can measure the "strength" of a relationship by adding a weight to each edge. The weight reflects the significance of the interaction between a couple of nodes.

Consider, as an instance, a flight network. Each airport represents a node and edges represent the flights between them. It is convenient to assign a weight to each edge to model information such as the flight duration, distance, or ticket price.

To create a simple weighted network in NetworkX, we need to initialize an empty nx.Graph()object. The second step is to add edges with the add_edge() method and include the weight attribute.

# Initialize the graph
G = nx.Graph()

# Add weighted edges
G.add_edge("A", "B", weight=0.6)
G.add_edge("A", "C", weight=0.2)
G.add_edge("C", "D", weight=0.1)
G.add_edge("C", "E", weight=0.7)
G.add_edge("C", "F", weight=0.9)
G.add_edge("A", "D", weight=0.3)

Finally, we can visualize the weighted network.

Synthetic Weighted Graph. Image by the author.

One variety of weighted networks are signed networks. They take into account the positivity or negativity of each relationship by assigning a sign (+ or -) to each edge.

One example of a signed network is an online product reviews network. Such a network represents products and customers as nodes and customer reviews are edges. Edges can have a positive sign (+) or a negative one (-) reflecting respectively a positive or a negative review.

In NetworkX signed graphs are created as easily as weighted graphs:

# Initialize the graph
G = nx.Graph()

# Add weighted edges
G.add_edge("A", "B", sign='+')
G.add_edge("A", "C", sign='-')
G.add_edge("C", "D", sign='-')
G.add_edge("C", "E", sign='+')
G.add_edge("C", "F", sign='-')
G.add_edge("A", "D", sign='+')

Extracting Information from Networks

Once we understand the different forms networks can assume, we need to know how to extract meaningful insights from them. We may want to find which are the key nodes or edges, and uncover hidden patterns. Fortunately, several algorithms have been developed to answer these questions.

PageRank Algorithm

PageRank is a famous algorithm that measures the importance of nodes within a network. It was originally developed by Larry Page and Sergey Brin at Google and it achieved tremendous results in ranking web pages for their search engine.

The core idea of the PageRank algorithm is that important nodes are the ones that have many incoming edges from other network's nodes, especially from other important nodes.

The step followed by the PageRank algorithm are straightforward:

  1. Assign an initial PageRank score to each node, equal to 1/N (where N is the total number of nodes in the network).
  2. Apply the basic PageRank update rule. It states that each node provides to his neighbors nodes an equal fraction of its PageRank score.
  3. The new PageRank score of each node is updated as the sum of the values it receive from its neighbors.

These steps are repeated until the PageRank scores converge to stable values.

If we take the following 25 nodes network, which I synthetically generated, it is extremely complex to state which is the most influential.

Synthetic Network generated with the PAM algorithm. Image by the author.

We may have some clues, but the PageRank algorithm makes our lives easier by assigning a mathematical value to each node's importance. Fortunately, NetworkX has an in-built PageRank method.

# Generate an artificial network of 25 nodes
G = nx.barabasi_albert_graph(25,4, seed=42)

# Apply the PageRank algorithm and store the scores in a pd dataframe
pagerank_results = nx.pagerank(G, alpha=0.85, max_iter=100, tol=1e-06)
pagerank_results = pd.Series(pagerank_results).sort_values(ascending=False)

# Plot the most importan node scores
fig, ax = plt.subplots()
sns.barplot(x=pagerank_results.iloc[:10].values, y=pagerank_results.iloc[:10].index.astype(str), orient='h', alpha=0.75)
ax.set_xlabel('PageRank Score')
ax.set_ylabel('Node')
ax.spines['top'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
for i in ax.containers:
    ax.bar_label(i,fmt='%.2f')

We can clearly see from the results that nodes 0, 7, and 6 have the highest PageRank score.

Image by the author.

HITS Algorithm

The HITS algorithm aims to identify two types of important nodes in a network: hubs and authorities.

  • Hubs are the nodes that point to many other nodes.
  • Authorities are the nodes that are pointed to by many hubs.

Differently than PageRank, with HITS we can not only measure the importance of each node in the network but also extrapolate insights about its role in the graph.

The steps followed by the HITS algorithm are:

  1. Assign to each node in the network an Authority score and an Hub score equal to 1.
  2. Apply the Authority Update Rule to each node. The Authority score of a node is the sum of the Hub score of the nodes pointing to it.
  3. Apply the Hub Update Rule to each node. The Hub score of a node is the sum of the Authority score of the nodes it is pointing at.

After each iteration, we update the Authority and Hub scores. After a large number of iterations both the scores should converge.

Thanks to the NetworkX library, we can compare the HITS results with the ones provided by NetworkX.

# Convert the previously create graph to directed
G_dir = G.to_directed()
for a, b in list(G.edges()):
    G_dir.remove_edge(a, b)

# Compute HITS scores
hits_results = nx.hits(G_dir, max_iter=100, tol=1e-06)
authority_results = pd.Series(hits_results[1]).sort_values(ascending=False)
hubs_results = pd.Series(hits_results[0]).sort_values(ascending=False)

# Plot the results
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2)
fig.set_figheight(8)
fig.set_figwidth(16)
sns.barplot(x=authority_results.iloc[:10].values, y=authority_results.iloc[:10].index.astype(str), orient='h', alpha=0.75, ax=ax1)
ax1.set_xlabel('Authority Score')
ax1.set_ylabel('Node')
ax1.spines['top'].set_visible(False)
ax1.spines['bottom'].set_visible(False)
ax1.spines['right'].set_visible(False)
ax1.spines['left'].set_visible(False)
for i in ax1.containers:
    ax1.bar_label(i,fmt='%.2f')

sns.barplot(x=hubs_results.iloc[:10].values, y=hubs_results.iloc[:10].index.astype(str), orient='h', alpha=0.75, ax=ax2)
ax2.set_xlabel('Hub Score')
ax2.set_ylabel('Node')
ax2.spines['top'].set_visible(False)
ax2.spines['bottom'].set_visible(False)
ax2.spines['right'].set_visible(False)
ax2.spines['left'].set_visible(False)
for i in ax2.containers:
    ax2.bar_label(i,fmt='%.2f')

We clearly see how the nodes with the higher Authority score are among the ones with the higher PageRank score. Node 0 ranks first in both the two metrics, and nodes 7, 6, and 5 are among the top ones.

Image by the author.

What PageRank misses is the "Hub" role of nodes 13, 8, 10 and20. They are connected with relevant nodes, so we can affirm they play a significant role in the network.

Conclusion

In this post, we saw how extracting information from networks is made more accessible through specialized libraries like NetworkX. Most metrics and network algorithms can be computed with a single function.

What we saw is only an introduction to social network analysis in Machine Learning, topics that should be part of every data scientist toolkit. While we've touched upon some fundamental concepts, the field extends far beyond these introductory insights.

I recommend digging into the resources and references attached to this article.


If you liked this story, consider following me to be notified of my upcoming projects and articles!

Here are some of my past projects:

Ensemble Learning with Scikit-Learn: A Friendly Introduction

Use Deep Learning to Generate Fantasy Names: Build a Language Model from Scratch

References

Tags: Artificial Intelligence Data Science Getting Started Machine Learning Python

Comment