Maximizing Python Code Efficiency: Strategies to Overcome Common Performance Hurdles

Author: Murphy
Photo by Kevin Canlas on Unsplash

Overview

In this exploration of Python code optimization, we look at common issues that impede performance and introduce avoidable overheads. We analyze two issues here – one related to nested loops, and the other related to memory/allocation issues caused by reading huge datasets.

With the nested loop issue, we look at an example use case to understand the nested loop dilemma and then move on to a solution that serves as an alternative to circumvent performance issues caused by nested loops.

With the memory/allocation issues encountered with large datasets, we explore multiple data reading strategies and compare the performance of each. Let's explore further.

Performance Issue-1: The Nested Loop Dilemma

Nested Loop Dilemma – Photo by 愚木混株 cdd20 on Unsplash

While nested loops are a common programming construct, an inefficient implementation can lead to suboptimal performance. A notable challenge you might encounter with nested loops is the ‘kernel keeps running' issue: inefficiently implemented nested loops lead to execution times so long that the program appears to hang, as if stuck in an infinite loop. Nested loops are easy to implement, but optimizing for performance sometimes requires sacrificing the simplicity of nested structures. Nested loops increase algorithmic complexity (often to quadratic time), leading to longer execution times, especially when dealing with large datasets. It's important to note that nested loops are not inherently "bad," but understanding their implications and considering alternative approaches – ideally making effective use of Python's features and libraries – leads to more efficient Python code.

A Use Case to discuss Nested Loops

We have two files where a few records are duplicates of one another. There is an identifier column in both files but the ID values are in a different format. Therefore, if we were to compare the entire row of one file with the other, it would not be flagged as a duplicate. Hence, the only way to determine duplicates would be to compare specific columns other than the identifier column.

As an example, let's assume that the files hold the following values –

File1.csv

ID,Name,Value
A123,John,100
B456,Jane,200
C789,Bob,150

File2.csv

ID,Name,Value
A-123,Jane,200
B_456,Bob,150
C-789,John,100
D-839,Sarah,180
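
If you want to reproduce the example locally, here is a minimal sketch (not part of the original article) that writes these two sample files with pandas:

import pandas as pd

# Write the two sample files used in this example
pd.DataFrame({'ID': ['A123', 'B456', 'C789'],
              'Name': ['John', 'Jane', 'Bob'],
              'Value': [100, 200, 150]}).to_csv('File1.csv', index=False)

pd.DataFrame({'ID': ['A-123', 'B_456', 'C-789', 'D-839'],
              'Name': ['Jane', 'Bob', 'John', 'Sarah'],
              'Value': [200, 150, 100, 180]}).to_csv('File2.csv', index=False)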

Let's look at an implementation using nested loops. We load the files into two dataframes (df_primary and df_secondary) and then iterate through both dataframes using two loops in a nested fashion.

import pandas as pd

# Load both files into dataframes
df_primary = pd.read_csv('File1.csv')
df_secondary = pd.read_csv('File2.csv')

df_primary['Flag'] = ''

# Iterate over rows in the primary dataframe
for idx_primary, row_primary in df_primary.iterrows():
    # Iterate over rows in the secondary dataframe
    for idx_secondary, row_secondary in df_secondary.iterrows():
        # Check for matching conditions across the comparison columns
        # ('Column1'..'Column4' are placeholders for the columns being compared,
        # i.e. the columns other than the identifier column)
        if ((row_primary['Column1'] == row_secondary['Column1']) and
                (row_primary['Column2'] == row_secondary['Column2']) and
                (row_primary['Column3'] == row_secondary['Column3']) and
                (row_primary['Column4'] == row_secondary['Column4'])):
            # Update the 'Flag' column in the primary dataframe for the current row
            df_primary.at[idx_primary, 'Flag'] = 'Y'

In this case, ‘iterrows' has been used along with nested loops. iterrows materializes each row as a pandas Series, which does not preserve dtypes and therefore incurs implicit type conversions, and the nested iteration compares every row of one dataframe against every row of the other, so the cost grows quadratically with the number of rows.

Alternative to Nested Loops – pd.merge

For better performance, it is recommended to use the ‘merge' function within Pandas. pd.merge is a function in the pandas library used for merging two dataframe objects based on a common column or index. It is a high-level function that provides a convenient way to perform database-style joins. The merging process involves matching and combining data based on specified keys. Moreover, pandas uses vectorized operations to handle this task efficiently.

merged_df = pd.merge(df_primary, df_secondary, on=['Column1', 'Column2', 'Column3', 'Column4'], how='outer', indicator=True)

The indicator parameter in the pd.merge function adds a special column named ‘_merge' to the resulting dataframe that indicates the source of each row. In the resulting dataframe (merged_df), the ‘_merge' column shows whether each row came from the left dataframe only (left_only), the right dataframe only (right_only), or from both dataframes (both). This is particularly useful when we want to track which dataframe contributed a particular row during the merging process. In this case, the rows with the value ‘both' in the ‘_merge' column are the duplicates, and each set of duplicates gets combined into a single row in merged_df. As a result, merged_df holds only unique rows. This solution is not only syntactically simpler but also reduces performance overheads and eliminates the need for explicit iteration.
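
If the goal is to flag the duplicate rows directly on df_primary (mirroring the nested-loop version), the same idea can be expressed with a left merge instead of the outer merge shown above. Here is a minimal sketch (an illustration, not code from the original article; ‘Column1' to ‘Column4' remain placeholders for the comparison columns):

import numpy as np
import pandas as pd

key_cols = ['Column1', 'Column2', 'Column3', 'Column4']

# Left-merge so every row of df_primary is kept; '_merge' then indicates
# whether a matching row exists in df_secondary
flagged = pd.merge(df_primary,
                   df_secondary[key_cols].drop_duplicates(),
                   on=key_cols, how='left', indicator=True)

# '_merge' == 'both' means the row also exists in the secondary dataframe
df_primary['Flag'] = np.where(flagged['_merge'] == 'both', 'Y', '')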

Performance-Issue 2: Memory/Allocation Issues

While trying to load a large file, you may have encountered an error like the following – MemoryError: Unable to allocate 512. KiB for an array with shape (65536,) and data type int64. This indicates that the program has run out of memory while performing memory-intensive operations such as loading huge datasets. It happens especially when trying to read a large dataset on a single machine (using libraries such as Pandas' read_csv).

As a pre-requisite, please download the dataset ‘data_use_article_database' from https://datacatalog.worldbank.org/search/dataset/0065200/Data-Use-in-Academia-Dataset. (This dataset is licensed under Creative Commons Attribution 4.0-https://datacatalog.worldbank.org/public-licenses).

The file downloads as a ‘.csv' file of size 1.9 GB. Note that we try loading this dataset into a dataframe in Python on a computer with 8 GB of RAM. The purpose of this exercise is solely to load a large dataset on a computer with minimal resources, and then to determine the most effective strategies for avoiding the memory error mentioned above.

Strategies to overcome memory/allocation issues

Basic Chunking

One strategy is to read the file in chunks, since it is the attempt to load the entire file at once that drives the machine out of memory. When the chunksize parameter is specified in the pandas read_csv function, the result (‘chunks') is an instance of pandas.io.parsers.readers.TextFileReader. We then read one chunk at a time by iterating through the ‘chunks' variable using a ‘for' loop. In this case, the variable ‘chunk' is a temporary dataframe which is concatenated into the final dataframe ‘result_df'.

import os
import pandas as pd
csv_file_path = 'results_completed_updated_20231003.csv'
chunk_size = 10000
chunks = pd.read_csv(csv_file_path,chunksize=chunk_size)
result_df = pd.DataFrame()
# Process each chunk and concatenate it onto the result
for chunk in chunks:
    result_df = pd.concat([result_df, chunk], ignore_index=True)

Note that the chunk size can be determined based on the amount of memory you can allocate to this process of reading a file.

The efficiency of the basic chunking strategy shown above depends on various factors, including the size of your dataset, the available system resources, and the specific operations being performed on each chunk.
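
As a rough way to calibrate chunk_size, you can read a small sample of the file and measure its in-memory footprint (a minimal sketch, not part of the original article; the budget of roughly 100 MB per chunk is an arbitrary example):

import pandas as pd

csv_file_path = 'results_completed_updated_20231003.csv'

# Read a small sample and measure how much memory it occupies
sample = pd.read_csv(csv_file_path, nrows=10_000)
sample_mb = sample.memory_usage(deep=True).sum() / 1e6
print(f"~{sample_mb:.1f} MB per 10,000 rows")

# Scale the chunk size to the memory you are willing to spend per chunk
chunk_size = int(10_000 * (100 / sample_mb))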

Improved Chunking

There are a couple of considerations and potential improvements that can be done to the basic chunking strategy discussed above.

The pd.concat operation may become inefficient as the size of result_df grows, especially when concatenating large dataframes in each iteration. This is because it creates a new dataframe in each iteration, and copying data can be expensive.

Instead of concatenating dataframes in each iteration, consider storing the chunks in a list and concatenating them once outside the loop. With Basic Chunking, each pd.concat call must hold the existing result_df, the new chunk, and the newly created combined dataframe in memory at the same time, whereas with Improved Chunking we only append references to a list and concatenate once at the end.

import pandas as pd

chunks_list = []
csv_file_path = 'results_completed_updated_20231003.csv'
chunk_size = 10000
chunks = pd.read_csv(csv_file_path, chunksize=chunk_size)

# Collect each chunk in a list instead of concatenating inside the loop
for chunk in chunks:
    chunks_list.append(chunk)

# Concatenate all chunks once, outside the loop
result_df = pd.concat(chunks_list, ignore_index=True)

Parallel Processing Strategy

Another option is to use a parallel processing strategy. If you like a Pandas-like interface, consider using the library ‘Dask'. Dask provides parallel and distributed computing capabilities for larger-than-memory datasets.

Dask splits the ‘.csv' file into multiple chunks where each chunk is processed independently on different workers. Each worker reads its assigned chunk concurrently, leading to parallel reading.

A worker is a separate process or thread that performs computations. On a single machine, workers are multiple threads or processes used to parallelize computations. On a cluster, Dask distributes computations across multiple nodes: each machine in the cluster becomes a worker, and Dask coordinates the execution of tasks across these distributed workers.
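
For instance, on a single machine you can control how many workers Dask uses by starting a local client explicitly (a minimal sketch, assuming the dask.distributed package is installed; the worker counts are illustrative):

from dask.distributed import Client
import dask.dataframe as dd

# Start a local "cluster" on this machine: 4 worker processes with 2 threads each
client = Client(n_workers=4, threads_per_worker=2)

# Any subsequent Dask computation is scheduled across these local workers
ddf = dd.read_csv('results_completed_updated_20231003.csv', assume_missing=True)
result_df = ddf.compute()

client.close()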

Here is a basic diagram explaining operations using Dask.

Fig. 1. Basic Diagram explaining ‘Dask' (Image by Author)

Let's try reading the dataset at hand using Dask.

import dask.dataframe as dd
csv_file_path = 'results_completed_updated_20231003.csv'
ddf = dd.read_csv(csv_file_path, assume_missing=True,
                  dtype={'titleId': 'object', 'ordering': 'object', 'title': 'object',
                         'region': 'object', 'language': 'object', 'types': 'object',
                         'attributes': 'object', 'isOriginalTitle': 'object'})
result_df = ddf.compute()

In this case, we read the ‘.csv' file using the read_csv function of the ‘dask' library. When the parameter assume_missing is set to True, Dask assumes that integer columns whose dtype is not explicitly specified may contain missing values, and reads them as floats. With the value set to False (the default), Dask makes no such assumption: unspecified integer columns are kept as integers, which can cause errors if missing values turn up later in the file.

Moreover, providing explicit data types (dtype) is optional, as Dask is designed to infer data types automatically during the reading process. However, there are situations where you might want to specify data types explicitly using the dtype parameter in dd.read_csv. Below are some reasons, followed by a short sketch –

Improved Performance: Explicitly specifying data types can lead to better performance. Dask's automatic type inference might incur some overhead, especially for large datasets. Providing explicit data types allows Dask to skip the inference step.

Memory Efficiency: When Dask reads a CSV file, it needs to scan a sample of the data to infer the data types. For large datasets, this process can consume a significant amount of memory. Specifying data types can help with memory efficiency.

Avoiding Type Inference Errors: In some cases, Dask's automatic type inference might not correctly deduce the intended data type, especially for columns with mixed data. Specifying data types explicitly helps avoid potential errors in type inference.

Customizing Data Types: If you have specific requirements for the data types of certain columns, you can use the dtype parameter to customize them. For example, you might want to treat a column as a string even if it contains numeric values. In such cases, it makes sense to explicitly specify datatypes.
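
As an example of the last point, here is a minimal sketch (the file name and the column name ‘account_id' are hypothetical, used purely for illustration):

import dask.dataframe as dd

# Force a numeric-looking identifier column to be read as a string,
# so that formatting such as leading zeros is preserved
ddf = dd.read_csv('some_file.csv', dtype={'account_id': 'object'})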

Let's move on to the next data reading strategy.

Columnar Data Reading & Parallel Processing Strategy (Hybrid)

This strategy involves reading data in a columnar format, which is the primary focus of PyArrow. Columnar data representation is more efficient for certain analytical and processing tasks than row-based formats. In this use case, we use Dask purely to read the large data file into a Dask dataframe. We then load the Dask dataframe into a PyArrow table that can further be used for complex analytical tasks.

import dask.dataframe as dd
import pyarrow as pa
import pyarrow.csv as csv
import time

#Specify the CSV file path
csv_file_path = 'results_completed_updated_20231003.csv'

start_time = time.time()

#Read the file into a Dask DataFrame
ddf = dd.read_csv(csv_file_path, assume_missing=True,
                  dtype={'titleId': 'object', 'ordering': 'object', 'title': 'object',
                         'region': 'object', 'language': 'object', 'types': 'object',
                         'attributes': 'object', 'isOriginalTitle': 'object'})

#Convert Dask DataFrame to PyArrow Table
table = pa.Table.from_pandas(ddf.compute(), preserve_index=False)

end_time = time.time()

#Calculate time taken in seconds
time_taken_seconds = end_time - start_time

#Convert time to minutes
time_taken_minutes = time_taken_seconds / 60

#Display the PyArrow Table
print(table)

print(f"Time taken: {time_taken_minutes} minutes")

Note that in the code above, the CSV reading itself is done by Dask; PyArrow only receives the already-loaded data. PyArrow can also read a CSV directly via pyarrow.csv.read_csv. In that case, use_threads=True in its ReadOptions configures PyArrow to use multiple threads for parallelized reading, which can speed up reading from disk, and strings_can_be_null=True in its ConvertOptions allows string columns to contain null values (designated strings are treated as null rather than as regular values), as sketched below.
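
For reference, here is a minimal sketch of that direct PyArrow read, based on the options mentioned above (an illustration, not the code used in the benchmark below):

import pyarrow.csv as csv

csv_file_path = 'results_completed_updated_20231003.csv'

# Read the CSV directly with PyArrow into a columnar (Arrow) table
table = csv.read_csv(
    csv_file_path,
    read_options=csv.ReadOptions(use_threads=True),               # parallelized reading
    convert_options=csv.ConvertOptions(strings_can_be_null=True)  # allow nulls in string columns
)

print(table.num_rows)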

Evaluating Performance of the strategies

Let's compare the time taken by the Improved Chunking, the Parallel Processing Strategy and the hybrid strategy involving columnar data reading & parallel processing. We leave out the Basic Chunking strategy as it is quite time-consuming.

import time
import pandas as pd
chunks_list = []
csv_file_path = 'results_completed_updated_20231003.csv'
chunk_size = 10000
start_time = time.time()
chunks = pd.read_csv(csv_file_path, chunksize=chunk_size)
result_df = pd.DataFrame()

for chunk in chunks:
   #Process each chunk
   chunks_list.append(chunk)

result_df = pd.concat(chunks_list, ignore_index=True)

end_time = time.time()

#Calculate time taken in seconds
time_taken_seconds = end_time - start_time

#Convert time to minutes
time_taken_minutes = time_taken_seconds / 60
time_taken_minutes

With the Improved Chunking Strategy, the time taken was between 2.15 and 2.38 minutes.

Python">import dask.dataframe as dd
csv_file_path = 'results_completed_updated_20231003.csv'

start_time = time.time()

ddf = dd.read_csv(csv_file_path, assume_missing=True,
                  dtype={'titleId': 'object', 'ordering': 'object', 'title': 'object',
                         'region': 'object', 'language': 'object', 'types': 'object',
                         'attributes': 'object', 'isOriginalTitle': 'object'})
result_df = ddf.compute()
end_time = time.time()

# Calculate time taken in seconds
time_taken_seconds = end_time - start_time

# Convert time to minutes
time_taken_minutes = time_taken_seconds / 60
time_taken_minutes

With the Parallel Processing Strategy, the time taken was between 0.9 and 1.2 minutes across multiple runs.

import dask.dataframe as dd
import pyarrow as pa
import pyarrow.csv as csv
import time

csv_file_path = 'results_completed_updated_20231003.csv'

start_time = time.time()

#Read the CSV file into a Dask DataFrame
ddf = dd.read_csv(csv_file_path, assume_missing=True,
                  dtype={'titleId': 'object', 'ordering': 'object', 'title': 'object',
                         'region': 'object', 'language': 'object', 'types': 'object',
                         'attributes': 'object', 'isOriginalTitle': 'object'})

#Convert Dask DataFrame to PyArrow Table
table = pa.Table.from_pandas(ddf.compute(), preserve_index=False)

end_time = time.time()

#Calculate time taken in seconds
time_taken_seconds = end_time - start_time

#Convert time to minutes
time_taken_minutes = time_taken_seconds / 60

#Display the PyArrow Table
print(table)

print(f"Time taken: {time_taken_minutes} minutes")

With the Columnar Data Reading & Parallel Processing Strategy, the time taken was between 1.5 and 2.5 minutes for multiple runs.

Considerations & Concluding Remarks

In conclusion, addressing the dual challenges of the Nested Loop Dilemma and Memory/Allocation Issues in data processing is paramount for achieving optimal performance. For Performance Issue-1, a detailed use case was presented and a solution based on pd.merge was recommended. Leveraging the power of Pandas and its efficient ‘merge' function can significantly improve computation speed and streamline complex data operations.

With Performance Issue-2, the focus shifted towards tackling Memory/Allocation Issues, a common hurdle in handling large datasets. Various chunking strategies were explored as effective remedies to overcome memory limitations. Employing methods such as reading data in smaller, manageable chunks and leveraging parallel processing techniques can substantially improve memory efficiency and enable the handling of extensive datasets without compromising performance.

If your primary concern is working with large datasets that don't fit into memory, Dask alone would be sufficient. If you have specific tasks that benefit from PyArrow's columnar data representation (e.g., analytical operations), combining Dask with PyArrow would be beneficial (Columnar Data Reading & Parallel Processing Strategy).

By addressing these performance issues through targeted solutions, the hope is that data scientists and analysts can ensure that their workflows are optimized for improved speed, scalability, and resource utilization.

Tags: Dask, Parallel Processing, Python, Python Optimization, Tips And Tricks
