Comparing Performance of Big Data File Formats: A Practical Guide


The big data world is full of storage systems, and the file formats behind them shape how those systems work. These formats are key to nearly every data pipeline, enabling efficient storage as well as easier querying and information extraction. They are designed to handle the core challenges of big data: size, speed, and structure.

Data engineers often face a plethora of choices. It's crucial to know which file format fits which scenario. This tutorial is designed to help with exactly that. You'll explore four widely used file formats: Parquet, ORC, Avro, and Delta Lake.

The tutorial starts with setting up the environment for these file formats. Then you'll learn to read and write data in each format. You'll also compare their performance while handling 10 million records. And finally, you'll understand the appropriate scenarios for each. So let's get started!

Table of contents

  1. Environment setup
  2. Working with Parquet
  3. Working with ORC
  4. Working with Avro
  5. Working with Delta Lake
  6. When to use which file format?

Environment setup

In this guide, we're going to use JupyterLab with Docker and MinIO. Think of Docker as a handy tool that simplifies running applications, and MinIO as a flexible storage solution perfect for handling lots of different types of data. Here's how we'll set things up:

I'm not diving deep into every step here since there's already a great tutorial for that. I suggest checking it out first, then coming back to continue with this one.

Seamless Data Analytics Workflow: From Dockerized JupyterLab and MinIO to Insights with Spark SQL

Once everything's ready, we'll start by preparing our sample data. Open a new Jupyter notebook to begin.

First up, we need to install the s3fs package, which is essential for working with MinIO from Python.

!pip install s3fs

Following that, we'll import the necessary dependencies and modules.

import os
import s3fs
import pyspark
from pyspark.sql import SparkSession
from pyspark import SparkContext
import pyspark.sql.functions as F
from pyspark.sql import Row
import pyspark.sql.types as T
import datetime
import time

We'll also set some environment variables that will be useful when interacting with MinIO.

# Define environment variables
os.environ["MINIO_KEY"] = "minio"
os.environ["MINIO_SECRET"] = "minio123"
os.environ["MINIO_ENDPOINT"] = "http://minio1:9000"

Then, we'll set up our Spark session with the necessary settings.

# Create Spark session
spark = SparkSession.builder \
    .appName("big_data_file_formats") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.11.1026,org.apache.spark:spark-avro_2.12:3.5.0,io.delta:delta-spark_2.12:3.0.0") \
    .config("spark.hadoop.fs.s3a.endpoint", os.environ["MINIO_ENDPOINT"]) \
    .config("spark.hadoop.fs.s3a.access.key", os.environ["MINIO_KEY"]) \
    .config("spark.hadoop.fs.s3a.secret.key", os.environ["MINIO_SECRET"]) \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .enableHiveSupport() \
    .getOrCreate()

Let's break this configuration down to understand it better.

  • spark.jars.packages: Downloads the required JAR files from the Maven repository. A Maven repository is a central place used for storing build artifacts like JAR files, libraries, and other dependencies that are used in Maven-based projects.
  • spark.hadoop.fs.s3a.endpoint: This is the endpoint URL for MinIO.
  • spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key: These are the access key and secret key for MinIO. Note that they are the same as the username and password used to access the MinIO web interface.
  • spark.hadoop.fs.s3a.path.style.access: Set to true to enable path-style access for the MinIO bucket.
  • spark.hadoop.fs.s3a.impl: The implementation class for the S3A file system.
  • spark.sql.extensions: Registers Delta Lake's SQL commands and configurations within the Spark SQL parser.
  • spark.sql.catalog.spark_catalog: Sets the Spark catalog to Delta Lake's catalog, allowing table management and metadata operations to be handled by Delta Lake.
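
With the session up, a quick smoke test can confirm that Spark can write to and read from MinIO through the s3a connector. This is just a sketch: the bucket name "warehouse" and the _smoke_test path are assumptions on my part, so point them at a bucket that actually exists in your MinIO instance.

# Optional smoke test: write and read a tiny DataFrame through MinIO.
# "s3a://warehouse/_smoke_test" is an example path -- adjust it to a bucket
# that exists in your MinIO instance.
test_path = "s3a://warehouse/_smoke_test"

spark.range(5).write.mode("overwrite").parquet(test_path)
print(spark.read.parquet(test_path).count())  # expected output: 5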

Choosing the right JAR versions is crucial to avoid errors. If you're using the same Docker image, the JAR versions mentioned here should work fine. If you run into setup issues, feel free to leave a comment and I'll do my best to assist you.
