On the Importance of Compressing Big Data

Author: Murphy
Photo by Joshua Sortino on Unsplash

"Data is the new oil", a phrase attributed to Clive Humby, describes the growing reliance of many modern companies on data for their development and success. Companies are collecting massive amounts of data, to the point where units of measurement such as the petabyte, exabyte, and zettabyte, have replaced the megabyte, gigabyte, and terabyte in common discourse. However, mindless collection of data is useless and wasteful. This fact is best summarized by the following expansion of Humby's quote:

Data is just like crude. It's valuable, but if unrefined it cannot really be used.

For data to be valuable, the manner in which it is collected, the goals of its use, and how those goals are achieved, need to be carefully designed. One important element of this design is how and where the collected data will be stored. Given the enormous scale of the data being collected today, dedicated solutions are required for its storage. To meet the growing demand, data center storage facilities have increased in size, and cloud based object storage services such as Amazon S3, Google Cloud Storage, and Azure Blob Storage, have increased in popularity.

The Cost of Data Storage

Data storage comes with a number of costs, some more obvious than others. If not managed properly, storage costs can easily become a dominant factor in your monthly R&D expenses. Here are some of the cost considerations.

Direct Storage Costs: While the cost per unit of storage space has been on a steady decline over the years (due to technological advancement), this decline has been more than eclipsed by the increase in demand. Regardless of whether you are using a local data center or a cloud storage service, direct storage costs can climb quickly.

Environmental Costs: Over the past few years we have witnessed an increase in the awareness of the carbon footprint of data centers and calls for increasing sustainability in computational sciences (e.g., see [here](https://www.osti.gov/servlets/purl/1372902) and [here](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009324)). Although the most significant portion of the carbon footprint is attributed to the compute-heavy data center servers, storage also requires significant investments in power, cooling, and lifecycle replacement. It is estimated that roughly 10% of the data center carbon footprint – equaling the footprint of a medium-sized western country – comes from data storage (e.g., see here and here).

Data Streaming Costs: Although not a direct storage cost, attention must be given to the costs associated with pushing and pulling data to and from storage. These could be explicit data transfer costs associated with a cloud storage service, or costs related to designing an infrastructure that supports the communication bandwidths required by the data applications.

The Impact on the Data Application: It is also important to note the impact of how the data is stored, and particularly its size, on the data application. Many data applications rely on the continuous flow of data from storage. In an ideal situation, the data will flow through the system fast enough to keep all of the computation resources of the application host fully utilized. However, if your storage solution is not appropriately designed, your application might remain idle while it waits for input data. This will increase the overall time it takes to run your data applications and also increase the computation costs. We describe this scenario in greater detail in an appendix to this post.

Reducing the Costs of Data Storage

Our data collection and storage strategy must account for the potentially high costs of data storage. Some good practices include the following:

  1. Limit the collection and storage of data to what is actually required.
  2. Remove data that is no longer required from storage.
  3. Many cloud service providers offer multiple storage classes to address different types of data access patterns (e.g., see [Amazon S3](https://aws.amazon.com/s3/) Intelligent-Tiering, Google Storage Classes, and Azure Storage Access Tiers). While the standard storage offerings (e.g., Amazon S3, Google Cloud Storage, and Azure Blob Storage) are recommended for data that is accessed frequently or requires low latency, assigning other data to storage classes that are more archival in nature can reduce cost.
  4. Store data in a compressed format.

In this post we will focus on data compression as a means to reduce storage costs. We will discuss a few different compression techniques and demonstrate, by example, the potential savings of applying them.

Disclaimers

Just before we dive in, a few disclaimers are in order:

  • In our discussion we will mention several compression techniques and tools. These are provided merely as examples. Our intention is not to promote these over any one of the many other alternative techniques or tools. The best solution for you will greatly depend on your specific needs.
  • Before you settle on a compression algorithm (or any published algorithm, for that matter) make sure that you read and understand the associated terms of use.
  • While the focus of our example will be on image data, the general principles apply to other domains as well.

Data Compression

In this section we will review a few compression techniques and demonstrate them through example. The primary insight that we will reach is that the best results are obtained by adapting the compression algorithm to the specific properties of the data at hand. A prime example of this can be found in the distinction between lossy and lossless compression schemes.

Lossy vs. Lossless Compression

In a lossless compression scheme no data is lost. When uncompressing the data, all of the information is restored: X=uncompress(compress(X)). In a lossy compression scheme some of the information is lost as a result of compression: X≠uncompress(compress(X)). While this loss of data may seem concerning at first, there are many scenarios in which it will have little to no effect. A classic example of this is image compression. Many image compression methods, such as the popular JPEG compression, involve a certain degree of data loss but (provided that a sensible compression configuration is used) will result in an image that is barely distinguishable from the original to the human eye. Despite the visual similarity, some algorithms may be sensitive precisely to this loss of data. More often than not, however, it will have no impact, and the potential savings in storage space can be considerable. We will expand a bit more on the topic of image compression below.
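The distinction is easy to demonstrate in code. The sketch below (using Pillow and NumPy, with a random RGB array standing in for a real photo) round-trips the same array through a lossless codec (PNG) and a lossy one (JPEG):

```python
import io

import numpy as np
from PIL import Image

rng = np.random.default_rng(0)
# a random RGB array stands in for a real photo
x = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

def roundtrip(arr, fmt, **save_kwargs):
    """Encode the array to an in-memory file and decode it back."""
    buf = io.BytesIO()
    Image.fromarray(arr).save(buf, format=fmt, **save_kwargs)
    buf.seek(0)
    return np.array(Image.open(buf))

# lossless: PNG restores the input bit-for-bit
print(np.array_equal(x, roundtrip(x, "PNG")))               # True
# lossy: JPEG does not, even at a high quality setting
print(np.array_equal(x, roundtrip(x, "JPEG", quality=95)))  # False
```

Note that real photos tend to compress much better than random noise; the noise is used here only because it makes the lossy/lossless distinction unambiguous.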

How to Measure Compression Quality

There are a number of ways in which to evaluate the quality of a compression scheme, including:

  1. Compression Ratio: This measures the reduction in the size of data as a result of compression.
  2. Loss of Information: Only relevant in the case of lossy compression, this measures the degree to which the loss of information resulting from the compression impacts the quality of our data. There are many different ways of measuring this loss of quality, depending on the type of data, the domain, what the data will be used for, and more.
  3. Overhead of Compression: Using a compression scheme implies the need to compress and uncompress the data at different stages of the pipeline. Both activities require a certain amount of compute resources and might imply a certain degree of latency. The precise amounts of required compute and latency can vary considerably based on the compression strategy you choose.
  4. Infrastructure Dependencies: Different compression schemes will vary based on their infrastructure dependencies. These can be hardware and/or software dependencies.

In the example that follows we will measure just the compression ratio and loss of quality. In practice, additional metrics ought to be applied in order to make a fully informed decision about the compression strategy.
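As a simple illustration of the first and third metrics, the sketch below uses Python's built-in zlib module on a synthetic repetitive payload to measure both the compression ratio and the encoding overhead:

```python
import time
import zlib

# a repetitive text payload stands in for real records
data = b"sensor_reading,42.0,ok\n" * 10_000

start = time.perf_counter()
compressed = zlib.compress(data, level=6)
encode_ms = (time.perf_counter() - start) * 1000

ratio = len(data) / len(compressed)
assert zlib.decompress(compressed) == data  # zlib is lossless
print(f"compression ratio: {ratio:.1f}x, encode time: {encode_ms:.2f} ms")
```

On real, less repetitive data the ratio will be far lower, which is exactly why these metrics should be measured on your own datasets.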

Toy Example

To facilitate our discussion, we will consider a toy example in which each data sample includes: an 800×534 RGB image, two associated pixel-wise classification maps with 16 categories each, and a pixel-wise depth map containing the distance (in meters) from the image plane to the 3D location captured by each pixel in the scene. If you would like to follow along with the example, you can apply the code snippet below to the Unsplash image at the top of this blog post.

from PIL import Image
import numpy as np
np.random.seed(0)

im = Image.open('image.jpeg', mode='r')
image = np.array(im)
H,W,C = image.shape

# create artificial labels from image color channels
label1 = image[:,:,0].astype(np.int32)//16
label2 = image[:,:,1].astype(np.int32)//16
depth = (image[:,:,2]+np.random.normal(size=(H,W))).astype(np.float32)

# write all data sample elements to file
with open('image.bin','wb') as f:
    f.write(image.tobytes())
with open('label1.bin','wb') as f:
    f.write(label1.tobytes())
with open('label2.bin','wb') as f:
    f.write(label2.tobytes())
with open('depth.bin','wb') as f:
    f.write(depth.tobytes())

Measuring the storage footprint of the raw data files (e.g., by running ls -l in Linux) we find that the image requires 1.3 MB of storage, and the two label maps and the depth map require 1.7 MB each.

Choosing a File Format

One important step in designing your data storage strategy is choosing the format in which to store your data. This decision can have a meaningful impact on the ease and speed of access of your data applications. In particular, your design should take into account the different types of access patterns of different applications. In a previous post we discussed some of the potential implications of the choice of file format on ML training. For the sake of simplicity, we will store our data sample in a standard tarball.

import tarfile
with tarfile.open("base.tar", "w") as tar:
    for name in ["image.bin", "label1.bin", "label2.bin", "depth.bin"]:
        tar.add(name)

The resultant file is 6.2 MB.

Compression with ZIP Variants

By "ZIP variants" we refer to a wide variety of highly popular, general-purpose file formats and their associated compression schemes, including ZIP, gzip, 7-zip, [bzip2](https://en.wikipedia.org/wiki/Bzip2), Brotli, and more. Note that while we group these together, the underlying algorithms may differ substantially. Many file formats include flags for auto-compressing data samples using a ZIP variant. For example, using bzip2 compression, we can decrease the size of the tarball to 2.7 MB, a reduction of 2.3X.

import tarfile
with tarfile.open("base.tar.bz2", "w:bz2") as tar:
    for name in ["image.bin", "label1.bin", "label2.bin", "depth.bin"]:
        tar.add(name)

Compression using a ZIP variant is especially compelling due to its versatility. It can be applied generally without requiring any knowledge of the specific types or domains of the underlying data. In the next sections we will see if we can improve this result using a compression scheme that takes into account the details of the raw data types.
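To get a feel for how the variants differ, the sketch below uses Python's standard-library bindings to compare the compression ratios of gzip, bzip2, and LZMA (the algorithm behind 7-zip/xz) on a synthetic 1 MB payload standing in for the tarball:

```python
import bz2
import gzip
import lzma

# a repetitive 1 MB payload stands in for the tarball contents
payload = bytes(range(256)) * 4_000

sizes = {
    "gzip": len(gzip.compress(payload, compresslevel=9)),
    "bzip2": len(bz2.compress(payload, compresslevel=9)),
    "lzma": len(lzma.compress(payload, preset=6)),
}
for name, size in sizes.items():
    print(f"{name}: {len(payload) / size:.1f}x")
```

The relative ranking of the algorithms, as well as their encode/decode speeds, can vary greatly with the data, so this comparison should be repeated on your own samples.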

Using Lower Precision Data Types

A considerable amount of storage space can be saved by converting the data to use lower bit-precision data types. There are two opportunities for such optimization in our example.

  1. Replace int32 with uint8: A lot of storage space can be saved by using the smallest integer type that addresses your needs. In our case, it is clear that a 32-bit integer representation is overkill for representing our 16-class label maps. These can easily fit in uint8 matrices without any loss of information.
  2. Replace float32 with float16: Contrary to the reduction in integer precision, this operation will result in a loss of information (i.e., it is lossy). This change should be made only after assessing the potential impact that it will have on the data algorithms that consume it. In the code block below, we demonstrate two metrics for measuring the change in data quality. These can be used to predict the impact on the data algorithms. Ideally, we would find some form of correlation between our metrics and the performance of our algorithm, but this is not always so simple.
label1 = label1.astype(np.uint8)
label2 = label2.astype(np.uint8)
depth_new = depth.astype(np.float16)

# measure loss of quality
from numpy import linalg as LA
l_max = LA.norm((depth-depth_new.astype(np.float32)).flatten(),np.inf) # 0.12
l_2 = LA.norm((depth-depth_new.astype(np.float32)).flatten(),2) # 10.47

with open('label1.bin','wb') as f:
    f.write(label1.tobytes())
with open('label2.bin','wb') as f:
    f.write(label2.tobytes())
with open('depth.bin','wb') as f:
    f.write(depth_new.tobytes())

These optimizations alone result in a tarball of 2.9 MB, a reduction of 2.14X.

Merging Elements with Bitwise Operations

At this point each classification map is stored in a uint8 buffer. However, since each map contains only 16 classes, it actually uses only four bits. We can further compress the data by combining the two maps into a single one.

# compress
combined_label = (label2 * 16 + label1).astype(np.uint8)

# restore
label1 = combined_label % 16
label2 = combined_label // 16

The resultant tarball is 2.5 MB, for an overall reduction of 2.48X.

Note that we could have also considered combining pairs of adjacent elements within each individual label map, thus decreasing their resolution to 400×534. In practice, we have found that combining together separate label maps lends itself to better compression later in the pipeline (as discussed in the next section). To borrow terminology from the field of Information Theory, the result has lower entropy.
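Because both maps fit in four bits, the pack/unpack round trip is fully lossless, which we can verify directly (random maps stand in for the real labels):

```python
import numpy as np

rng = np.random.default_rng(0)
label1 = rng.integers(0, 16, size=(800, 534), dtype=np.uint8)
label2 = rng.integers(0, 16, size=(800, 534), dtype=np.uint8)

# pack: label2 occupies the high nibble, label1 the low nibble
combined = (label2 * 16 + label1).astype(np.uint8)

# unpack and confirm nothing was lost
restored1 = combined % 16
restored2 = combined // 16
assert np.array_equal(label1, restored1)
assert np.array_equal(label2, restored2)
```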

Image Compression

A wide variety of image compression algorithms (sometimes called codecs) take advantage of the unique statistical properties of image data to increase the rate of compression. While a full overview of image compression is out of the scope of this post, we will touch on a few points that pertain to the problem at hand.

A cursory search for image codecs will return a wide variety of options including, PNG, JPEG, WebP, JPEG XL, and more. These codecs differ in several properties including the following:

  1. Lossless vs. lossy compression support: Some codecs support lossless compression, some support lossy compression, and some support both. In most cases lossy compression results in measurably better compression rates than lossless compression.
  2. Supported input formats: The different algorithms support different types of input. Typical limitations include the number of color channels that are supported, the number of bits per pixel that are supported, etc.
  3. Compression quality controls: The codecs differ in the degree and nature of control they enable over the resultant compression quality. For example, by tuning the quality controls we can manage the trade-off between the compression rate, on the one hand, and the speed of encoding/decoding and/or loss of information, on the other hand.
  4. Underlying compression algorithm: The algorithms underlying the codecs behave differently, optimize different functions, and exhibit different artifacts. For example, some algorithms might be more prone than others to the removal of specific image frequencies on which your algorithm relies.

The Structural Similarity Index Measure (SSIM) is a popular metric for evaluating the degradation in image quality resulting from a lossy image compression scheme. The SSIM value ranges from 0 to 1 with 1 implying an exact match with the original image. Other metrics for measuring image degradation include MSE and PSNR. As before, your goal should be to choose metrics that are able to predict the impact of the information loss on the performance of your data algorithms.
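For reference, MSE and PSNR are simple enough to compute directly with NumPy. The sketch below defines both and applies them to a synthetic image degraded by mild noise, which stands in here for compression artifacts:

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays of equal shape."""
    return np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)

def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means less degradation."""
    m = mse(a, b)
    return np.inf if m == 0 else 10 * np.log10(max_val ** 2 / m)

rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
# simulate mild compression noise of up to +/-2 intensity levels
noise = rng.integers(-2, 3, size=original.shape)
degraded = np.clip(original.astype(np.int32) + noise, 0, 255).astype(np.uint8)

print(f"MSE: {mse(original, degraded):.2f}")
print(f"PSNR: {psnr(original, degraded):.1f} dB")
```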

A few words of caution regarding the use of lossy image compression schemes are warranted:

  • While most codecs optimize for visual perception, your algorithm might be sensitive to elements in the image that are not visually apparent. In particular, relying on a high degree of visual similarity for evaluating a compression scheme should not replace an in-depth assessment.
  • Beware of websites that give off the impression that one particular codec is always better than all others. In practice, the relative performance of different codecs can vary greatly between domains of images (e.g., deep space images vs. medical images) and even between image samples within the same domain. It is highly advised that you conduct your own analysis on your own image datasets.
  • For video sequence data, you may feel compelled to adopt a video compression format. Video compression utilizes similarities between nearby frames to further increase compression rates. However, this often results in significantly reduced quality (e.g., as measured by SSIM) compared to image compression. In some cases, you may be better off compressing each frame independently.

Our toy example includes two opportunities for applying image compression. First, we compress the image using the classic lossy JPEG codec with a compression quality setting of 95 (see here for details on setting the quality value). Next, we compress the label map. Since we expect our ML algorithm to be highly dependent on the accuracy of our data labels, we choose the lossless PNG compression format so as not to lose any label information. Note that while both JPEG and PNG are extremely popular formats, they are not known for providing the best compression rates. Although sufficient for the purposes of our demonstration, you might get better results using more modern image compression algorithms.

The code block below demonstrates the image compression using the Pillow package (version 9.2.0). We apply the SSIM metric using the scikit-image package (version 0.19.3).

from PIL import Image
from skimage.metrics import structural_similarity as SSIM

Image.fromarray(combined_label).save('label.png')
Image.fromarray(image).save('image.jpg',quality=95)

decoded = np.array(Image.open('image.jpg'))

ssim = SSIM(image, decoded, channel_axis=2) # 0.996

The SSIM score in our example is 0.996, indicating a relatively low loss in the quality of information resulting from the JPEG encoding. Naturally, whether this level of image degradation is acceptable will depend on the sensitivity of the ML algorithm that consumes the data. Note that the compression quality we have chosen is relatively high. A lower quality setting would likely result in better compression but at the cost of a greater loss of image detail.
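The quality setting makes this trade-off easy to explore. The sketch below (with a random RGB array standing in for the real photo) encodes the same image at several JPEG quality settings and reports the resulting file sizes; the degradation at each setting could be measured with SSIM as shown earlier:

```python
import io

import numpy as np
from PIL import Image

rng = np.random.default_rng(0)
# a random RGB array stands in for the real photo
image = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)

sizes = {}
for quality in (50, 75, 95):
    buf = io.BytesIO()
    Image.fromarray(image).save(buf, format="JPEG", quality=quality)
    sizes[quality] = buf.tell()  # encoded size in bytes

for quality, size in sizes.items():
    print(f"quality={quality}: {size / 1024:.0f} KB")
```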

The following image displays the original input alongside the decoded output and the absolute differences between them (scaled by 20 for enhancement).

Impact of JPEG on the Photo by Joshua Sortino on Unsplash

Result

At this stage the image, label, and depth maps take up 341 KB, 205 KB, and 835 KB of storage space, respectively. The size of the tarball is 1.2 MB. By applying a general purpose ZIP algorithm to the tarball, this drops to 1.1 MB. This is less than half the size of the result of the trivial compression with which we started. The final compression sequence is summarized in the following code block:

import tarfile
from PIL import Image

combined_label = (label2 * 16 + label1).astype(np.uint8)
Image.fromarray(combined_label).save('label.png')
Image.fromarray(image).save('image.jpg',quality=95)

depth_new = depth.astype(np.float16)
with open('depth.bin','wb') as f:
    f.write(depth_new.tobytes())

with tarfile.open("final.tar.bz2", "w:bz2") as tar:
    for name in ["image.jpg", "label.png", "depth.bin"]:
        tar.add(name)

With this relatively simple sequence we managed to reduce the size of our data by 5.64X. This amounts to more than 80% savings in storage space. Applying this to your full dataset can have profound implications on your storage costs.

It is likely that we could go on and find additional opportunities for compression. However, the rate of compression of each additional action will most likely decrease. It is also likely that a different compression sequence would have resulted in an even better compression rate. Continued exploration may be warranted based on your existing storage costs and potential for savings.
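For completeness, here is a sketch of the matching decode path. It runs against synthetic stand-in data (the file names match the final code block above) so the full round trip can be verified: the label and depth maps are restored exactly up to float16 precision, while the image incurs the JPEG loss discussed above.

```python
import tarfile

import numpy as np
from PIL import Image

H, W = 800, 534

# synthetic stand-ins for the sample elements
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(H, W, 3), dtype=np.uint8)
label1 = rng.integers(0, 16, size=(H, W), dtype=np.uint8)
label2 = rng.integers(0, 16, size=(H, W), dtype=np.uint8)
depth = (rng.random((H, W), dtype=np.float32) * 100).astype(np.float32)

# encode, as in the final sequence above
combined = (label2 * 16 + label1).astype(np.uint8)
Image.fromarray(combined).save('label.png')
Image.fromarray(image).save('image.jpg', quality=95)
with open('depth.bin', 'wb') as f:
    f.write(depth.astype(np.float16).tobytes())
with tarfile.open("final.tar.bz2", "w:bz2") as tar:
    for name in ("image.jpg", "label.png", "depth.bin"):
        tar.add(name)

# decode
with tarfile.open("final.tar.bz2", "r:bz2") as tar:
    tar.extractall()
image_dec = np.array(Image.open('image.jpg'))     # lossy (JPEG)
combined_dec = np.array(Image.open('label.png'))  # lossless (PNG)
label1_dec, label2_dec = combined_dec % 16, combined_dec // 16
with open('depth.bin', 'rb') as f:
    depth_dec = np.frombuffer(f.read(), dtype=np.float16).reshape(H, W)
```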

Summary

In this post we have discussed the importance of data compression for data science in general and machine learning in particular. We demonstrated just a few simple compression techniques and measured their impact on the data storage size. As emphasized above, the methods we chose will not necessarily be the right ones for you. The keys to finding a good solution include: an intimate knowledge of the raw data types, a deep understanding of the ways in which they are consumed, and a good grasp of the relevant compression schemes.

As we begin the year 2023, we find ourselves deep into what is often referred to as the Big Data revolution. Proper data management, including the use of data compression schemes, is just one of the many important components of this revolution. Happy New Year.


Appendix: The Impact of Compression on the Data Application

Many data applications can be described as the continuous flow of extremely large amounts of data between different devices and device components. In a deep learning training workload, for example, raw training data is streamed from storage to CPU workers for pre-processing and batching; training batches are fed from the CPU into the training accelerator and then streamed through the different phases of the forward computation graph; gradients are calculated via backward propagation; and, in the case of distributed training, data is communicated between the participating accelerators.

The Flow of Data in a Typical DL Training Step (by Author)

In an ideal situation, the data will flow through the system fast enough to keep all of the computation resources fully utilized. However, sometimes you might find that your data flow is bounded by limitations on the bandwidth of the communication channel. This can lead to a performance bottleneck in the application and result in under-utilization of the compute resources. In this undesired situation, expensive resources remain idle as they wait for data input. Such issues can be resolved in a number of ways, including: increasing the communication bandwidth (e.g., using different instance types with different specifications), changing the application architecture, and/or reducing the size of the data.

If the size of your data is large and the communication bandwidth between storage and the application host is limited, you might be particularly susceptible to a data flow bottleneck. Storing your data in a compressed format reduces the potential for bottlenecks on the interface between the storage location and the application.

One potential tradeoff of compression is the additional compute resources that are required for compressing and uncompressing the data. If your data application is already compute intensive, you might find that storing your data in a compressed format relieves a data flow bottleneck in one part of the application only to introduce a compute bottleneck somewhere else. Thus, data compression in a data application pipeline can become a delicate art of balancing the utilization of different resources.

Tags: Amazon S3 Big Data Cloud Storage Compression Optimization
