Should We Be Virtualizing Our Data Science Systems, or Not?

As "big data" becomes more and more relevant to problem-solving across every industry, data repositories at both homelab and data-lake scale require more parallelized computing power than ever before to extract, transform, load, and analyze data. While building my own homelab, I was stumped by the decision between deploying my parallelized setups on virtual machines or natively on hardware, and I struggled to find performance comparisons. In this article, we'll explore some of the pros and cons of each setup, along with side-by-side performance benchmarks of both the virtual and the native approach.
Introduction
Many parallelized compute clusters consist of multiple nodes: computers designated to process the tasks distributed across the cluster. Managing those nodes can be a major headache, which is part of why Data Engineering roles are so lucrative compared to their analytical counterparts. Companies typically manage entire fleets of clusters, making it nearly impossible to give individual attention to each node; instead, "high availability" setups built with tools such as Proxmox, Kubernetes, and Docker Swarm are requirements for the modern enterprise. You've likely interacted with such a cluster this week without realizing it – the chicken sandwich I had for lunch from Chick-fil-A was famously fulfilled via an edge-computing Kubernetes cluster running their point-of-sale system.
There are many benefits to computing in virtualized machines, including:
- Entire operating systems can be deployed from corporate servers to the field nearly instantaneously
- Images can be backed up in real-time
- Deployments can be containerized to limit scope and increase security
- In the event of hardware failures, systems can be migrated with minimal downtime
These are not new concepts by any means, but with a growing need for data analysis at every level of an organization, the way parallelized deployments are accessed can and should vary. The downside of virtualization is that, generally, the further you get from bare metal, the more your system's performance is impacted. One developer working on an Excel file may never notice, but data analysis at gigabyte or even terabyte scale needs to carefully consider how and when to use virtual tools, and to build setups that keep processing capability in mind.
Setting up our comparison
To put this to the test, we can compare the setup of a small to medium-sized organization using readily available enterprise hardware (I can't afford the good stuff). In my homelab, I have a computing cluster built from multiple refurbished enterprise units. I have a few other articles linked below on how I built this setup and what I use it for, but for now let's compare the performance of a virtual system against a bare-metal system and measure the impact of virtualization specifically.
A Complete Guidebook on Starting Your Own Homelab for Data Analysis
Building Distributed Machine Learning Models on a Homelab Cluster with Python
Since writing the above articles, I've upgraded my setup slightly by adding six HP EliteDesk 800 G3 Minis with Intel Core i7-7700 processors, 32 GB of DDR4-2400 RAM, and 256–512 GB SATA III SSDs. I purchased these units cheaply at around $80 apiece from an auction site and paid an extra $30–40 each to trick them out with new RAM and hard drives. The processors are all 65W models paired with 90W power supplies, and even by today's standards they are nothing to sniff at: 4 cores, 8 threads, with a turbo boost above 4 GHz.
For today's comparison I have two nodes side by side. One is running Proxmox, a Linux-based OS optimized for virtualization and deployment, hosting a Windows 10 Pro VM; the other is running Windows 10 Pro on bare metal. There is no "right" OS to use – the choice depends heavily on the tools one prefers – and each has its own pros and cons.
Proxmox
One of the pros of Proxmox is that it claims to have only a marginal impact on baseline performance. At rest, the hypervisor I've deployed on a node idles with very low resource usage. The screenshot below captures the performance summary of a single node: at idle, only a fraction of a percent of the CPU is in use, which also correlates to very limited wattage (and power bills).

Once a guest OS is deployed, however, it's a totally different story. Resource utilization at that point is determined nearly entirely by the VM configuration.
A few of the notes I've learned from playing around with Proxmox are:
- There is a fairly steep learning curve. This is an enterprise tool, and while plenty of documentation exists for using Proxmox, expect to spend a significant amount of time reading the documentation, forums, and even Reddit threads.
- On a similar but positive note, plenty of documentation is available for solving problems as they arise, and the community built around the platform is strong.
- Proxmox, while having a very intuitive GUI, still requires some "thinking outside of the box" to solve problems. For example, once I created a Windows VM and tweaked it to the standard I wanted, I couldn't "drag and drop" it to another node with the same ease I had when starting the image for the first time. I had to create network-attached storage (NAS) by adding an external drive to a broken laptop I have running Windows as an SMB share (some folks may remember the fabled glowing laptop in the corner of my apartment from my first article). This storage acted as a middleman and backup repository for cloning and migrating my VMs.
- I'm not a big Linux guy. I know, I know, I really ought to dive a little deeper into it, but the convenience Macs and Windows PCs have offered me over the years has left me struggling at first whenever I have to complete an operation with a CLI that I could normally just fumble through by clicking.
- Proxmox is very easily scalable. Adding my first node to a cluster or "datacenter" as it's referred to took a little while for me to figure out, but adding all of the other nodes took no time at all. Out of the box I was able to customize each node the way I wanted by completing admin tasks such as assigning static IP addresses. Deploying VMs took mere minutes once I got the hang of it too.
- It's pretty cool, and I really geek out on the dashboard that presents all of the operating statistics I closely monitor during operation. In the screenshot below, we can see the dashboard tracking not only one node's usage but also summaries of the other nodes. Being able to flex between nodes in our datacenter is huge, whereas monitoring on bare metal may require remoting between instances.

Finally, one thought I had in retrospect: no matter how much time I've spent configuring Windows VMs to operate exactly how I want them (it took me all night earlier this week to configure nested virtualization so Docker would work), there is always one more barrier or bottleneck.
Containerizing
I won't even measure a comparison for Docker, as the containers I was trying to spin up in the VM (a Minecraft server for an obligatory bi-annual Minecraft night with friends from college) could not reach an acceptable level of performance (the server could not keep up and was unplayable). My weekend plans were only slightly foiled by relying on nested virtualization, but there are practical applications that could be impacted as well.
One tool I frequently use for work and play is PyCaret, a low-code Python library for training and ensembling machine-learning models. ML models frequently have processor- or architecture-specific caveats, such as PyCaret not working on M1 Macs due to the ARM architecture, TensorFlow not being optimized for a Radeon graphics card I was using, and AutoGluon not building on my i7 Mac (I don't even know why). As a result, these packages are frequently containerized into Docker applications for portability. I've also been looking into DynamoDB local running in Docker, to take advantage of a powerful AWS NoSQL architecture without paying the big bills associated with it in the cloud. Time and speed are the selling points of these tools, and the impact of nested virtualization on Docker is huge (at least on these PCs); the penalty compounds, costing roughly 10% or more of performance at each additional layer of virtualization.
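As a concrete (if simplified) example, below is the kind of minimal PyCaret workflow I end up packaging into a Docker image. The dataset ("juice") and settings come from PyCaret's bundled examples rather than my actual projects, so treat this as a sketch of the workload, not my exact pipeline.

# A minimal PyCaret classification workflow – the sort of job I containerize.
# The "juice" demo dataset and these settings are placeholders from PyCaret's
# bundled examples, not my production pipeline.
from pycaret.datasets import get_data
from pycaret.classification import setup, compare_models, finalize_model, save_model

data = get_data("juice")                      # small bundled demo dataset

# Initialize the experiment; session_id pins the random seed for repeatability
setup(data, target="Purchase", session_id=42)

# Train and rank several candidate models, keeping the best performer
best = compare_models()

# Retrain on the full dataset and serialize the pipeline for deployment
final = finalize_model(best)
save_model(final, "best_pipeline")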
One added reflection, for folks who would point it out: LXCs (Linux containers) that can run Docker share the host OS kernel, so a heavy workload like an ML model that triggers a kernel panic or memory-swap failure will kill not only the container but the host OS as well (instead of just a guest OS). For that reason I wasn't even considering them here, although they are definitely a useful tool for lighter applications.
Even without measuring, we can see that getting as close to bare metal as possible improves the performance of our tools. Some folks, however, have been able to achieve great performance while remaining virtualized. AWS Nitro, for example, is a real differentiator in the field and has contributed to Amazon's wide-scale computing and data warehousing success, albeit at a significant cost: a data science tool such as SageMaker can cost, in one month for one notebook, what I paid for each of my desktops. We can see below that one standard SageMaker Studio notebook instance, run eight hours a day, five days a week, with similar specs to our machine (even with a lower clock speed), costs approximately $75 a month. All in all, each of my units costs maybe $100–120 a node with upgrades, and power is affordable at 10–15W idle and 65W at peak, which roughly equates to $2–3 a month in electricity. Over a full year, that is close to an order of magnitude in savings.
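For anyone who wants to sanity-check that claim, here is the back-of-the-envelope math using the rough figures quoted above; the dollar amounts are approximations from this article, not exact billing data.

# Back-of-the-envelope comparison using the rough figures quoted above.
# These are approximations for illustration, not a formal cost analysis.
sagemaker_monthly = 75       # one comparable SageMaker Studio notebook, ~8 hrs/day, 5 days/week
node_hardware = 120          # one refurbished mini PC with upgrades (one-time cost)
node_power_monthly = 3       # rough electricity cost per node per month

months = 12
cloud_first_year = sagemaker_monthly * months
homelab_first_year = node_hardware + node_power_monthly * months

print(f"Cloud notebook, first year:  ${cloud_first_year}")
print(f"Homelab node, first year:    ${homelab_first_year}")
print(f"Approximate savings factor:  {cloud_first_year / homelab_first_year:.1f}x")
# Once the hardware is paid off, the recurring gap is even wider
# (roughly $900/year in notebook fees vs. ~$36/year in power).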

That being said, I'm sure that as time progresses and better, faster hardware reaches the consumer and second-hand markets, the gap between virtualized and physical performance will shrink. If Intel wants to send me an i9-13900K, or NVIDIA an RTX 4090, I'd be happy to put it to the test and report back to everyone. In the meantime, I'll settle for my HP mini PCs and the AWS free tier for my data analysis and virtualization needs.
The Comparison
To actually compare virtual versus physical performance, we will run both a generalized benchmark test and a Python performance test on the Windows 10 VM and on the physical system. I'll note here that I am allocating the same number of cores and the same amount of RAM to both the VM and the PC (although this feels a little risky on the VM side, as allocating "all" cores has impacted host hypervisor performance for me before and caused a system failure).
Without further delay, below is a snapshot of the benchmark for the VM, built off of userbenchmark.com and run locally. We can see that the vCPU is performing well below its baseline and the model average, and our RAM slightly so. This indicates that either our CPU is not being well utilized, or there is significant overhead on the CPU hosting the VM during the mathematically intense operations in this test. See the screenshot for specific metrics such as integer calculation performance.

Not looking too hot overall, although the unit itself is putting out some BTUs right now.
Let's also assess performance by running a simple Python script as a baseline (note that this runs single-threaded due to the GIL). The admittedly unscientific script below produces a rough "speed calculation" for comparing relative performance.
In two notebook cells, we are:
- Counting to an arbitrarily large number and adding each number to a list
- Multiplying increasingly large numbers in a loop
Each of these tests is timed and repeated to produce a baseline of relative performance, comparing processing speed and memory I/O. The code snippets are below, in case you are interested in running them yourself to compare further.
def test1(n):
    # Append each integer up to n to a list (exercises memory allocation and list growth)
    l = []
    for i in range(n):
        l.append(i)

n = 100000
%timeit -r 5 -n 1000 test1(n)

def test2(n):
    # Multiply increasingly large integers (exercises raw CPU arithmetic)
    for i in range(n):
        i * (i - 1)

n = 100000
%timeit -r 5 -n 1000 test2(n)
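The %timeit magic above only works inside IPython or Jupyter. If you want to reproduce the comparison from a plain Python interpreter, a roughly equivalent sketch using the standard library's timeit module might look like this (it reports best-of-five timings rather than %timeit's mean):

# Roughly equivalent timing without IPython, using the standard library.
import timeit

def test1(n):
    l = []
    for i in range(n):
        l.append(i)

def test2(n):
    for i in range(n):
        i * (i - 1)

n = 100000

# Five repeats of 1,000 calls each, mirroring "%timeit -r 5 -n 1000"
for name, func in (("test1", test1), ("test2", test2)):
    runs = timeit.repeat(lambda: func(n), repeat=5, number=1000)
    per_call = min(runs) / 1000          # seconds per call (1,000 calls per run)
    print(f"{name}: {per_call * 1e3:.2f} ms per call (best of 5 runs)")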

On average, the first script finished in 8.84 ms and the second in 11.5 ms. We'll compare these VM numbers against bare metal shortly.
After running our scripts, we can see that a good chunk of RAM is in use; however, the CPU is showing very little utilization, and I'd be concerned about rubber-banding if I attempted to distribute this task over multiple threads. 12.5 GB in use at idle is a significant amount of overhead and something to optimize with further research.

Now for bare metal…
Running Windows 10 Pro natively on the same hardware, we can see significantly better benchmark results using the same test suite we used for the VM.

Running natively, our processor delivers nearly twice the integer-calculation performance of the virtualized instance; it really blew the VM out of the water, and we are much closer to the baseline and the processor average in general. As for our RAM, we saw significantly better read and write speeds, and single-core throughput was nearly 3x better. Operating on bare metal truly makes a huge impact on I/O and processing speed.
When running our Python tests, we saw similar jumps in performance. Our test scripts ran roughly twice as fast as they did virtualized, at 4.06 ms for test 1 and 6.04 ms for test 2 – about half the runtime of the same tests on the VM.
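To put those timings side by side, the speedups work out to roughly 2.2x for test 1 and 1.9x for test 2; a quick calculation using the numbers reported above:

# Relative speedup of bare metal over the VM, using the timings reported above (ms).
vm_ms = {"test1": 8.84, "test2": 11.5}
metal_ms = {"test1": 4.06, "test2": 6.04}

for test in vm_ms:
    speedup = vm_ms[test] / metal_ms[test]
    print(f"{test}: {vm_ms[test]} ms (VM) vs {metal_ms[test]} ms (bare metal) "
          f"-> {speedup:.2f}x faster")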

Running unvirtualized, we can also see the used RAM is half of what the VM occupied at idle. All in all, this shows that running on bare metal can lead to significant improvements in processing and memory performance compared to the same hardware running virtual machines.

There is no one-size-fits-all solution to every team's unique data science needs. For the enterprise, it may make sense to spend more on virtual analytical tools where security and performance can be closely monitored. Smaller companies may be able to leverage cloud tools too, depending on the size of their wallets. For individual and small research teams, however, building on bare metal may be a necessity for optimal performance.
The strategy I use for building my projects and pipelines is not to focus on managing particular hosts and nodes, but rather to keep a backup of a fresh Windows install (minus the bloatware) with exactly what I need preinstalled – certain Python packages, code redistributables, server connections, etc. The rest of the project is centralized in a code repo that I can copy and deploy to nodes for processing at runtime. Gigabit connection speeds, ubiquitous on computers from the mid-2010s onward, are more than fast enough to move the data and packages these workloads need. That decreases the need for high-availability computing and OS uptime, since I manage the hard drives in bulk and don't make large changes locally. Some services still require active management, such as containers running in Docker, but those would require fairly active management anyway, and keeping them on my native Windows 10 Pro installation makes more sense for how I want to spend my time, because stuff is going to break either way.
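As a rough sketch of what that runtime deployment looks like, the snippet below mirrors a local project repo to each node over an SMB share before kicking off processing. The node names, share path, and repo location are hypothetical placeholders for illustration, not my actual layout.

# Hypothetical sketch: mirror a local project repo onto each worker node's
# SMB share before a run, so nodes only need a clean OS image plus this payload.
# Hostnames and paths are placeholders, not my actual network layout.
import shutil
from pathlib import Path

LOCAL_REPO = Path(r"C:\projects\pipeline")
NODES = ["node01", "node02", "node03"]       # the worker minis
REMOTE_SHARE = r"\\{host}\jobs\pipeline"     # share exposed on each node

for host in NODES:
    destination = Path(REMOTE_SHARE.format(host=host))
    # Copy over whatever was left from the last run (requires Python 3.8+)
    shutil.copytree(LOCAL_REPO, destination, dirs_exist_ok=True)
    print(f"Synced repo to {host}")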
What do you think? Where do you run your code? What tools and platforms do you prefer to host your data science workflows? Let me know below, or feel free to network with me on LinkedIn!
Curious about the hardware I'm using? Check out my reviews at www.willkeefe.com