How Data Scientists Save Time
Rumor has it that some people living among us have all the free time they need; if you're one of those four lucky individuals, you're probably meditating blissfully in a misty forest rather than reading these words.
The rest of us have a more complex relationship with time, efficiency, and the tension of balancing competing needs. A few hours can make the difference between shipping a machine learning project on time and missing a deadline. They can determine whether you get to spend an afternoon with friends, bake a cheesecake from scratch, rewatch the Succession finale, or… not do any of those things.
The internet is full of productivity hacks and time-saving tricks; we're sure you can find them on your own. Instead, what we offer you this week are pragmatic, hands-on approaches for speeding up workflows that data professionals execute every day. From choosing the right tools to streamlining your data-cleaning process, you stand to gain some precious minutes by learning from our authors, so we hope you use them well—or not! It's your time.
- Analyzing geospatial data can be a slow and elaborate process. For his TDS debut, Markus Hubrich highlights R-trees' power to dramatically improve the speed of spatial searches, and uses the (delightful) example of tree-mapping to illustrate his point. (For a taste of the technique, see the first sketch after this list.)
- Many libraries and tools promise to improve Python's famously sluggish performance. Theophano Mitsa benchmarks a recent arrival—Pandas 2.0—against contenders like Polars and Dask, so you can make the most informed decision when designing your next project. (A bare-bones timing harness follows below.)
- Still on the topic of speeding up Python-centric workflows, Isaac Harris-Holt's latest tutorial shows how to leverage the nimbleness of Rust by embedding it into your Python code with tools like PyO3 and maturin. (Staying true to our theme, it's also a quick and concise post.)
- How effective are large language models when it comes to executing complicated, nuanced processes? The verdict might still be out, but one area where they're already showing promise is text summarization, and Andrea Valenzuela's recent guide explains how you can use them to generate high-quality summaries quickly and consistently. (There's a summarization sketch below, too.)
- For Vicky Yu—and, we suspect, many of you as well—data cleaning can get tedious, fast. To breeze through this stage of your project, Vicky recommends creating user-defined functions (UDFs), which make it possible to simplify your SQL queries and avoid having to code the same logic over multiple columns in a table. (A minimal UDF example appears after this list.)
- CPUs or GPUs? If you work with massive amounts of data, you likely already know that your choice of hardware setup can have an outsized effect on the time and resources you'll need. Robert Kwiatkowski's helpful primer covers several use cases and maps the pros and cons of both processor options. (The last sketch below shows one quick way to compare the two.)
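
To give you a taste of the spatial-indexing idea before you dive into Markus's article, here's a minimal sketch using the rtree package (our choice for illustration; the article's tooling may differ). Bounding boxes go into the index, and range or nearest-neighbor queries then prune most of the dataset instead of scanning every point:

```python
# pip install rtree
import random

from rtree import index

# Build an R-tree over 10,000 random points (stored as degenerate bounding boxes).
idx = index.Index()
points = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(10_000)]
for i, (x, y) in enumerate(points):
    idx.insert(i, (x, y, x, y))

# Range query: which points fall inside a 10-by-10 window?
# The tree inspects only a handful of nodes rather than all 10,000 points.
hits = list(idx.intersection((40, 40, 50, 50)))
print(f"{len(hits)} points in the window")

# Nearest-neighbor queries use the same index.
nearest = list(idx.nearest((25, 25, 25, 25), 3))
print("ids of the 3 nearest points:", nearest)
```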
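
If you'd like to run your own mini-benchmark before committing to a library, a bare-bones harness like the one below is a reasonable start. It isn't the article's setup: we're assuming pandas 2.x and a recent version of Polars (where the grouping method is named group_by), and a single timed run is only a rough signal; repeat your runs for real comparisons:

```python
# pip install pandas polars
import time

import numpy as np
import pandas as pd
import polars as pl

n = 5_000_000
rng = np.random.default_rng(0)
keys = rng.integers(0, 1_000, n)
vals = rng.random(n)

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.3f}s")

pdf = pd.DataFrame({"key": keys, "val": vals})
pldf = pl.DataFrame({"key": keys, "val": vals})

# The same groupby-mean, timed in each library.
timed("pandas", lambda: pdf.groupby("key")["val"].mean())
timed("polars", lambda: pldf.group_by("key").agg(pl.col("val").mean()))
```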
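
As for LLM-powered summarization, the code side can be quite compact. The sketch below assumes the openai Python package (its v1+ client) and an API key in your environment; the model name is a placeholder, so swap in whichever one you have access to:

```python
# pip install openai  (expects OPENAI_API_KEY in your environment)
from openai import OpenAI

client = OpenAI()

def summarize(text: str, max_words: int = 60) -> str:
    """Ask a chat model for a short, faithful summary of `text`."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "You summarize documents faithfully and concisely."},
            {"role": "user", "content": f"Summarize in at most {max_words} words:\n\n{text}"},
        ],
        temperature=0,  # keeps summaries consistent across runs
    )
    return response.choices[0].message.content

print(summarize("Paste your long document here..."))
```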
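
Vicky's UDF advice translates to almost any SQL engine. As a self-contained illustration (not the warehouse setup from her article), Python's built-in sqlite3 module lets you register a scalar UDF with create_function and then reuse it across as many columns as you like:

```python
import re
import sqlite3

# A scalar UDF: keep only digits, so every phone-number format normalizes the same way.
def digits_only(value):
    return re.sub(r"\D", "", value) if value is not None else None

con = sqlite3.connect(":memory:")
con.create_function("digits_only", 1, digits_only)  # name, argument count, callable

con.execute("CREATE TABLE contacts (home TEXT, work TEXT, mobile TEXT)")
con.execute("INSERT INTO contacts VALUES ('(555) 010-1234', '555.010.5678', '+1 555 010 9999')")

# One UDF, reused across three columns: no copy-pasted cleaning logic per column.
row = con.execute(
    "SELECT digits_only(home), digits_only(work), digits_only(mobile) FROM contacts"
).fetchone()
print(row)  # ('5550101234', '5550105678', '15550109999')
```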
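
Finally, on the CPU-versus-GPU question: nothing beats timing your own workload. Here's a deliberately crude sketch (assuming PyTorch is installed; the GPU path runs only if CUDA is available) that compares a large matrix multiplication on both devices:

```python
# pip install torch  (a rough sketch; real benchmarks need warmup runs and repetition)
import time

import torch

n = 4_000
a = torch.rand(n, n)
b = torch.rand(n, n)

start = time.perf_counter()
_ = a @ b
print(f"CPU matmul: {time.perf_counter() - start:.3f}s")

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    torch.cuda.synchronize()  # finish the transfers before starting the clock
    start = time.perf_counter()
    _ = a_gpu @ b_gpu
    torch.cuda.synchronize()  # GPU kernels run asynchronously; wait before reading the clock
    print(f"GPU matmul: {time.perf_counter() - start:.3f}s")
else:
    print("No CUDA device found; only the CPU timing ran.")
```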
If you still have a few more minutes to spare, we hope you decide to spend them with some of our other recent standouts:
- How do ML models come into being and how do they transform over time? Valeria Fonseca Diaz explores the lifecycle of a model.
- In their new deep dive, Marco Tulio Ribeiro and Scott Lundberg advocate testing LLM-built applications using the same principles we would follow with any other software.
- Nazlı Alagöz recently shared an insightful reflection on the surprising parallels between academic and industry-focused data science practices.
- For an intuitive and accessible introduction to logistic regression, don't miss Igor Šegota's beginner-friendly explainer.
Thank you for supporting our authors! If you enjoy the articles you read on TDS, consider becoming a Medium member – it unlocks our entire archive (and every other post on Medium, too).
Until the next Variable,
TDS Editors