The Best Talks from PyCon US 2023

From April 19 through April 23, 2023, we attended our second PyCon, the largest annual convention for the Python programming language. Each year, PyCon holds several conferences worldwide, and we attended the US conference in Salt Lake City, Utah.
This year, PyCon celebrated its 20th anniversary, and during the opening session, a recap video showcased pictures and memories from past conferences shared by previous attendees. While this was only our second PyCon, we were thrilled to reconnect with friends we made last year and make new connections. Overall, we enjoyed the conference and wanted to recap some of our favorite talks related to data science.
If you weren't able to attend PyCon, don't fret! PyCon plans to make the talks available for everyone on its YouTube channel. We'll link to them here when they become available.
Here's a quick overview of the talks we'll cover:
- Feature Engineering is for Everyone! – Leah Berg and Ray McLendon
- Keynote – Ned Batchelder
- Why You Should Care About Open Source Supply Chain Security – Nina Zakharenko
- Building LLM-based Agents: How to develop smart NLP-driven apps with Haystack – Tuana Celik
- Into the Logisticverse: Improving Efficiency in Transportation Networks using Python – Uzoma Nicholas Muoh
- Generators, coroutines, and nanoservices – Reuven M. Lerner
- Civic Data Discussion
- Approaches to Fairness and Bias Mitigation in Natural Language Processing – Angana Borah
- 10 Ways To Shoot Yourself In The Foot With Tests – Shai Geva
- Python Meets UX: Enhancing User Experience with Code – Neeraj Pandey, Aashka Dhebar
Feature Engineering is for Everyone! – Leah Berg and Ray McLendon
Two days before the conference talks began, PyCon offered several different tutorials on a wide variety of topics. These tutorials were a great way for attendees to learn about a topic and actually apply it with hands-on examples in Python.
After presenting a natural language processing workshop last year, we were thrilled to be chosen again to teach a three-hour beginner workshop on feature engineering in Python. Typically, feature engineering is discussed only in the context of creating inputs for machine learning models. However, we took the opportunity to explain how it can also enhance data visualization, data quality, and interpretability.
Throughout the workshop, we covered how to explore and create features for discrete and continuous data. Attendees gained real-world experience by analyzing technology product reviews and stock market data in a Google Colab notebook.
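To give a flavor of the kinds of features we covered, here's a minimal sketch using pandas; the review data and column names are made up for illustration:

```python
import pandas as pd

# Hypothetical product review data for illustration
reviews = pd.DataFrame({
    "review_text": ["Great phone!", "Battery died fast", "Decent value"],
    "price": [799.00, 249.99, 399.00],
    "review_date": pd.to_datetime(["2023-01-15", "2023-02-03", "2023-03-21"]),
})

# Continuous -> discrete: bin prices into labeled ranges
reviews["price_tier"] = pd.cut(
    reviews["price"],
    bins=[0, 300, 600, 1000],
    labels=["budget", "mid-range", "premium"],
)

# Text -> numeric: a simple length-based feature
reviews["review_length"] = reviews["review_text"].str.len()

# Datetime -> categorical: extract the day of the week
reviews["review_day"] = reviews["review_date"].dt.day_name()

print(reviews)
```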
Now, let's dive into some of our favorite talks from the conference.
Keynote – Ned Batchelder
Ned Batchelder, an organizer of Boston Python, maintainer of coverage.py, and contributor at edX, kicked off the first day of conference talks with a keynote on the importance of communication for software developers.
While many conference attendees may have been surprised by the non-technical talk, we found it extremely refreshing and much-needed. Communication is an essential component of any collaborative environment, and Ned's emphasis on it at a tech conference served as an important reminder of its value in the field.

Ned's talk highlighted an essential point: every message, whether communicated in person, through text, over the phone, or any other medium, carries both information and sentiment. Regardless of your intent, people will interpret your message's sentiment based on various factors, including their history with you, their similarity to you, or their current emotional state.
Ned's tips for better communication included practicing humility, being explicit, and choosing your words carefully. As he shared examples of poor communication and how to improve it, we couldn't help but reflect on our own communication styles. It was a thought-provoking reminder that we've all likely given or received messages that were misinterpreted and that we can all benefit from striving to communicate more effectively.
Why You Should Care About Open Source Supply Chain Security – Nina Zakharenko

Many data scientists rely on open source Python modules such as pandas, numpy, and scikit-learn to carry out their work. However, it's easy to overlook the potential risks associated with these packages, including how they can be targeted by attacks and the impact this can have. If you're like us, you may not have considered these issues before, but it's essential to be aware of them to ensure the safety and integrity of your work.
Nina's talk gave an excellent overview of a wide variety of attacks at each step of the supply chain, along with recent examples of each:
- Unauthorized changes – An attacker fixes one bug while also introducing another (ex: Linux hypocrite commits)
- Compromised source code repository – An attacker gains access to a private system and makes malicious changes to the source code (ex: PHP self-hosted Git server)
- Build from modified source – An attacker builds the source code from a version that doesn't match the official repository (ex: Webmin)
- Compromised build process – An attacker gains access to the build system and injects malicious code (ex: SolarWinds)
- Compromised dependencies – An attacker compromises a widely used dependency to gain access to its dependents (ex: event-stream)
- Upload modified package – An attacker gains access to a system and can upload a modified package (ex: Codecov)
- Compromised package repository – An attacker compromises an entire package repository (ex: Browserify typosquatting)
- Dependency becomes unavailable – A package that many other packages rely on is no longer available (ex: left-pad)
Nina wrapped up the talk by sharing Scorecard, an OpenSSF tool that helps open source maintainers assess the security posture of their projects.
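One small, consumer-side mitigation worth pairing with this list is verifying a downloaded artifact's checksum against the one the project publishes. This wasn't part of the talk, just a minimal sketch using Python's standard library; the file name and expected hash are placeholders:

```python
import hashlib
import sys

def sha256_of(path: str) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Placeholders: the artifact you downloaded and the hash the project
# publishes (e.g., on its PyPI release page).
artifact = "some_package-1.0.0-py3-none-any.whl"
expected = "<published sha256 hash>"

if sha256_of(artifact) != expected:
    sys.exit("Checksum mismatch: the artifact may have been tampered with.")
print("Checksum verified.")
```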
Building LLM-based Agents: How to develop smart NLP-driven apps with Haystack – Tuana Celik
Deepset is a European company with a background in search engines. They've integrated generative pre-trained transformer (GPT) models into their work and created an open source library called Haystack, which lets you quickly and easily build an "agent" for a simple question answering (QA) system.

While Haystack can be used beyond QA systems, the talk focused specifically on this use case. Tuana began by explaining how Haystack turns documents into vectors, or "embeddings," which allows related documents to be quickly pulled from storage when a query is made. Using this method, the agent can answer queries in natural language, similar to ChatGPT.
One great advantage of Haystack is that it can use pre-existing models available on Hugging Face. Even simpler models can give great results without the need for a full GPT-4-style model, and fine-tuning those models can boost performance further toward a world-class QA system.
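As a rough illustration, here's a minimal extractive QA sketch in the spirit of Haystack's 1.x tutorials. The documents are toy examples, and exact imports and parameters vary across Haystack versions, so treat this as an outline rather than a definitive implementation:

```python
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline

# Index a few toy documents; a real system would ingest your own corpus
document_store = InMemoryDocumentStore(use_bm25=True)
document_store.write_documents([
    {"content": "PyCon US 2023 was held in Salt Lake City, Utah."},
    {"content": "Haystack is an open source NLP framework by deepset."},
])

# The retriever narrows the corpus; the reader extracts an answer span
retriever = BM25Retriever(document_store=document_store)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")

pipeline = ExtractiveQAPipeline(reader, retriever)
result = pipeline.run(
    query="Where was PyCon US 2023 held?",
    params={"Retriever": {"top_k": 2}, "Reader": {"top_k": 1}},
)
print(result["answers"][0].answer)
```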
With these technologies, a fully self-contained system can be developed to achieve GPT-level performance without exposing an organization's data to third-party APIs.
Into the Logisticverse: Improving Efficiency in Transportation Networks using Python – Uzoma Nicholas Muoh
Full disclosure, Nick is a friend of ours whom we met at PyCon 2022, and we really enjoyed his talk this year. Nick dove into the complex world of creating an efficient transportation system.
Nick began by discussing the different perspectives of transportation companies, drivers, and product companies. The challenge of transporting goods efficiently involves balancing the need to ship products on time, give drivers adequate rest, and reduce the number of empty cargo miles.

We enjoyed how Nick introduced several Python libraries we hadn't used before, like Google's ortools. It was a great reminder that many of the problems we try to solve with data science have already been tackled by other disciplines, such as operations research.
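To give a flavor of what ortools handles, here's a minimal sketch adapted from its standard traveling-salesman example; the four locations and the distance matrix are made up:

```python
from ortools.constraint_solver import pywrapcp, routing_enums_pb2

# Made-up pairwise distances between four depots
distance_matrix = [
    [0, 10, 15, 20],
    [10, 0, 35, 25],
    [15, 35, 0, 30],
    [20, 25, 30, 0],
]

# One vehicle, starting and ending at location 0
manager = pywrapcp.RoutingIndexManager(len(distance_matrix), 1, 0)
routing = pywrapcp.RoutingModel(manager)

def distance_callback(from_index, to_index):
    # Convert internal routing indices back to our matrix indices
    return distance_matrix[manager.IndexToNode(from_index)][manager.IndexToNode(to_index)]

transit_index = routing.RegisterTransitCallback(distance_callback)
routing.SetArcCostEvaluatorOfAllVehicles(transit_index)

params = pywrapcp.DefaultRoutingSearchParameters()
params.first_solution_strategy = routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC

solution = routing.SolveWithParameters(params)

# Walk the solved route from start to finish
index = routing.Start(0)
route = []
while not routing.IsEnd(index):
    route.append(manager.IndexToNode(index))
    index = solution.Value(routing.NextVar(index))
route.append(manager.IndexToNode(index))
print(route)  # e.g., [0, 1, 3, 2, 0]
```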
The talk also had a great component on using network graphs to visualize cargo hauling locations. This makes it easy to identify inefficient routes and find ways to optimize the overall transportation process. It was really helpful to see how such visualizations can be used to solve complex problems in logistics.
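As a rough sketch of that graph idea (with hypothetical cities, mileages, and a loaded/empty flag), a library like NetworkX makes this kind of analysis straightforward:

```python
import networkx as nx

# Hypothetical cargo legs: edges weighted by miles, flagged loaded or empty
G = nx.DiGraph()
G.add_edge("Chicago", "Dallas", miles=920, loaded=True)
G.add_edge("Dallas", "Atlanta", miles=780, loaded=False)  # empty backhaul
G.add_edge("Atlanta", "Chicago", miles=720, loaded=True)

# Flag the empty legs, which are candidates for route optimization
empty_legs = [(u, v) for u, v, d in G.edges(data=True) if not d["loaded"]]
print(empty_legs)  # [('Dallas', 'Atlanta')]
```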
Generators, coroutines, and nanoservices – Reuven M. Lerner
As data scientists, we've had a brief introduction to generators from some of our software developer friends but admittedly haven't used them much. We enjoyed learning more about them from Reuven, a popular Python educator and author.
A generator allows you to iterate over a sequence of values on the fly rather than generating all of the values upfront. Because a generator doesn't hold every value in memory at once, it's memory efficient and well suited to large data sets. A data scientist could use generators to read large data sets or to generate infinite sequences of numbers.
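For example, a generator can stream a large file one record at a time instead of loading the whole thing into memory. Here's a minimal sketch; the CSV file name is hypothetical:

```python
def read_large_csv(path):
    """Yield one parsed row at a time instead of loading the whole file."""
    with open(path) as f:
        header = next(f).strip().split(",")
        for line in f:
            yield dict(zip(header, line.strip().split(",")))

# Only one row lives in memory at a time, even for a multi-gigabyte file
for row in read_large_csv("transactions.csv"):  # hypothetical file
    ...
```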
A coroutine can consume and produce values. Coroutines can be used to create lightweight concurrent tasks that can pause and resume execution to allow other tasks to run. This can be useful for tasks such as network communication or input/output operations, where it's beneficial to switch between tasks instead of blocking the program until a task completes.
Finally, Reuven introduced the concept of nanoservices, which can be thought of as coroutines that sit within your program and can be accessed through an API (i.e. send()), similar to how microservices are divided into small parts with their own server and state. This approach allows for greater modularity and flexibility within your code, as you can access these small parts or "nanoservices" whenever you need them.
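A classic way to see both ideas at once is a running-average coroutine: a small, stateful "service" that lives inside your program and is accessed through send(). This is our own minimal sketch, not code from the talk:

```python
def running_average():
    """A coroutine that keeps state between calls and replies via send()."""
    total = 0.0
    count = 0
    average = None
    while True:
        value = yield average   # pause until a caller send()s a value
        total += value
        count += 1
        average = total / count

# Treat the coroutine as a tiny in-process "nanoservice"
avg_service = running_average()
next(avg_service)              # prime the coroutine
print(avg_service.send(10))    # 10.0
print(avg_service.send(20))    # 15.0
print(avg_service.send(30))    # 20.0
```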
Civic Data Discussion
One of the unique aspects of PyCon that cannot be accessed online or after the event is the Open Spaces. These are designated rooms where attendees can gather to discuss topics of interest.
We attended an Open Space dedicated to Civic Data and were pleased to see a diverse group of individuals sharing their unique perspectives. We had a few key takeaways from the discussion.
First, we discovered that each jurisdiction has its own policy for acquiring civic data. While this data is available to the public, obtaining it isn't always straightforward. Some jurisdictions require the data to be obtained in person or on a physical CD, which can pose challenges for those seeking the data.

A second point discussed was that third-party contractors often aggregate data, but the contracts don't always require the resulting datasets to be open. As a result, previously free but hard-to-obtain data can become expensive, which was surprising to us.
Finally, we learned about the United States Digital Service (USDS), a government group that's working to raise the standards of digital services in the US. We were pleased to meet some USDS individuals and to hear about the fantastic work they are doing to change contracts, making civic data more openly available.
Approaches to Fairness and Bias Mitigation in Natural Language Processing – Angana Borah
As professionals in the field of Natural Language Processing (NLP), we have a strong interest in the topic of fairness and bias. This talk was great for its comprehensive coverage of the subject, from a high-level overview to a deeper exploration of standard metrics and techniques used to develop better NLP solutions.
Angana's passion for the work she's been doing and will continue to do was clear in the talk. She highlighted the importance of addressing fairness and bias issues in NLP and the inadequacies of current training methods. Further funding and research are necessary to develop better systems to address these issues.
For a deeper dive into this topic, we recommend The Ethical Algorithm by Michael Kearns and Aaron Roth. Angana did an exceptional job of covering fairness and bias, and this book further expands on topics like privacy, which is a major concern with the latest large language models (LLMs) being used across the internet.
10 Ways To Shoot Yourself In The Foot With Tests – Shai Geva
Data scientists often lack a software development background, which can lead to neglecting best practices like writing tests. Shai's talk provided a fantastic introduction to the significance of not only writing tests but writing them effectively. In his talk, he covered 10 signs of a bad test.

- There are no tests – If you haven't written any tests for your project, start small and simple.
- If it doesn't fail, it doesn't pass – Always make sure your tests fail how you'd expect them to.
- Testing too many things – Each test should confirm a single fact about the behavior of your code (see the sketch after this list).
- Unclear language – Use decisive, explicit, and specific language in the names of your tests.
- The devil's in the details – Try to isolate all information in the test itself rather than bouncing around the code.
- Tests aren't isolated – You don't want the results of your tests to change based on the order in which you run them.
- Improper scope – Try to use cohesive behavior tests.
- Test doubles everywhere – Try not to use mocks, patches, etc. because changes to your codebase can quickly cause them to be out of date.
- Slow tests – Aim for three-second tests and run them in watch mode to quickly identify and address bottlenecks.
- Wrong priorities – The goal of tests is to catch bugs. They should have the following qualities in this order: maintainable, fast, and strong.
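To make a couple of these points concrete, here's a minimal pytest sketch in which each test checks a single fact and is named so a failure reads like a sentence. The pricing module and apply_discount function are hypothetical:

```python
# test_pricing.py -- run with `pytest`
import pytest
from pricing import apply_discount  # hypothetical module under test

def test_ten_percent_discount_reduces_price():
    # One fact per test: a 10% discount lowers the price accordingly
    assert apply_discount(price=100.0, rate=0.10) == 90.0

def test_zero_rate_leaves_price_unchanged():
    assert apply_discount(price=100.0, rate=0.0) == 100.0

def test_negative_rate_raises_value_error():
    with pytest.raises(ValueError):
        apply_discount(price=100.0, rate=-0.1)
```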
We loved the simplicity of Shai's talk and will definitely be keeping these principles in mind the next time we write tests.
Python Meets UX: Enhancing User Experience with Code – Neeraj Pandey, Aashka Dhebar
Neeraj's talk highlighted the synergies that exist between user experience and data science. User experience (UX) designers and researchers typically run a variety of experiments in which they collect data about design choices and/or user behavior. Data science techniques using Python can be used to help analyze the data.
Neeraj's case study on a shoe store perfectly illustrated these synergies. He demonstrated how one could use k-means clustering to personalize recommendations, promotions, and content, and how natural language processing (NLP) techniques like sentiment analysis could identify user pain points from customer feedback.
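As a rough sketch of the clustering step, here's how a k-means segmentation might look with scikit-learn; the per-customer features are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical per-customer features: [visits per month, avg spend, returns]
X = np.array([
    [2, 40.0, 0],
    [8, 120.0, 1],
    [1, 25.0, 0],
    [9, 150.0, 2],
    [3, 60.0, 1],
])

# Segment customers into groups for targeted recommendations
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
segments = kmeans.fit_predict(X)
print(segments)  # e.g., [0, 1, 0, 1, 0]
```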
Clickstream analysis was used to optimize website navigation and improve user engagement, while market basket analysis helped adjust store layouts based on frequently purchased items. Neeraj also showed how NLP techniques could automate the analysis of user research survey results, and how statistical techniques like hypothesis testing could assess the significance of A/B test results.

Overall, his talk highlighted how data science can help improve UX and drive business success. Although the techniques he discussed were not new to us, we appreciated his clear and engaging presentation style with beautifully crafted slides.
We've previously shared some of these techniques with our company's UX department, but we believe that Neeraj's case study will help further drive home the points we've been trying to make. We're excited to share his presentation with our colleagues in UX.
Conclusion
We had a great time teaching our second workshop at PyCon and received valuable feedback that will help us improve. We're excited to incorporate this feedback into an extended version of the workshop where we teach our data science process to help you increase the success rate of your projects (more on that here).
While we felt like there weren't as many data science talks as last year, we still found value in the sessions geared toward general software development.
If you're a data scientist looking for a Python conference to attend, we recommend this one, especially if you're interested in becoming a more well-rounded programmer and learning more about software development. The community is welcoming regardless of your skill level, and it's an excellent opportunity to expand your network.