Pandas: My Experience Contributing to a Major Open Source Project

Author:Murphy | View: 30108 | Time: 2025-03-22 21:59:46

Open Source

Open Source projects generally rely on the contributions of a multitude of people to keep them bug-free, secure, up to date, and constantly moving forward.

Who are these people? Well, "they" could be you and me! There is nothing stopping us from contributing, apart from a little time and effort. So, would it be worth our time? And what exactly is involved if we decide to give it a go?

I decided to go through the process of contributing to a well known, and heavily utilised, open source library – Pandas.

Hopefully, what I experienced along the way will give an insight into what is involved, and potentially highlight a way in which you might be able to contribute too. Or, at the very least, it will provide a good idea of what goes on behind the scenes!

Introduction

In one of my previous articles I highlighted a bug that I had discovered when using the Pandas library to generate boxplots.

I initially looked up the bug to see if anyone else had the same problem. I found a Stack Overflow post, which appeared to be about the same problem I was encountering:

set boxplot parameters in mplstyle or matplotlibrc

That post is getting on for 6 years old, and as far as I could tell there was no open bug report on the GitHub issue tracker for pandas.

…so I thought, why not try to fix the problem myself?

I've always been aware of the fact that open source projects exist, and by definition people must contribute to them. So why shouldn't I try to contribute too? I use free libraries all the time, why not give something back?

I also thought it would be an interesting process, and would give me a chance to see how open source projects the size of pandas actually operate internally.

How high is the bar to entry when contributing?
Would I be able to figure it out?
With so many different contributors, is the code like spaghetti, or is it really slick?
What systems do they use to deal with so many contributors?
How is the interaction with the maintainers?
…and, will I make a mess of it?

Well, it happened, so let's see how it actually panned out.

What exactly is an "Open Source Project"?

Just in case you need a primer…

If a project is "open source" it essentially means that all of the code used in the project is publicly accessible / viewable. I would describe this as the bare minimum description of "open source".

On top of that, open source projects often have licenses attached to them that allow the code to be used for almost anything, free of charge (i.e. commercial use, copying, adjusting, sharing etc.), usually subject to a few conditions such as retaining the copyright notice. For example, pandas uses the BSD 3-Clause License, which is very permissive.

This permissiveness opens the door for collaboration, and it is quite common for open source projects to be hosted on portals such as GitHub, which are designed to aid collaboration, project versioning and tracking.

Filing a bug report

Why are we talking about bug reports? I was meant to be fixing things, contributing? Right? Not just reporting a bug for someone else to deal with.

Well, as it turns out it wasn't that simple, as I'll get to, but it is also worth pointing out that the first port of call should be checking to see if the bug has already been reported.

With pandas, you can use their GitHub repository to search the open "issues":

Issues · pandas-dev/pandas

Which is exactly what I did…but I found nothing in the issue tracker that represented this exact problem.

Bug report, or fix?

As there is nothing reported, do I put in a bug report explaining the problem first, or try to create an immediate fix (typically called a "pull request", more on that later…)?

The most logical course of action would be to open a bug report first. This is advantageous, as it provides a chance for review of the problem with the official maintainers of the project (typically knowledgeable people). Like all good collaboration, this may give some new insight into what an appropriate "fix" might look like.

It also makes the problem permanently visible to the whole community. As I said, the particular bug I am looking to fix has been around for nearly six years, and that could partly be due to the fact that it hasn't been reported to the project in the first place.

Before putting in the report…

As this was my first foray into contribution for such a significant public project, I thought it might be wise to see if I could understand how to fix the bug first, even if I was just initially going to put in a bug report.

So that is what I did. I went and reviewed the relevant code (further details on how to go about that later), and came up with an appropriate fix for the problem.

However, understanding how a fix might work threw a spanner in the works, and also highlighted something important about large collaborative projects.

It is very important that the user experience remains consistent, and backwards compatible. As an end user of the library it would be quite annoying if code you wrote to produce a plot a year ago, suddenly produced a slightly different looking plot today with no warning.

It turns out that fixing the problem I had found would almost certainly change the status quo. Therefore a bug report, and discussion with the maintainers, was the only option.

So, in went the bug report (I go into quite a bit of detail in the bug report if you are interested in knowing exactly what the problem was):

BUG: Boxplot does not apply colors set by Matplotlib rcParams for certain plot elements · Issue…

I waited…

No response to the bug report as yet, over a month and a half later.

I want to be clear here. I am not complaining that no-one responded to the bug report. It is an inevitable part of collaborative development that priorities need to be made, and that type of problem may not be a priority right now. Resources are not endless (a good argument for more people to get involved!).

It may also be the case that due to the nature of the bug (i.e. no nice clear path to a simple fix) it needs more thought, or the expertise of someone who isn't available to the project at the moment.

To further drive home this point, I have found a similar discussion on a separate bug report, with that exact outcome of the proposed fix being (kind of) denied as the status quo would be changed:

Hexbin plot ignores matplotlib default colormap · Issue #31871 · pandas-dev/pandas

Regardless, I wanted to get stuck in, so I went searching for an existing bug to fix.

Finding an alternative bug to work on

Screenshot of the bug I found to work on from 2014 – Image by Author

There are currently "3.6k" open issues according to the pandas issue tracker. Plenty of choice.

I went searching through the bugs. I decided to look within the bugs tagged as "visualization", as I had become somewhat familiar with this particular area of the codebase from the previous bug I looked into. I also sorted by date, looking for problems that had been around a while, as there was less chance of stepping on someone's toes while I got familiar and learned.

I finally decided to go ahead with a bug that had a pull request (i.e. a fix) offered that did actually fix the problem, but was never merged into the main codebase.

The reason the pull request hadn't been merged is that some updates were requested by the maintainers, but were never provided. So the process had never been completed. In technical terms, the pull request went "stale".

This basically means I had the answer to the problem already! Although, isn't that kind of cheating, or at least missing the point of the challenge?

Well actually this situation is perfect for a first go. Regardless of the fact that I have the solution handed to me, I still have to go through the process of following the correct systems and checks to ensure the fix makes it into the main codebase. Which is a learning process all of its own.

That is typically why it is suggested that as a first port of call it would be a good idea to contribute to the documentation, rather than the codebase.

Updating the documentation still takes you through the same new, unfamiliar processes, but without the added complication of trying to fix a coding problem at the same time.

So I have a bug to fix, what next?

Attempting my first pull request

This section will outline the processes I had to go through to get to the point where I could successfully submit my first "pull request" (or fix) to the project.

This is not meant to be a step-by-step guide. It is just intended to give you a good overview of the processes involved, so that the ins and outs become a little clearer.

What is a "pull request"?

A "pull request" is typically the way that code is "offered" to a project that utilises the git versioning system. It is a request to the maintainers to "pull" your code (in this case the fix) into the main codebase of the project.

Note: you will need to be somewhat familiar with git, and more specifically, working with GitHub to be able to contribute in the first place. So if that is not something you are familiar with, then I would suggest getting the basics of using git nailed down first. There is plenty of very specific guidance as to commands you will need in the pandas documentation, but it will still be helpful to have some knowledge of how git works.

Setting up

Pandas has a very comprehensive and well laid out guide available to all contributors; it really is a great reference. It is detailed, and will guide you through the whole process literally step by step, as you can work out from the table of contents below:

Screenshot of the pandas contribution guidance table of contents – Image by Author

If you have some spare time I would thoroughly recommend giving it a read:

Contributing to pandas – pandas 2.2.2 documentation

To summarise, it is basically required to set a few things up before you can dig in to fixing a bug (or even contributing to the documentation):

Fork the pandas development branch to your own GitHub account
Clone your fork locally to your computer
Set up a working environment. This step is thoroughly covered in their documentation. Pandas recommends using mamba to create your working environment, although lots of different configurations are supported
Build and install the development version of pandas that was previously cloned from GitHub – this is basically a one-liner
Create a new git "branch" to work on your problem

Setup done. It is now time to get stuck in.
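As a rough sketch of the five steps above, the snippet below only builds and prints the commands rather than running them. The fork URL with the user name your-user-name and the branch name fix-boxplot-bug are placeholders, and the exact environment and install options are assumptions that may differ between pandas versions, so treat the contributing guide as the authoritative source.

```python
import shlex

# All names below are placeholders; check the pandas contributing guide
# for the commands that match the version you are working on.
FORK_URL = "https://github.com/your-user-name/pandas.git"  # hypothetical fork
UPSTREAM_URL = "https://github.com/pandas-dev/pandas.git"
BRANCH = "fix-boxplot-bug"  # hypothetical branch name


def setup_commands(fork_url=FORK_URL, branch=BRANCH):
    """Return the setup steps as a list of argument lists (nothing is executed)."""
    return [
        # 1-2. Forking happens on the GitHub website; then clone your fork locally
        #      and register the original repository as "upstream".
        ["git", "clone", fork_url, "pandas"],
        ["git", "-C", "pandas", "remote", "add", "upstream", UPSTREAM_URL],
        # 3. Create the development environment (assumes mamba is installed).
        ["mamba", "env", "create", "--file", "pandas/environment.yml"],
        # 4. Build and install the development version in editable mode
        #    (run inside the activated environment, from the repository root).
        ["python", "-m", "pip", "install", "-ve", ".", "--no-build-isolation"],
        # 5. Create a branch to work on.
        ["git", "-C", "pandas", "checkout", "-b", branch],
    ]


if __name__ == "__main__":
    for command in setup_commands():
        print(shlex.join(command))
```

Printing the commands first is a cheap way to review them against the guide before running anything.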

Finding the source of the bug

The biggest challenge is becoming familiar with the code base, and how things are laid out. Although this also has some advantages, as it forces you to understand how things work, and how they fit together.

By digging into the code base, it will expose you to real world techniques and approaches to solving coding problems. Some of which you may not have seen before. Overall, I found it to be a great learning experience.

In this particular case, as I have mentioned previously, the fix was already provided in the original pull request that went stale. However, as I had picked out a bug that had been lingering around in the issue tracker for years, the code base had changed quite a lot in the meantime. The code related to the fix, and hence the bug, had now migrated to a completely different file, so it still took some investigation.

Fixing the bug

I applied the new code:

# GH 7023 allow setting plot style when using errorbars
if style is not None:
    kwds["fmt"] = style

Yeah, that was it! Just goes to show that the effort isn't necessarily the actual adjustment, it is mostly the detective work to find the source of the issue.
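To make the effect of those two lines concrete, here is a minimal, self-contained sketch of the pattern. The function name build_errorbar_kwargs is hypothetical and does not reflect how the pandas code is organised; the only facts relied on are that the style string was not being forwarded before the fix, and that matplotlib's Axes.errorbar accepts a fmt format string.

```python
def build_errorbar_kwargs(style=None, **kwds):
    """Hypothetical helper: collect keyword arguments for an errorbar call."""
    # The fix: only forward the style when the user actually supplied one,
    # so plots created without a style keep their existing appearance.
    if style is not None:
        kwds["fmt"] = style
    return kwds


print(build_errorbar_kwargs(style="or:", yerr=[0.9, 0.3]))
print(build_errorbar_kwargs(yerr=[0.9, 0.3]))
```

Leaving fmt unset when no style is given is what keeps the change backwards compatible: existing plots that never passed a style are untouched.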

Then I checked that it had the desired effect, i.e. that the code provided in the bug report no longer caused an error.

It worked perfectly!

Writing and Running Tests

The main reason the original pull request went stale is that the author of the fix didn't provide any tests.

For collaborative projects at this level robust testing is essential. This is made quite clear by the extensive amount of tests that already exist in the pandas codebase.

So the first challenge is to make sure the changes that have been made by the fix don't break the existing tests. For this, the pandas project uses pytest, pretty straightforward really. Run pytest against the relevant test files, and check the output has no "fails". For example:

pytest pandas/tests/plotting/test_boxplot_method.py

An example of tests running successfully – Image by Author

If that goes OK, then typically you will need to write at least one test to ensure that the bug that has just been fixed doesn't recur (i.e. you don't get any regressions).

For this fix I wrote the following test:

def test_errorbar_plot_line_style(self):
    def _check_line_style(ax, expected):
        for line_num, line_data in enumerate(ax.get_lines()):
            received = [
                line_data.get_linestyle(),
                line_data.get_color(),
                line_data.get_marker(),
                line_data.get_markeredgecolor(),
                line_data.get_markerfacecolor(),
            ]
            assert received == expected[line_num]

    # GH 7023
    data1 = np.array([9, 3, 5, 1, 7])
    data2 = np.array([1, 2, 2, 8, 4])
    err1 = data1 * 0.1
    err2 = data2 * 0.1
    df = DataFrame({"data1": data1, "data2": data2})
    err_x = DataFrame({"data1": err1, "data2": err2})
    err_y = DataFrame({"data1": err2, "data2": err1})
    expected = [[":", "r", "o", "r", "r"], ["--", "g", "v", "g", "g"]]

    # check for single line
    ax = df["data1"].plot(xerr=err_x["data1"], yerr=err_y["data1"], style="or:")
    num_lines = len(ax.get_lines())
    _check_has_errorbars(ax, xerr=num_lines, yerr=num_lines)
    _check_line_style(ax, expected)

    # check for two lines on a single plot
    ax = df.plot(xerr=err_x, yerr=err_y, style=["or:", "vg--"], subplots=False)
    num_lines = len(ax.get_lines())
    _check_has_errorbars(ax, xerr=num_lines, yerr=num_lines)
    _check_line_style(ax, expected)

    # check for two lines, each on separate subplots
    axes = df.plot(xerr=err_x, yerr=err_y, style=["or:", "vg--"], subplots=True)
    for ax_num, ax in enumerate(axes):
        num_lines = len(ax.get_lines())
        _check_has_errorbars(ax, xerr=num_lines, yerr=num_lines)
        _check_line_style(ax, [expected[ax_num]])

A lot more code than the actual fix! However, having to do this does give a good introduction into testing if you are not familiar, as you have access to a lot of examples within the existing codebase.
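The expected values in the test follow from how matplotlib reads format strings such as "or:", where a marker, a colour and a line style are packed into one short string. The parser below is a deliberately simplified stand-in written for illustration. It only knows a handful of markers, single-letter colours and line styles, and it is not matplotlib's actual implementation.

```python
# Simplified subsets of what matplotlib understands (assumption: enough for
# the two style strings used in the test, nothing more).
LINE_STYLES = ("--", "-.", ":", "-")  # two-character styles are checked first
COLORS = set("bgrcmykw")
MARKERS = set("ov^s*+xD.")


def parse_style(style):
    """Split a format string such as 'or:' into (linestyle, color, marker)."""
    linestyle = color = marker = None
    i = 0
    while i < len(style):
        for candidate in LINE_STYLES:
            if style.startswith(candidate, i):
                linestyle = candidate
                i += len(candidate)
                break
        else:
            # No line style starts here, so treat it as a single character.
            char = style[i]
            if char in COLORS:
                color = char
            elif char in MARKERS:
                marker = char
            else:
                raise ValueError(f"unrecognised character {char!r} in {style!r}")
            i += 1
    return linestyle, color, marker


print(parse_style("or:"))   # (':', 'r', 'o')
print(parse_style("vg--"))  # ('--', 'g', 'v')
```

Those tuples line up with the first three entries of each row in expected. The last two entries are the marker edge and face colours which, as the test asserts, match the line colour here.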

You may also notice that the code for the method _check_has_errorbars doesn't appear in the code of my test. This is because it is an existing method within the testing repository.

This happens quite a lot (even outside of the tests section of the main codebase), and can provide great examples of how to go about writing good methods/functions. It also reduces the amount of work required to write the code in the first place.

Note: In theory, the correct approach would be to write the tests first, then write the new code to make the new tests pass.
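A minimal sketch of that test-first idea, using toy functions rather than pandas code. Both versions of the function are hypothetical: the first mimics a bug where the style is dropped, the second mimics the fix. The same test fails against the first and passes against the second, which is what makes it a useful regression test.

```python
def plot_kwargs_before(style=None):
    """Toy 'buggy' version: the style argument is silently ignored."""
    return {}


def plot_kwargs_after(style=None):
    """Toy 'fixed' version: the style is forwarded as fmt when given."""
    return {"fmt": style} if style is not None else {}


def check_style_is_forwarded(func):
    """The regression test, written once and run against either version."""
    assert func(style="or:") == {"fmt": "or:"}
    assert func() == {}


def passes(func):
    """Return True if the regression test passes for func."""
    try:
        check_style_is_forwarded(func)
    except AssertionError:
        return False
    return True


print(passes(plot_kwargs_before))  # False: the test catches the bug
print(passes(plot_kwargs_after))   # True: the fix makes the test pass
```

Seeing the test fail before the fix is applied is the evidence that it actually exercises the bug, rather than passing by accident.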

Continuous Integration (CI)

Now that I have a fix that satisfies all tests, even the new test I wrote, it is time to make a pull request, right?

Well, not quite. Again, due to the size of the pandas project, and the large number of contributors, the pandas project has incorporated Continuous Integration (CI) into their workflow.

So what is CI exactly:

In software engineering, continuous integration (CI) is the practice of merging all developers' working copies to a shared mainline several times a day. Nowadays it is typically implemented in such a way that it triggers an automated build with testing.

–wikipedia.org

The last part is particularly relevant here:

…automated build with testing

When the pull request is finally sent, it will be automatically checked. If it fails those checks then it will be marked as such, and is automatically blocked from merging into the main repository. Therefore, the last thing that needs doing is to run these checks locally to ensure there are no problems before issuing the pull request.

The pandas project has this covered with a simple one liner:

pre-commit run --hook-stage manual --all-files

Again, nice and easy. It will even lint your code for you.
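If you find yourself running the same checks before every push, they can be wrapped in a small script. The sketch below is an assumption about how one might do that, and is not part of the pandas tooling. It runs the two commands quoted in this article in order and stops at the first failure. The runner parameter is a hypothetical hook so that the logic can be exercised without actually invoking pytest or pre-commit.

```python
import subprocess

# The two commands quoted in this article; adjust the test path to your change.
CHECKS = [
    ["pytest", "pandas/tests/plotting/test_boxplot_method.py"],
    ["pre-commit", "run", "--hook-stage", "manual", "--all-files"],
]


def run_command(command):
    """Run one command and return its exit code (0 means success)."""
    return subprocess.run(command, check=False).returncode


def run_checks(checks=CHECKS, runner=run_command):
    """Run each check in order; return the first failing command, or None."""
    for command in checks:
        if runner(command) != 0:
            return command
    return None


# Dry run with a fake runner that pretends the tests pass and the linter fails.
fake_results = {"pytest": 0, "pre-commit": 1}
failed = run_checks(runner=lambda command: fake_results[command[0]])
print(failed)
```

Calling run_checks() with no arguments would run the real commands from the root of your pandas clone.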

Side Note: the pandas project uses the ruff linter, which is written in Rust, and as a result is blazing fast!

Submitting the request

Screenshot of my first "Pull Request" to the Pandas project – Image by Author

Push your local branch to your GitHub fork of pandas, then click a button which automatically appears on your GitHub account after the push.

Fill in some details about the pull request, for which a template and guidance is provided (see the picture above for reference), and done.

Finally, submitted! Ultimately, still sat there unnoticed to this day.

Such is life…so I tried again!

Finding a second bug to work on

Same criteria as the previous bug. I found the following:

Screenshot of the second bug I found to work on – Image by Author

Spoiler alert! You may notice that the bug in the image above is stated as "Closed". Which effectively means this time I got a response from the maintainers, and had the pull request merged.

So now let's pick up where we left off with the first attempt, and see what the rest of the process looks like.

The pull request

Screenshot of my second "Pull Request" to the Pandas project – Image by Author

BUG: Fix error for boxplot when using a pre-grouped DataFrame with more than one grouping by…

The fix and the tests

For reference the fix and the tests that finally went through look like this:

The fix in the code that was accepted (red = removed, green = added) – Image by Author

The tests:

@pytest.mark.parametrize("group", ["X", ["X", "Y"]])
def test_boxplot_multi_groupby_groups(self, group):
    # GH 14701
    rows = 20
    df = DataFrame(
        np.random.default_rng(12).normal(size=(rows, 2)), columns=["Col1", "Col2"]
    )
    df["X"] = Series(np.repeat(["A", "B"], int(rows / 2)))
    df["Y"] = Series(np.tile(["C", "D"], int(rows / 2)))
    grouped = df.groupby(group)
    _check_plot_works(df.boxplot, by=group, default_axes=True)
    _check_plot_works(df.plot.box, by=group, default_axes=True)
    _check_plot_works(grouped.boxplot, default_axes=True)

Note: the _check_plot_works method in the test above is an existing method within the test repository of pandas. Always remember to check, both within the testing suite, and in the main code base, if there are any existing methods that may be of use. Don't repeat work that is already done!

The interaction with the maintainer

After the pull request is submitted it then needs to be reviewed by a maintainer. As you have likely gathered by now, there is no fixed timeline for a response, so be patient!

Another great thing about open source projects is that even the discussions are public, so if you want to know exactly what was discussed between me and the maintainer, then you can read it yourself here:

BUG: Fix error for boxplot when using a pre-grouped DataFrame with more than one grouping by…

There are no private messages, or offline discussions. Everything that was discussed is there in plain sight.

Changes are expected

Don't expect to nail everything first time (or ever really), and don't be offended when the maintainer asks you to change something. Your fix may indeed work perfectly, but it may not be optimal. Or, you may not have all the context about the project in general that the maintainer has.

This is exactly what happened to me. If you take a look at the initial pull request I made, you can see that the original fix was eventually completely changed. The original looked like this:

Initial pull request fix – Image by Author

It was a very simple fix, and it worked. Essentially, the "key" needed to be a string, which it sometimes wasn't in the original code. pprint_thing() is an existing pandas codebase method that converts objects to strings.
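To see why the key is not always a string, it helps to look at what a grouped DataFrame hands back. The sketch below is an illustration and not the pandas plotting code. It assumes the problem arises because grouping by several columns produces tuple keys, and it uses the built-in str() as a stand-in for pprint_thing(). The column names and values are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "X": ["A", "A", "B", "B"],
        "Y": ["C", "D", "C", "D"],
        "Col1": [1.0, 2.0, 3.0, 4.0],
    }
)

# Grouping by a single column name yields plain string keys here.
single_keys = [key for key, _ in df.groupby("X")]
print(single_keys)  # ['A', 'B']

# Grouping by two columns yields one tuple per combination.
multi_keys = [key for key, _ in df.groupby(["X", "Y"])]
print(multi_keys)  # [('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D')]

# Converting every key to a string gives downstream code one type to handle.
labels = [str(key) for key in multi_keys]
print(labels)
```

Code that assumes every key is a string works for the first grouping and can trip over the second, which is consistent with the bug only appearing for more than one grouping.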

The maintainer had a suggestion:

Suggestion of more optimal code for the fix – Image by Author

Bear in mind that the line he is referring to, ret = pd.Series(dtype=object), already existed in the code base and was not written by me. Regardless, he offered a more optimal fix for the problem.

This is a great example of how collaboration really works:

I found and fixed the problem (although not optimally)
A second pair of more experienced eyes (the maintainer) managed to provide an optimisation
I applied the suggestions in the form of an update to the pull request, allowing the maintainer to concentrate on other things

Without my initial effort, this bug would likely have sat in the bug tracker for a very long time. It has already been there since 2016 (8 years!). However, without the input of the maintainer, the codebase would be subpar in the long run. The project benefited from the combination, the collaboration.

…and here it is finally being merged into the main repository of pandas:

There it is: the fix was finally merged into main – Image by Author

Tidying up

Now everything has been merged, the branch I was working on can be deleted:

Delete the branch used to work on the bug – Image by Author

…and that's it!

Why you might want to contribute too

After going through this process, I can thoroughly recommend giving contribution to an open source project a go. It doesn't have to be pandas, but I can attest to the extensive documentation to guide you along the way, so it is potentially a good place to start.

The experience is especially valuable if you have not had exposure to a large real world codebase, such as that in the pandas project.

From my point of view it was certainly an eye opener, and a great learning experience:

How the code is structured, and the file tree laid out
The extensive set of tests that exist guarding the codebase from regression and errors
The automated testing and continuous integration to ease workflow
Performance enhancements in development with the use of pip/meson allowing auto rebuilds of the codebase with multicore usage
Using git (and GitHub) to collaborate, and integrate new code
Collaboration and discussion with other project contributors

…and yes, in some cases the code appears to be suboptimal, and a little disjointed. I think this slight fragmentation is inevitable with so many different people involved. However, I believe the systems that are in place do a good job of keeping that to a minimum. Not an easy task by any measure.

Also remember, there is no rush. There are literally thousands of bugs to contribute to, and more coming every day. Plus, as you can see from my own contribution, the code changes can be very minimal if you choose the right problem to get started with. Or, just contribute to the documentation, which is significantly more important than you might think.

Conclusion

All in all it was a great experience, and gave a great insight into the workings of one of the libraries that I, and many other people, use on a regular basis.

You really can make a difference to the progress of a large project like pandas. It just requires a little patience to get up to speed, and a little bit of your time.

It is also likely that you will benefit in some way, whether that is picking up coding technique, or improving your project testing and automation systems. Even your git knowledge may improve, and with the continued rise of open source projects, and collaboration in general, that is a great skill to have.

I personally have no doubt that I will try to make some time to contribute again in the future. Maybe this time with better optimised code!

…and finally, a quick thank you to Richard Shadrach for his suggestions and patience with my first pull request for the pandas project.
