Benchmarking Pytest with CI/CD Using GitHub Actions


"Your code is slow" is something that is easily said, but it would take a lot of trial and error and testing to find out which part of the code is slow, and how slow is slow? Once the bottleneck of the code is found, does it scale well with an input that is 100 times or 1000 times larger, with results averaged across 10 iterations?

This is where pytest-benchmark comes in handy.

Complementing the idea of unit testing, which is to test a single unit or small part of the codebase, we can expand on this and measure code performance easily with pytest-benchmark.

This article will touch on how to set up, run, and interpret the benchmark timing results of pytest-benchmark. To properly enforce benchmarking in a project, the advanced sections also touch on how to compare benchmark timing results across runs and reject commits if they fail certain thresholds, and how to store and view historical benchmark timing results in a box plot and line chart!

Pytest with Marking, Mocking, and Fixtures in 10 Minutes

Setting up pytest-benchmark

This can simply be done with pip install pytest-benchmark in the terminal.

To enable additional features, such as visualizing the benchmark results, we can run pip install 'pytest-benchmark[histogram]' to install the extra packages required.


Structure of a test

Similar to pytest with added benchmark fixture

Formally, fixtures are reusable components used to set up or tear down code, control resource lifetimes, or parametrize tests. In working layman's terms, fixtures are used as arguments to the test function, and pytest injects the argument at runtime.
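
As a quick, hypothetical illustration (the sample_list fixture below is not part of the benchmarking example), pytest matches the test argument to the fixture by name and passes in the fixture's return value at runtime:

import pytest

@pytest.fixture
def sample_list():
    # Reusable setup: pytest calls this and passes the returned list
    # to any test that declares an argument named sample_list
    return [3, 1, 2]

def test_sorted(sample_list):
    assert sorted(sample_list) == [1, 2, 3]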

After installation, the keyword benchmark is available as a pytest fixture. Benchmarking can be done by providing the target function and the argument that goes into the function. The convention of a unit test structure still applies, where the test file and test function must start with the prefix test_.
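
For instance, a minimal test (with a hypothetical test_sorted_benchmark function; any callable works) passes the target function and its arguments to the benchmark fixture, which times the call over several runs and returns the function's result so it can still be asserted on:

def test_sorted_benchmark(benchmark):
    # benchmark(...) repeatedly times sorted([3, 1, 2]) and
    # returns the result of the benchmarked call
    result = benchmark(sorted, [3, 1, 2])
    assert result == [1, 2, 3]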


In our example, let us test the timings of the Fibonacci sequence and compare the cases where caching is enabled or disabled. The following code will be in the source files:

from functools import lru_cache

def fibonacci(n):
    return n if n <= 1 else fibonacci(n - 1) + fibonacci(n - 2)

@lru_cache
def fibonacci_cache(n):
    return n if n <= 1 else fibonacci_cache(n - 1) + fibonacci_cache(n - 2)

To perform benchmarking, we will supply the target function (fibonacci and fibonacci_cache) and argument (in this case n=10).

For more control, you can also specify the number of iterations (per round) and the number of rounds to run using benchmark.pedantic! By default, both the number of iterations and the number of rounds is 1, which can be inconclusive or inconsistent. In the example below, we set both the number of iterations and the number of rounds to 5.

def test_fibonacci_10(benchmark):
    benchmark.pedantic(fibonacci, args=[10], iterations=5, rounds=5)

def test_fibonacci_cache_10(benchmark):
    benchmark.pedantic(fibonacci_cache, args=[10], iterations=5, rounds=5)
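
As a side note, pedantic mode is optional. The sketch below (with hypothetical test names, assuming the fibonacci functions above are importable) shows a plain benchmark(...) call, where pytest-benchmark calibrates the rounds and iterations itself, as well as a pedantic call passing the argument by keyword instead:

def test_fibonacci_cache_30(benchmark):
    # Plain call: rounds and iterations are calibrated automatically
    result = benchmark(fibonacci_cache, 30)
    assert result == 832040  # the 30th Fibonacci number

def test_fibonacci_kwargs(benchmark):
    # Pedantic mode also accepts keyword arguments via kwargs
    benchmark.pedantic(fibonacci, kwargs={"n": 10}, iterations=5, rounds=5)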

Running pytest-benchmark

Similar to pytest with added pytest-benchmark options

The command to run is similar to running regular pytest and can be called with pytest tests assuming your test files are stored within the tests directory.

Fig 1: Benchmark results – Image by author

The results are sorted in ascending order of minimum run time. For each column, the best timing is shown in green and the worst in red. By default, timings are in nanoseconds (1e-09 seconds), and the number in parentheses represents the multiple of that runtime relative to the fastest runtime.

For example, in Fig 1, looking at the minimum time column, running Fibonacci with caching takes around 50 nanoseconds, whereas without caching it takes around 10k nanoseconds, roughly 211x the cached runtime. This result is expected, as caching avoids recomputing the recursive calls.


There are more customizations that can be appended to the command, such as:

  • --benchmark-enable or --benchmark-disable: Run all tests, but enable or disable benchmarking of timings respectively (note: enable overrides the disable option if both are specified)
  • --benchmark-only or --benchmark-skip: Run only benchmark tests or only non-benchmark tests respectively
  • --benchmark-save=NAME or --benchmark-autosave: Saves benchmark data into a .benchmarks folder with or without name specification respectively, contains statistics of the tests
  • --benchmark-save-data: If benchmark data is saved, this saves additional information such as the timings of individual runs
  • --benchmark-sort=COL: Column to sort on, defaults to min
  • --benchmark-compare: If benchmark data is saved, this compares the previous run with the current run (refer to Fig 2)
  • --benchmark-histogram: Plots a box plot of the current run to visualize the distribution of runtime (refer to Fig 3)

Fig 2: Comparing benchmark results to previous runs – Image by author

Fig 3: Histogram of benchmark results – Image by author

Congratulations, you now have benchmarking up and running and can run your unit tests with timings to spot bottlenecks! You can also vary the input size to test whether the code scales.

Now, if only everyone diligently ran unit tests and benchmarks before committing their code, that would be the best-case scenario. In reality, that is often not the case. Therefore, the testing and benchmarking workflow should be:

  • Automated: Reject code automatically if it fails the tests
  • Actionable: Know which code to fix if any test fails
  • Intuitive: Test results should be understandable

Benchmarking is only useful when it is automated, actionable, and intuitive!

Incorporating the testing workflow into GitHub Actions allows the tests to run in an automated and actionable way. There are a few actions that implement this already.

The different GitHub Actions implement benchmarking in different manners. I find benchmark-action/github-action-benchmark fits my use case the most, which was to compare previous runs and reject pull requests if necessary (automated), attach the results of the run to the commits (actionable), and visualize historical runs (intuitive).

The final GitHub Actions workflow file I have is here, and I will highlight the relevant parts in the next sections, since the workflow also has other features, such as a code coverage report, which are outside the scope of this article.


Compare previous runs with GitHub Actions

Flow 1: Compare with the previous working code from the cache

jobs:
  codecov:
    ...
    steps:
      ...
      - name: Clear previous benchmark report
        if: ${{ github.event_name == 'push' }}
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          gh cache delete ${{ runner.os }}-benchmark --repo kayjan/bigtree
        continue-on-error: true
      - name: Download previous benchmark report
        uses: actions/cache@v4
        with:
          path: ./cache
          key: ${{ runner.os }}-benchmark
      - name: Store benchmark report
        uses: benchmark-action/github-action-benchmark@v1
        with:
          tool: "pytest"
          output-file-path: output.json
          external-data-json-path: ./cache/benchmark-data.json
          github-token: ${{ secrets.GITHUB_TOKEN }}
          comment-always: true
          fail-on-alert: true
          summary-always: true

In the first flow, for every pull request, I download the latest working benchmark results and compare the timings. By default, the alert threshold is 200%, meaning that an alert and a failure will be triggered if the timings exceed 2x the previous benchmark results. The benchmark timing results will also be included as a comment on the commit since comment-always is set to true.

Subsequently, for every pull request merged into the main branch, I delete the existing cache and save a new cache of the benchmark results for future comparison. By doing so, the cached benchmark results are constantly refreshed instead of being compared against a fixed benchmark result. A caveat is that GitHub clears the cache if it is not accessed for 7 days, and I have not found a workaround for this.


Visualize historical runs with GitHub Actions

Flow 2: Compare with all the previous working code from a persistent storage

jobs:
  codecov:
    ...
    steps:
      ...
      - name: Store benchmark report (gh-pages)
        uses: benchmark-action/github-action-benchmark@v1
        with:
          tool: "pytest"
          output-file-path: output.json
          gh-repository: "github.com/kayjan/bigtree-benchmark"
          github-token: ${{ secrets.BENCHMARK_TOKEN }}
          auto-push: true

In the second flow, every time the tests are run, I save the benchmark timing results to persistent storage. By default, the action saves them to the gh-pages branch for documentation purposes. For my use case, I needed the results to be saved to the master branch, which is not possible as it is a protected branch. As a result, I save the results to the gh-pages branch of another repository instead.

Fig 4: Line chart of benchmark timings across historical runs – Image by author

The benchmark results are incorporated into my documentation here, as shown in Fig 4, for which I modified the original index.html file. The original display would look something like this instead.

Ideally, the benchmark timings should be consistent, as runtimes should not deviate much over time; if major changes are made, the timings should reflect that accordingly.


Benchmarking serves to track code performance, in this case through runtimes. Benchmarking a single snapshot of the codebase can help spot bottlenecks and areas for improvement, while benchmarking the codebase over time can help spot performance degradation or improvement.

I hope you have learned more about implementing unit tests with benchmark timings using pytest-benchmark, and about ways to make them automated, actionable, and intuitive using GitHub Actions!

