Recursive Chemical Reactions

RDKit is an open-source cheminformatics toolkit written in C++ that can also be used in Java, Python, and KNIME. It provides a wide collection of cheminformatics functionality, such as reading and writing molecules, working with atoms, bonds and rings, generating 2D or 3D coordinates, searching for substructures, applying chemical transformations, and computing fingerprints and descriptors. RDKit also offers a high-performance database cartridge for PostgreSQL. RDKit understands SMARTS, the language for describing molecular patterns, and SMIRKS, the language for applying reaction transformations, both of which use as a basis SMILES, the famous line notation for entering and representing molecules and reactions.
RDKit can apply reaction transformations and combined with the recursion possibilities of Python it can support niche use cases such as the one described in this article. In particular, we will apply chemical reactions recursively to examine if an input molecular structure is a peptide, i.e. a linear sequence of amino acids. I have found that applying reactions recursively is a good way to algorithmically analyse chemical structures, e.g. by iteratively fragmenting the structure, cleaving off well defined fragments and categorising left-over moieties when no more transformations can be applied.
· Introduction · The building blocks: amino acids · Identifying the peptide bonds · Breaking the peptide bonds · Conclusions
Introduction
The image below is a linear oligopeptide with four amino acids, namely arginine, alanine, threonine and methionine and it was found in a recent dataset I was analysing. I work with industrial chemicals and I was surprised to see such structures in our database as we have nothing to do with biomolecules and peptides are not so commonly encountered. Digging further I discovered that we have erroneously generated a number of such structures by interpreting chemical names with chemical name to structure algorithms. In addition to their full name, amino acids can also be represented with one or three characters that sometimes occur in text and are wrongly interpreted as chemical structures. This data quality issue may affect publicly available datasets too and hence I thought that it may be useful to find a way to detect them and eliminate them if they occurred unintentionally.

There are many ways to do this and I imagine that all linear peptide sequences could be matched by a complex SMARTS query. Developing this seemed intimidating to me and hence I thought that it may be more viable to solve this problem by breaking it up in smaller ones. In Chemistry terms, this means that we have to break the initial molecule in smaller fragments by hydrolysing the peptide bonds one by one until there are no more bonds to break. If all fragments obtained are amino acids then the starting structure must be a peptide. Doing so also allows determining the exact amino acid sequence. What we need is a way to apply a chemical reaction recursively. If you are interested in learning how this can be done with RDKit and discovering some of the functionality of this rich cheminformatics library along the way, please keep reading.
The code for this article can be found in my GitHub blog repository. We will be using RDKit version 2022.9.4 with Python 3.9.13. The repository also includes all dependencies in the provided requirements file.
The building blocks: amino acids
There are 20 common amino acids that make up proteins. There are also two more additional amino acids that are in some species coded for codons that are usually interpreted as stop codons. All these are alpha-amino acids, i.e. the amino group is directly bonded to the α-carbon, i.e. the carbon atom next to the carboxyl group. For convenience I have created the SMILES for these 22 amino acids that can be found in the repository.
The first step is to read in the 22 amino acids, and at the same time import all necessary RDKit modules.
The smiles are converted to native RDKit Mol objects in the penultimate line. The last line may seem a bit cryptic, but all it does is to include the amino acid name within the molecule as can be verified with
print(Chem.MolToMolBlock(amino_acids['mol'].iloc[0]))
that prints

The amino acids can be visualised in a tailor made grid using matplotlib as shown below.
The last line saves the image below as a PNG file, that you can find in the repository together with all other images in this article.

RDKit can be seamlessly used within jupyter notebooks, where the Draw module allows the effortless visualisation of molecular structures in a grid using the Draw.MolsToGridImage()
function. Nevertheless, I find that using matplotlib allows more flexibility, especially if one follows these excellent suggestions on how to tweak the plot components. All of the amino acids share the following enantiomeric backbone.

A first useful RDKit functionality that we can introduce at this stage is the so-called R-group decomposition. In the code below we define the amino acid backbone core with the smiles [:1][C@H](N[:2])C(O)=O that has two explicit R group labels. The reason for using two R group labels is the presence of the pyrrolidine ring in L-Proline. By explicitly providing the settings of the R-group decomposition, we constrained it to match only the explicitly specified R groups.
The rest of the code creates the necessary input arrays with molecules and legends to generate the image using the same utility function as earlier. If you look carefully you will notice that glycine failed to be decomposed into R-groups. The reason is that it is not chiral, whilst the core structure used for the decomposition is. The decomposition of glycine would have succeeded if we had removed the chiral centre from the core, but then the R-group decomposition would have been less specific that may be undesired.

R-group decomposition is useful in case the amino acid backbones were needed for further processing. We will not pursue this further in this article.
Identifying the peptide bonds
Before we attempt to break the peptide bonds we can check whether we can identify them. We use an example linear tri-peptide formed by threonine, methionine and arginine. The peptide bond is defined as a structural pattern that is then used to locate the matching atoms and bonds in the original molecule.
that produces the image below.

Using the rdkit.Chem.Draw.rdMolDraw2D
the peptide bonds were nicely orientated. The substructure search returned a tuple of tuples with atomic indices that was flattened and used to find all bond indices between all atom pairs in the flattened list. The indices of the atoms and bonds in the peptide chain were highlighted using a light grey color, before the molecular structure was saved as the PNG image. The key message here is that by using RDKit one can have control down to the level of atoms and bonds that essentially form a graph amenable to any imaginable algorithm.
Breaking the peptide bonds
Breaking the peptide bond requires the definition of a reaction using SMIRKS. The previously defined peptide pattern used in the substructure matching becomes the reactants part of the SMIRKS reaction. The products are nothing more than the hydrolysed product with the two amino acids separated using the dot notation. The threonine, arginine, methionine tri-peptide has two peptide bonds and hence the reaction can be applied twice leading to two reaction outcomes, both of which comprise an amino acid and a di-peptide.
The code above produces the following PNG image with the two reaction possibilities in the two rows.

If we want to break down fully a peptide into its constituent amino acids we will need to apply the reaction repeatedly, until no more peptide bonds can be broken. It is not necessary to enumerate all possible reactions at each stage. We can simply apply the reaction once, take the two reactants and hydrolyse them separately. RDKit allows controlling the number of products by setting the maxProducts
argument to one. What makes the starting structure a peptide? If after applying all possible peptide hydrolysis reactions we have only produced one or more of the known amino acids then the starting structure is a peptide. Conversely, if at some point no peptide hydrolysis reaction can be applied and the structure is not one of the known amino acids, then the starting structure is not a peptide.
The code above uses two utility functions, one to examine if two structures are chemically equivalent by being a substructure of each other, and another to examine if a structure can be found in a list of structures. The recursive hydrolysis of peptide bonds is implemented in the third and last function.
Using this recursive function we examine if a set of nine example molecules are peptides.
The algorithm correctly classifies the first 8 structures as peptides and the last two as non-peptides. Notably, we used the convention that the amino acids themselves are a peptide, that strictly speaking may not be true, but this was convenient for applying the recursion.

The algorithm could be enhanced by adding the reactant results to a graph, using for example NetworkX and visualising the reaction progress by drawing the structure at each node. The leaves would then be the amino acids that can be analysed further to obtain the exact sequence of amino acids in the peptide. The possibilities are endless; RDKit has done its part and one can then rely on the expressiveness of Python for the rest.
Conclusions
RDKit is a rich cheminformatics library. It can nowadays be easily deployed using pip and opens up the possibility of using Python and its data analysis and Data Science ecosystem in chemistry applications. RDKit's documentation is admittedly not the best, but there are now many tutorials and blogs available. The library keeps evolving and new functionality is being added. I hope this article is useful for illustrating some of RDKit's functionality and demonstrating its potential.