Definition
The simplest kind of Mixfile represents a mixture that is essentially a single component with a purity value, as shown in Fig. 1. The singular component is described by three pieces of information: the structure of the butene derivative, its name, and the concentration which is given as ≥ 97%. This representation only requires a single component because the impurities are unknown, and thus unspecified. This simple example represents a use case that is incredibly common, especially within reagent catalogs.
Another very common use case is when the active ingredient is provided as a solution, as shown in Fig. 2. In this case, the hierarchical nature of the Mixfile format is invoked. The root node is blank, although it can be used to store secondary metadata about the mixture overall. It contains two components: the active ingredient and the solvent. Both of them are represented by name and structure. The active ingredient, triethylaluminium, is indicated to be 2 molar. The concentration of the solvent, toluene, is left blank, which by convention means that it makes up the remainder of the mixture. While it would be valid to calculate the molarity of the solvent and include this information, it is superfluous, and for convenience and representational clarity, is better left out.
Mixfile hierarchies have no limit to their depth or height, and use of nesting is a convenient way to express mixtures-of-mixtures. For example, consider n-butyl lithium dissolved in the solvent that is colloquially referred to as hexanes, shown in Fig. 3.
This particular choice of hierarchical description clearly indicates that the substance being described is a mixture of two distinct things: the reagent and the solvent. The solvent occupies one container node, which is described using the name hexanes. Because it is itself a mixture, it does not have a structure, and it is also not given a concentration (since it is implied to constitute everything other than the reagent). The hexanes component has four sub-components assigned to it, which represent the major C6 isomers that make up the solvent. If the relative proportions of the isomers were known, they could be expressed as concentrations (e.g. as a ratio, or volume/mass/molar percentages), but in this case, the proportion is not provided by the manufacturer. As such it illustrates that the Mixfile format is comfortable with incomplete data, which is important since it would be incorrect to insist on providing information which is not available.
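The nesting described above can be pictured as a small tree. The following Python sketch is illustrative only: the dictionary keys (`name`, `quantity`, `units`, `contents`) are assumptions chosen for readability rather than a statement of the actual Mixfile schema, and the four isomer names are likewise assumed, since the text does not list them.

```python
# Hypothetical, simplified stand-in for a Mixfile hierarchy (keys are assumed).
butyllithium_in_hexanes = {
    "contents": [
        {"name": "n-butyllithium"},  # reagent; no concentration stated in the text
        {
            "name": "hexanes",  # container node: no structure, no concentration
            "contents": [
                # Isomer names are assumed for illustration; no proportions given.
                {"name": "n-hexane"},
                {"name": "2-methylpentane"},
                {"name": "3-methylpentane"},
                {"name": "2,3-dimethylbutane"},
            ],
        },
    ],
}


def walk(component, depth=0):
    """Yield (depth, name, concentration text) for every node, root first."""
    quantity = component.get("quantity")
    if quantity is None:
        concentration = "unspecified"
    else:
        concentration = f"{quantity} {component.get('units', '')}".strip()
    yield depth, component.get("name", "(unnamed)"), concentration
    for child in component.get("contents", []):
        yield from walk(child, depth + 1)


# Print an indented outline of the mixture.
for depth, name, concentration in walk(butyllithium_in_hexanes):
    print("  " * depth + f"{name}: {concentration}")
```

Because missing concentrations are simply reported as unspecified, the same traversal works whether or not the manufacturer supplied the isomer proportions.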
One very practical reason for taking the effort to describe substances such as organolithium reagents is that the safety and hazards vary based on composition. Consider the related and much more dangerous tertiary butyl lithium reagent, which is shown in Fig. 4.
Knowledge of the active ingredient alone (t-butyllithium) is sufficient to ascertain that this material is pyrophoric, since it has this characteristic in all of its forms. For n-butyllithium, however, solutions are pyrophoric only at higher concentrations (ca. 10 mol/L and above) [5]. Therefore being able to keep track of the active ingredient and its concentration is essential for being able to provide appropriate safety, handling and disposal advice. In the case of these two organolithium reagents, the solvent composition is also important, e.g. t-butyllithium is commonly sold as either pentane or heptane solutions, and these solvents have drastically different volatility, which is a very important detail for a mixture that bursts into flame on contact with air. Any hazard database would be incomplete (and possibly dangerous by omission) without the ability to store and match all of these facts.
Another important consideration with highly reactive reagents like organolithium solutions is that they decay over time and need to be titrated [6, 7] to redetermine the concentration. This means that it is not sufficient to mark a sample with a reference to the properties that it had at the time of purchase; rather, it needs to be recorded with a data structure that can capture the changing concentration, ideally in a way that is directly useful (e.g. combined with reaction planning software to calculate the volume required for stoichiometric use).
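As a minimal sketch of such a record, the following Python class keeps a titration history alongside the nominal concentration and uses the most recent value to work out a dispensing volume. The class name, its fields and the numerical values are all hypothetical, and are not part of the Mixfile format or its tools.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ReagentSample:
    """Hypothetical inventory record for a reagent solution that decays."""
    name: str
    nominal_molarity: float  # mol/L, as labelled at the time of purchase
    titrations: list = field(default_factory=list)  # (date, mol/L) pairs

    def record_titration(self, when, molarity):
        self.titrations.append((when, molarity))

    def current_molarity(self):
        # Fall back to the label value until the sample has been titrated.
        if not self.titrations:
            return self.nominal_molarity
        return max(self.titrations, key=lambda entry: entry[0])[1]

    def volume_ml_for(self, millimoles):
        # mmol divided by mol/L gives mL.
        return millimoles / self.current_molarity()


# Illustrative numbers only: a bottle labelled 2.5 M that titrates at 2.0 M.
sample = ReagentSample("n-butyllithium in hexanes", nominal_molarity=2.5)
print(sample.volume_ml_for(5.0))   # volume based on the label
sample.record_titration(date(2019, 6, 1), 2.0)
print(sample.volume_ml_for(5.0))   # volume based on the latest titration
```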
The component hierarchy can also be used to represent mixtures of isomers, which is a common use case for the outcomes of reactions that are not followed by an effective purification step, e.g. the result of Markovnikov addition [8] of bromine, shown in Fig. 5.
While some kinds of isomers can be effectively represented within the structure of a single component (such as racemic stereoisomers), enumeration is often preferable even when there are alternatives. Enumeration has some advantages over more concise encoding options, e.g. the visualisation is very clear, assignment of relative concentrations is straightforward, and the implementation is simple.
Recording information about the properties of mixtures is important for a great many reasons, not least of which is safety. For example, consider two commercially available forms of osmium tetroxide, shown in Fig. 6. The Mixfile represented in (a) is the solid form, which is mostly pure, while (b) is the same active ingredient as a dilute solution in water. Both of these materials are extremely toxic, but the instructions for storing, handling and disposing of them are quite different. Without a well defined machine readable format for drawing the distinction between the raw solid and the dilute solution, locating the right material safety datasheet would depend on the knowledge and experience of the scientist performing the lookup. Another striking example is sodium azide, which is extremely toxic in its pure solid form [9] but when dissolved in water at concentrations lower than 0.1% it is considered benign enough to use as a food preservative [10].
Mixture descriptions are also relevant outside of the chemical laboratory, since there are innumerable consumer products that could benefit from descriptions with detailed metadata, such as those shown in Fig. 7. Example (a) describes a common brand of toothpaste, while (b) is a tablet formulation for eletriptan [11]. Both of these household products have common characteristics in terms of how the mixtures are defined: each of them has an active ingredient (sodium fluoride and eletriptan hydrobromide respectively) and a host of inactive ingredients. The active ingredients are usually the focus of these consumer products, but the additional materials are very important: they typically impart characteristics that affect stability, texture, flavour and efficacy. They are also common sources of concern regarding toxicity and unwanted side effects, and so compiling accurate, complete and machine readable data for all of the constituents is important, not least because it would then be possible to quickly identify all consumer products containing a particular component whenever health concerns arise. From the R&D perspective, drug formulation is an empirical process: the exact composition and amount of each excipient is an essential characteristic of a drug tablet, and so accurately recording all experimentally determined formulations and pairing them with their measured efficacy is an essential part of product design.
For consumer products even more often than laboratory reagents, some portion of the constituents may not be readily represented by one-or-several distinct chemical structures. Recognition of this limitation is a key design consideration of the Mixfile: in these cases, whatever metadata is available should be provided. There is typically an available name of some form, and sometimes references to external databases that contain information about mixtures, e.g. the Chemical Abstracts Registry Number (CASRN) [12] is often used. These references are not inherently machine readable, and so must be thought of as a placeholder: facilitating a non-automated fallback is preferable to omitting the information entirely, and part of the future work for this project is to expand the ability to describe more complicated structure fragments, like polymers.
Software
In order to make use of the Mixfile format, we have created a straightforward editor that can be used to define mixtures. Figure 8 shows several panels: the main editor window (a) represents the hierarchical outline of the mixture. The components that make up this tree can be added, deleted, moved, edited, etc., using conventional menu, mouse and keyboard shortcuts. Editing an individual component brings up one of two dialogs: one for general details (b) and another for sketching the structure (c).
The mixture editor has the ability to invoke the calculation of InChI strings for any of the constituent structures, which is done via the standard command line tool (which is installed separately [13]). As described subsequently, it also has the ability to create the correspondingly derived MInChI notation for the mixture.
As the Mixfile project evolves, the editor will be improved incrementally, and the latest developments will continue to be made available as open source software. One example of an additional utility feature is the ability to look up structures by name in an external database, shown in Fig. 9. This is a convenient way to fetch structures for which the name is known, so as to avoid having to draw or locate-and-paste the corresponding sketch. At the time of submission, only PubChem is supported, though this could easily be extended to support other databases.
While the best case scenario for generation of machine-readable metadata is to have it created directly by the originating scientist in a format that can express all of the details, the fact is that almost all of the existing mixture information is expressed as text. These text descriptions are usually quite understandable to humans, although on occasion the chosen syntax can be ambiguous, even to an expert. Many of these text descriptions occur within long form paragraphs (e.g. literature publications), but they are quite often abstracted out with a clearly defined beginning and ending: this is observed frequently in online vendor catalogs (e.g. Sigma-Aldrich [14], ThermoFisher [15], Alfa Aesar [16] and many others) and within bespoke chemical inventory systems.
It is possible to compose a set of rules that can interpret a large proportion of mixtures from such a dataset. Consider a simple example such as “1-Aza-12-crown-4 ≥ 97.0%”, which describes a single known compound that makes up the majority of the material, and by implication, some number of unknowns that make up the remainder. A parsing operation can be graphically depicted, as shown in Fig. 10. The first rule ascertains that 1-Aza-12-crown-4 is the name of a chemical entity which can be mapped to a structure definition. The second rule determines that ≥ 97.0% is a quantity definition which provides relation, value and units.
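The two rules can be approximated with a single regular expression. The Python sketch below is an assumption about how such a rule might look, not the extraction method used in this work: it only recognises a trailing percentage with an optional relation symbol, and it leaves the name as text rather than converting it to a structure.

```python
import re

# Trailing quantity: optional relation, a number, and a percent sign.
QUANTITY = re.compile(
    r"\s*(?P<relation>[≥≤<>~=]?)\s*(?P<value>\d+(?:\.\d+)?)\s*(?P<units>%)\s*$"
)


def parse_purity(text):
    """Split 'name relation value%' into its parts, or return None."""
    match = QUANTITY.search(text)
    if match is None:
        return None
    return {
        "name": text[:match.start()].strip(),  # everything before the quantity
        "relation": match.group("relation") or "=",
        "value": float(match.group("value")),
        "units": match.group("units"),
    }


print(parse_purity("1-Aza-12-crown-4 ≥ 97.0%"))
```

Requiring the percent sign directly after the number is what stops the digits inside the chemical name from being mistaken for the quantity.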
Mixfiles for which multiple components are defined explicitly require more parsing steps. The most common laboratory examples are reagent-in-solvent pairs, expressed using text such as “Trimethyl(trifluoromethyl)silane solution 2 M in THF”, shown graphically in Fig. 11. In this case the parsing rules need to find the boundary point between the two components, and recursively analyze each of them. An overall rule of {solute definition} in {solvent definition} applies to this example, although care is needed to make sure that the occurrence of the very short keyword in is handled correctly.
Once the boundary is defined, the parsing continues: the solvent is defined as THF, which is a well established abbreviation for tetrahydrofuran. The active ingredient requires several more steps: the suffix of 2 M is taken to be a quantity definition. The capital letter M in this context is shorthand for molar, so the concentration is interpreted as 2 mol/L. Once the quantity information is processed and removed, the remaining text needs to be further truncated: the use of the word solution is superfluous, and requires a deletion rule. Once this is done, the remaining text—trimethyl(trifluoromethyl)silane—is a legitimate chemical name that can be parsed and converted into a structure.
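The sequence of steps above can be sketched in Python as follows. This is a simplified illustration under several assumptions: the keyword `in` is only accepted when surrounded by spaces, only molar quantities written as `M` are recognised, the abbreviation table has a single entry, and the names are returned as text rather than converted to structures. None of the function or variable names come from the project's own code.

```python
import re

ABBREVIATIONS = {"THF": "tetrahydrofuran"}  # tiny illustrative lookup table
MOLAR = re.compile(r"\s+(?P<value>\d+(?:\.\d+)?)\s*M$")


def parse_solution(text):
    """Parse '{solute} [solution] {n} M in {solvent}'; return None on failure."""
    # Split on the last ' in ' so that 'in' inside a word is never a boundary.
    solute_text, separator, solvent_text = text.rpartition(" in ")
    if not separator:
        return None
    solute = {"name": solute_text.strip()}
    quantity = MOLAR.search(solute["name"])
    if quantity:
        solute["quantity"] = float(quantity.group("value"))
        solute["units"] = "mol/L"  # capital M is shorthand for molar
        solute["name"] = solute["name"][:quantity.start()]
    # Deletion rule: the trailing word 'solution' carries no information.
    solute["name"] = re.sub(r"\s+solution$", "", solute["name"], flags=re.IGNORECASE)
    solvent_name = solvent_text.strip()
    solvent = {"name": ABBREVIATIONS.get(solvent_name, solvent_name)}
    return {"contents": [solute, solvent]}


print(parse_solution("Trimethyl(trifluoromethyl)silane solution 2 M in THF"))
```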
These two case studies are representative of a large number of common text mixture descriptions for laboratory reagents. In the Methods section we describe a brief summary of our ongoing work toward text extraction of mixtures, and the availability of data that we have generated thus far. A collection of several thousand mixture examples is also included within the open source GitHub project, all of which have been generated using our proof of concept text extraction method, some of which are shown in Fig. 12.
The text-to-structure recognition that makes up a key part of the extraction process can be done using one of several available algorithms. For practical purposes it is necessary to combine this functionality with a lookup table, since it is very safe to assume that no algorithm will correctly interpret all of the important structures in any sizeable collection. Furthermore, there are cases where a name is correlated to a sub-mixture (e.g. the ever common hexanes and xylenes), and these can be handled by providing the lookup table with the ability to insert a mixture branch.
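One way to picture the combination of a lookup table and a name-to-structure algorithm is sketched below in Python. The table contents, the dictionary keys and the `name_to_structure` callback are all hypothetical placeholders: the callback stands in for whichever algorithm is available, and xylenes is shown as its three isomers purely as an example of a name that expands into a mixture branch.

```python
import copy

# Hypothetical lookup table: an entry is either a single structure or a branch.
LOOKUP = {
    "thf": {"name": "tetrahydrofuran", "smiles": "C1CCOC1"},
    "xylenes": {
        "name": "xylenes",
        "contents": [
            {"name": "o-xylene", "smiles": "Cc1ccccc1C"},
            {"name": "m-xylene", "smiles": "Cc1cccc(C)c1"},
            {"name": "p-xylene", "smiles": "Cc1ccc(C)cc1"},
        ],
    },
}


def resolve(name, name_to_structure=None):
    """Return a component (possibly a whole branch) for a text name."""
    entry = LOOKUP.get(name.strip().lower())
    if entry is not None:
        return copy.deepcopy(entry)  # table takes priority over the algorithm
    if name_to_structure is not None:
        smiles = name_to_structure(name)  # assumed to return a SMILES or None
        if smiles:
            return {"name": name, "smiles": smiles}
    return {"name": name}  # keep the name even when no structure is found


print(resolve("Xylenes"))
```

Returning a name-only component in the last case mirrors the principle that incomplete information is still worth recording.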
Mixtures InChI
The Mixfile format that we describe in this article is suitable for use as a reference container, which is appropriate for detailed archiving purposes. It can be easily rendered to create a print quality visual representation, and it can be extended to store any kind of additional metadata beyond the baseline specification. The development of this format and its associated tools have been heavily influenced by our collaboration with IUPAC, and their proposed Mixtures InChI notation, abbreviated as MInChI. By design, the Mixfile container representation can be used as the source material to generate a MInChI string, which involves extracting fundamental information about components, and imparting to them the canonical standardisation and layer motif that comes from using InChI as the structure identifier.
As can be seen in Fig. 13a, for a simple mixture such as this example, in which caffeine is listed with a specific purity, the corresponding MInChI string is dominated by the structure identifier from the standard InChI generator. The string is prefixed by the signifier that identifies it as conformant to the MInChI specification, and followed by two additional layers: the hierarchy (which in this case is a singleton), and the concentration, which is encoded in a concise mnemonic form.
Example (b) contains two components, which are listed in the structure section. The hierarchy block indicates a mixture with a flat hierarchy. In the MInChI string, the component layer is sorted alphabetically by the InChI strings (which coincidentally happens to be the same order as was given in the source Mixfile). The concentration block has one section for each component, but the second entry is blank, since the concentration is not indicated (i.e. presumed to make up the remainder of the mixture).
Example (c) is somewhat more exotic, as a mixture with multiple sets of components with 3 levels of hierarchy. Additionally, 3 of the component nodes have no structure specified. In this case the branch ordering differs from that used in the Mixfile. The hierarchy indexing portion of the MInChI string denotes the shape of the tree using curly braces. Three of the nodes have specified concentrations: the lithium diisopropyl amide ingredient has an overall molarity, and the THF/hexanes constituents are expressed as proportions, which apply specifically to the portion of the hierarchy (i.e. the actual definition of hexanes in this example is enumerated explicitly by its structures, and their approximate concentrations relative to each other are defined within their own branch).
While both Mixfiles and MInChI strings are used for the same kinds of data, they serve distinct roles within the overall cheminformatics infrastructure. The MInChI notation has some key benefits relative to the originating Mixfile:
- it is concise, limited to a single line made up of ASCII characters, which can be easily manipulated in a spreadsheet or pasted into a single input line on a web form
- it enables easy reference for similarity comparison: two mixtures with the same constituents will be identical up to the indexing and concentration sections
- testing for the presence of a structure within a mixture is extremely easy (e.g. whether the query InChI identifier is contained within the MInChI string)
- similarly, structures can be separated out and indexed individually by their InChI codes
- relatively sophisticated comparisons of composition and concentration can be made using simple string manipulation, without the need for a dedicated cheminformatics library
These characteristics are all relevant for implementation in a database, where user search queries and indexing operations can be carried out using built in operators or simple scripting languages, which do not always have convenient cheminformatics libraries readily available. Providing the ability to search for a single structure within any mixture becomes very simple (any implementation of string indexOf will suffice, as long as the query structure can be converted into an InChI identifier).
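The substring test can be sketched in a few lines of Python. The MInChI string used here is a schematic stand-in written for illustration: its hierarchy and concentration layers are assumptions and should not be read as the exact syntax of the specification. Only the idea that the component InChI bodies appear verbatim inside the string is taken from the text.

```python
INCHI_PREFIX = "InChI=1S/"


def contains_structure(minchi, inchi):
    """Return True if the body of a standard InChI occurs in a MInChI string."""
    # Drop the 'InChI=1S/' header, which is not repeated per component.
    body = inchi[len(INCHI_PREFIX):] if inchi.startswith(INCHI_PREFIX) else inchi
    return body in minchi


# Schematic example only; the trailing layers are assumed, not authoritative.
example = "MInChI=0.00.1S/C2H7N/c1-3-2/h3H,1-2H3&H2O/h1H2/n{1&2}/g{90gL&}"

print(contains_structure(example, "InChI=1S/H2O/h1H2"))            # water
print(contains_structure(example, "InChI=1S/CH4O/c1-2/h2H,1H3"))   # methanol
```

A plain substring match can also succeed when the query is only the leading part of a longer component (for instance when a stereochemistry layer is missing from the query), so a stricter implementation would compare whole components.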
Performing comparisons between mixtures can be achieved with some relatively straightforward logic. Consider a scenario where a database is being searched for mixtures that are similar to the query shown in Fig. 14(a), with (b) as a potential candidate. Both of these mixtures represent dimethylamine at an analogous concentration, dissolved in two different solvents. Comparison of the two MInChI strings can quickly establish that each mixture has two components, and that they share one in common. The common structure, which is the active ingredient (with an InChI fragment of C2H7N/c1-3-2/h3H,1-2H3), is given a concentration on both sides: for (a) it is specifically 90 g/L, whereas for (b) it is between 1.9 and 2.1 mol/L. Because the InChI identifier fragment begins with the molecular formula, it is straightforward to calculate the molecular weight (using a very simple lookup table for the elements, and a very short block of code). This can be used to ascertain that 90 g/L is approximately 2 mol/L, and so both of these mixtures have a common ingredient with a common concentration, with a different solvent.
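The unit conversion can be sketched in Python as below. This is a minimal illustration, assuming a plain formula layer with no charges, multipliers or dot-separated parts; the element table is deliberately truncated to four entries, and the function names are hypothetical.

```python
import re

# Truncated table of standard atomic masses (g/mol); extend as needed.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def formula_mass(formula):
    """Sum atomic masses for a simple formula such as 'C2H7N'."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total


def grams_per_litre_to_molar(g_per_l, inchi_fragment):
    # The formula is the first '/'-delimited layer of the InChI fragment.
    formula = inchi_fragment.split("/", 1)[0]
    return g_per_l / formula_mass(formula)


molar = grams_per_litre_to_molar(90.0, "C2H7N/c1-3-2/h3H,1-2H3")
print(round(molar, 2), 1.9 <= molar <= 2.1)
```

With these masses, dimethylamine comes to about 45.09 g/mol, so 90 g/L falls inside the 1.9 to 2.1 mol/L range given for the candidate mixture.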
As with the standalone structure identifier (InChI), there are usually compelling reasons to retain the more detailed source information, e.g. consider the MInChI string as a composition notation that is regenerated from a Mixfile, because it is not intended to be the primary record of data. The MInChI generation process abstracts the structure identifier, concentration and proportional relationships of the components as stated in the original description. Both the MInChI string and its constituent InChI identifiers are only reversible in a partial sense: converting forward (i.e. Mixfile to MInChI, or Molfile to InChI) reduces the degrees of freedom in order to improve its utility for specific purposes. Any given MInChI or InChI string can correspond to numerous different-but-equivalent expressions of a mixture or structure, but reversing the transformation generally does not rederive the original input. In the case of the InChI identifier, this is easy to observe, since InChI does not preserve the coordinates of the input molecules, so the reverse process must recreate them algorithmically. Other modifications, like picking a canonical tautomer, normalising stereocentres and disconnecting bonds to metals further reduce the correlation to the original input structure. In addition, for the Mixfile to MInChI transformation, properties such as structure names, auxiliary identifiers, etc., are not stored in the MInChI notation. It may sometimes be possible to rederive these, but there is no guarantee that they will be the same as the original.
This unidirectional reduction of information is key to the practical value of InChI and all of its derivatives: being able to treat a string as a unique and literal definition for a chemical entity makes a great many complex and resource intensive cheminformatics tasks almost trivially simple. The MInChI notation leverages these fundamental InChI properties. The caveat is that an archiving system is advised to also store data in its original form, prior to any processing, which is a familiar maxim of science (i.e. never throw out the original laboratory notebook).
At the time of writing, the MInChI specification is nearing Phase 1 completion, and is expected to be formally released later in 2019. Updates will be posted on the IUPAC project page [17]. If you are interested in implementing the MInChI notation in your local systems, please contact the authors.