While the workflow we describe here processes TSA data, the concepts and processes can be adapted to accommodate data from virtually any screening technique. Our workflow takes as inputs thermal shift assay data, a plate map containing chemical structures, and fitting parameters, and outputs melting temperature information for each compound. It has been designed as a sequence of 10 steps which guide the user through the data analysis process and can be repeated sequentially for multiple plates (Fig. 2). Each step is color-coded in keeping with the KNIME node color scheme: orange is reserved for data input, yellow for data transformation, blue for exploration/visualization, and red for data output. This work was developed using KNIME version 4.4.0 and uses nodes from the KNIME Analytics Platform, open-source KNIME Extensions, and Community Extensions by RDKit, R, and the High-Throughput Technology Development Studio (TDS) of the Max Planck Institute of Molecular Cell Biology and Genetics (MPI-CBG). The RDKit extension is a collection of nodes that provides basic chemistry and cheminformatics functionality. The Interactive R Statistics Integration comprises nodes that interact with an external R™ installation [26] to write and execute R scripts. The high-content screening nodes from the MPI-CBG offer a powerful solution for mining screening data that is easy to use, yet flexible enough to cope with various types of assays. These extensions need to be downloaded only once, prior to first use. The KNIME workflow is freely available for download at https://github.com/loicsamuel/knime-tsa-analysis. Instructions explaining how to install and use the workflow are also provided at the same link and in Additional file 1.
Step 0: R server initialization
The first step of the workflow is to start the R server by running R and executing the Rserve start-up command. A typical invocation, assuming the Rserve package is already installed, is:
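```r
# Start the Rserve server so that KNIME can communicate with the local R installation
# (assumes the Rserve package has been installed, e.g. install.packages("Rserve")).
library(Rserve)
Rserve(args = "--vanilla")
```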
Once the command has been executed, the KNIME platform can communicate with the R server to execute nodes from the Interactive R Statistics Integration. Non-linear curve fitting of the melting curves is carried out using the Dose Response (R) node, which uses the ‘drc’ R package [27] as a back end and ‘ggplot2’ for plotting the fit output. If this workflow is used repeatedly, step 0 is not required on subsequent runs.
Step 1: input data
The components in Step 1 reformat the raw output data from qPCR instruments into a table containing columns for well position, temperature, and fluorescence. These three fields are the key requirement for workflow compatibility with a text file from any qPCR instrument, so the user is asked to identify these columns in the second component of this step. The data is then split into two outputs. The first is pivoted into a single temperature column plus one column of fluorescence values per well at each temperature point and transferred to step 2 for plotting, while the second output is passed unchanged to step 3 for further processing (Fig. 3). We have successfully imported and processed the data file output from a Roche LightCycler® 480 II instrument, but the step 1 component can be modified to accept and reformat any common file output. For example, we have also successfully imported a pre-processed Microsoft® Excel output from the Applied Biosystems QuantStudio 7 Flex Real-Time PCR System.
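The reshaping performed by the first output can be illustrated with a minimal R sketch, assuming the raw export has already been read into a data frame with Well, Temperature, and Fluorescence columns (these names are placeholders; the actual columns are defined by the user in the step 1 component):

```r
# Minimal sketch of the step 1 reshaping on a toy data frame (column names are placeholders).
library(tidyr)

raw <- data.frame(
  Well         = rep(c("A1", "A2"), each = 3),
  Temperature  = rep(c(25, 26, 27), times = 2),
  Fluorescence = c(1.2, 1.3, 1.5, 0.9, 1.0, 1.4)
)

# Produces one Temperature column plus one fluorescence column per well,
# i.e. the table shape passed to step 2 for plotting.
wide <- pivot_wider(raw, id_cols = Temperature,
                    names_from = Well, values_from = Fluorescence)
```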
Step 2: examine melting curves
The user is encouraged to examine the melting curves to confirm that the majority are well-behaved, ideally with a flat initial pretransition region followed by a steep sigmoidal unfolding region, and finally an aggregation region, as shown in Fig. 1. The user should decide on a threshold for excluding wells which do not have sufficient fluorescence signal for reliable fitting and will therefore not yield sensible results. This step is also critical in the selection of an appropriate initial fitting window, as restricting the fitting algorithm to the transition region improves workflow performance.
Step 3: experiment details
In step 3, the user is prompted to provide a plate map, in the form of an SD file, containing all library compounds and their locations (chemical structure, compound ID, plate ID, row, column, etc.), as well as the initial fitting window and a threshold for excluding wells which lack a suitable fluorescence signal. As an alternative to the SD format, a workflow is provided which accommodates a plate map of compound SMILES in CSV format. Example data, along with plate maps and KNIME workflows, are provided as Additional files 2, 3, 4, 5, 6 and 7. The plate number, row, and column data contained in the plate map are used to associate the chemical structures and compound IDs with the temperature and fluorescence data. The joined data is then passed on to subsequent nodes which remove data outside of the fitting window and plot the resulting curves.
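The annotation step amounts to a join on plate, row, and column. A minimal sketch with toy data and hypothetical column names (the real columns come from the SD or SMILES CSV plate map):

```r
# Toy plate map and well data; column names are hypothetical.
plate_map <- data.frame(PlateID = 1, Row = "A", Column = c(1, 2),
                        CompoundID = c("CPD-001", "DMSO"))

wells <- data.frame(PlateID = 1, Row = "A", Column = c(1, 1, 2, 2),
                    Temperature  = c(25, 26, 25, 26),
                    Fluorescence = c(1.2, 1.3, 0.9, 1.0))

# Associate compound identities with the temperature/fluorescence data
annotated <- merge(wells, plate_map, by = c("PlateID", "Row", "Column"))

# Keep only data points inside the user-defined fitting window (example limits)
fit_window <- subset(annotated, Temperature >= 25 & Temperature <= 95)
```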
Step 4: flag problematic wells
First, the range of fluorescence values in each well is calculated and compared to the threshold specified in step 3. Wells that do not meet the fluorescence signal threshold are flagged and separated for plotting and review in step 5. The user should examine these melting curves to determine whether they were correctly flagged. The fluorescence range threshold should be adjusted until the filter captures, as much as possible, only the low-signal curves. Melting curves that meet or exceed the threshold are then analyzed for the presence of a downward slope or high initial fluorescence, i.e., whether the global maximum fluorescence occurs at a lower temperature than the global minimum fluorescence. After this, the curves are analyzed for the presence of gaps, jumps, and spikes. By default, these are defined as regions where Δf (the change in fluorescence between two successive data points) is more than 25 times the mean Δf for that well. This criterion can be modified to suit the data being processed; noisy data would require a higher threshold. We suggest starting with a lower value and then increasing it if necessary.
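A minimal sketch of these per-well checks is shown below, assuming the data for one well are available as numeric vectors ordered by increasing temperature; the function name, the example signal threshold, and the use of absolute changes for Δf are illustrative choices rather than part of the workflow itself:

```r
# Sketch of the step 4 quality checks for a single well (illustrative thresholds).
flag_well <- function(temperature, fluorescence,
                      range_threshold = 200,  # user-defined low-signal threshold (example value)
                      spike_factor    = 25) { # default spike criterion from the workflow
  # Low signal: total fluorescence range below the user-defined threshold
  low_signal <- diff(range(fluorescence)) < range_threshold

  # Downward slope / high initial fluorescence: the global maximum occurs at a
  # lower temperature than the global minimum
  downward <- temperature[which.max(fluorescence)] < temperature[which.min(fluorescence)]

  # Gaps, jumps and spikes: any step between successive points larger than
  # spike_factor times the mean absolute step for this well
  delta_f <- abs(diff(fluorescence))
  spike   <- any(delta_f > spike_factor * mean(delta_f))

  c(low_signal = low_signal, downward = downward, spike = spike)
}
```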
Step 5: review pre-processing results
Here, the user can review the processed melting curves to confirm that their shapes are sigmoidal, with well-behaved minima and maxima and a transition region that lies completely inside the fitting window. Importantly, the user is also advised to confirm whether the flagged curves are indeed problematic, and to adjust the thresholds and the fitting window of step 3 to exclude anomalies while capturing as much real data as possible. Figure 4 illustrates several types of melting curves which are flagged by this component. The well selection feature in Step 1 allows problematic individual wells to be isolated if desired, and a Tm can be calculated for these by adjusting the initial window for a more customized fit.
Step 6: non-linear regression
Inside this component is a loop which, for each well:

1. Applies a 21-point centered simple moving average to smooth the fluorescence values
2. Normalizes the fluorescence values between 0 and 1
3. Determines the temperature at which the minimum fluorescence value occurs
4. Discards fluorescence data occurring at lower temperatures than the value determined in (3)
5. Determines the temperature at which the maximum fluorescence value occurs
6. Discards fluorescence data occurring at higher temperatures than the value determined in (5)
7. Fits the fluorescence data from each well to a 5-parameter log-logistic model (a sketch of this fit using the ‘drc’ package follows this list)
$$f\left(x\right)=c+\frac{d-c}{\left[1+\exp\left(b\left(\ln\left(x\right)-\ln\left(e\right)\right)\right)\right]^{f}}$$
where b = slope, c = minimum fluorescence value, d = maximum fluorescence value, e = inflection point, and f = asymmetry factor, i.e. the difference in the rates of approach from the inflection point to the lower and the upper asymptotes (Fig. 5). If the melting curve is symmetrical, then f = 1 and the equation reduces to the 4-parameter log-logistic model.
8. Calculates the midpoint Tm. Because the inflection point of an asymmetric sigmoidal curve does not coincide with its midpoint, the midpoint Tm is calculated from the fit parameters according to the equation
$$\mathrm{midpoint}\ {T}_{m}=\frac{e}{\exp\left[-\frac{1}{b}\ln\left({2}^{1/f}-1\right)\right]}$$
The component returns the melting temperature, fit parameters with error estimates, and the fitted curves.
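A minimal sketch of items 7 and 8 is given below, using the same ‘drc’ package that the Dose Response (R) node relies on; the synthetic melting curve and variable names are purely illustrative, and in the workflow itself this fit is performed inside the KNIME component:

```r
# Sketch of the 5-parameter log-logistic fit and midpoint Tm calculation for one well.
library(drc)

set.seed(1)
well <- data.frame(temperature = seq(30, 80, by = 0.5))
# Synthetic, slightly asymmetric melting curve with a transition near 55 °C
well$fluorescence <- 1 / (1 + exp(-0.4 * (well$temperature - 55)))^1.3 +
                     rnorm(nrow(well), sd = 0.01)

# LL.5() is drc's 5-parameter log-logistic, with parameters b, c, d, e, f as in the equation above
fit <- drm(fluorescence ~ temperature, data = well, fct = LL.5())

p <- coef(fit)               # order: b (slope), c (min), d (max), e (inflection), f (asymmetry)
b <- p[1]; e_infl <- p[4]; f_asym <- p[5]

# Midpoint Tm from the fitted parameters (item 8)
midpoint_tm <- e_infl / exp(-(1 / b) * log(2^(1 / f_asym) - 1))
```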
Step 7: define hit threshold and compile results
Once fitting is complete, the user can define the threshold for hit selection by double-clicking on the component in Step 7. By default, this is set to 3 times the standard deviation of the melting temperatures of the DMSO-treated wells. The component splits the data into two branches for further manipulation. The first branch groups the logistic fit results by molecule ID number and calculates the average melting temperature and standard deviation, thereby providing a summary of replicates if present. The melting temperatures are then compared to the previously defined hit threshold and hits are tagged as such. The second branch creates a heatmap of the plate, colored by melting temperature.
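The first branch can be summarised in a short sketch, assuming a results table with CompoundID and Tm columns and DMSO control wells labelled "DMSO" (column names, labels, and the |ΔTm|-based hit call are illustrative):

```r
# Toy fit results; column names and labels are illustrative.
results <- data.frame(
  CompoundID = c("DMSO", "DMSO", "DMSO", "CPD-001", "CPD-001", "CPD-002"),
  Tm         = c(52.1, 52.3, 51.9, 56.8, 57.1, 52.4)
)

# Per-compound mean Tm and standard deviation (summarises replicates when present)
tm_mean <- tapply(results$Tm, results$CompoundID, mean)
tm_sd   <- tapply(results$Tm, results$CompoundID, sd)
summary_tbl <- data.frame(CompoundID = names(tm_mean),
                          mean_Tm    = as.numeric(tm_mean),
                          sd_Tm      = as.numeric(tm_sd))

# Default hit threshold: 3 x standard deviation of the DMSO control melting temperatures
dmso <- results$Tm[results$CompoundID == "DMSO"]
summary_tbl$delta_Tm <- summary_tbl$mean_Tm - mean(dmso)
summary_tbl$hit      <- abs(summary_tbl$delta_Tm) > 3 * sd(dmso)
```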
Step 8: review melting temperature heatmap
At this stage, the user is encouraged to inspect the array of melting temperatures using the Plate Heatmap Viewer node. This node provides access to three levels of data with corresponding levels of detail: screen → plate → well. The screen view (Fig. 6a) is the default view, a trellis containing each plate in the experiment. This view is invaluable for visualizing row and column artifacts and for comparing plate performance across a screening campaign. For this workflow there is only one plate. The View menu provides the ability to change the colormap, while the trellis toolbar allows the user to change the readout as well as overlaid information, such as a hit tag. The second data level is the plate view (Fig. 6b), which is accessed by right-clicking on a plate in the heatmap trellis. Its controls are identical to those of the screen view, and individual wells can be inspected by hovering over them with the mouse. Right-clicking on a well provides access to the third and final data level, the well view (Fig. 6c). The well view displays detailed information such as well position, compound ID, and chemical structure. The user is advised to note any patterns that may indicate the presence of problematic wells, e.g., lack of consistency in DMSO control wells, or systematic errors such as the presence of edge effects, outlier rows/columns, or unexpected patterns for ostensibly randomly plated compounds.
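For users who also want a static plate heatmap outside KNIME, a quick sketch using ‘ggplot2’ (already a dependency of the workflow) is shown below; this is not the Plate Heatmap Viewer node, and the Row/Column/Tm columns and synthetic values are illustrative only:

```r
# Quick static heatmap of melting temperatures for a 96-well plate (synthetic data).
library(ggplot2)

plate <- expand.grid(Row = LETTERS[1:8], Column = 1:12)
plate$Tm <- 52 + rnorm(nrow(plate))   # illustrative melting temperatures

ggplot(plate, aes(x = factor(Column), y = Row, fill = Tm)) +
  geom_tile(colour = "grey40") +
  scale_y_discrete(limits = rev(LETTERS[1:8])) +  # put plate row A at the top
  labs(x = "Column", y = "Row", fill = "Tm (°C)")
```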
Step 9: output
The workflow produces two files, located in the same folder as the initial thermal shift data file and sharing the same name. The first is an SD file containing all the columns of the initial plate map file along with the melting temperature, residual standard error of the fit, ΔTm, |ΔTm|, and hit status. Alternatively, a second workflow is provided which outputs a CSV file with compound SMILES instead of an SD file. The second output is an XLSX document containing the following sheets:
(a) Plate Heatmap: a heatmap of the melting temperatures across the plate

(b) Fit Results with hits annotated: a list of all fitted compounds including (i) an image of the molecule, (ii) molecule ID number, (iii) well position, (iv) plate row, (v) plate column, (vi) plate ID number, (vii) melting temperature, (viii) residual standard error, (ix) ΔTm, (x) |ΔTm|, and (xi) hit status

(c) DMSO mean and SD: (i) molecule ID, (ii) control Tm, (iii) standard deviation of the control Tm, (iv) hit threshold expressed as 3 × standard deviation of the control Tm, and (v) hit threshold expressed as control Tm + 3 × standard deviation of the control Tm

(d) Fitted Melting Curves: the well position and a graph of the sigmoidal fit for each fitted well

(e) Melting Curves: graphs showing the overlaid raw melting curves, cropped melting curves, accepted melting curves, melting curves with anomalies, melting curves with low signal, and processed melting curves used for fitting.
Step 10: data consolidation
Data files loaded into the KNIME workflow during step 1 will typically contain TSA raw fluorescence data from a single plate run on the qPCR instrument, but screening libraries are usually large enough to span multiple plates. The next step in the screening project is to compile and compare all available data, and it is inconvenient for that data to be spread across the multiple files produced by repeated runs of the workflow. This last (optional) workflow component merges the results files produced during step 9. When executed, the component in step 10 reads all SD (or CSV) files saved in the “Results” folder and combines the data into a single table, which is then saved as both XLSX and SD (or CSV) files.
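For the CSV variant of the workflow, the consolidation amounts to the following sketch, assuming all per-plate result files have been saved with identical columns in a "Results" folder (the file pattern and output file name are illustrative):

```r
# Combine all per-plate CSV result files in the "Results" folder into one table.
files    <- list.files("Results", pattern = "\\.csv$", full.names = TRUE)
combined <- do.call(rbind, lapply(files, read.csv))
write.csv(combined, file.path("Results", "combined_results.csv"), row.names = FALSE)
```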