Filtering Noisy Data

A quick-and-dirty way to clean noisy datasets before training on them

Motivation

At Reverie Labs, we combine diverse data sources to construct the datasets that power our models. We regularly work with experimental data from chemical and biological assays. In addition to intrinsic experimental error, data from different sources can come with different reliability guarantees. We need to limit the incidence of noisy data so that we can be sure that we have accurate performance benchmarks and so that our models are able to generalize to held out test sets.

In this post, we'll walk through a straightforward method for identifying which data points to keep for training. It is a quick-and-dirty technique for cleaning datasets that requires little a priori knowledge. We'll go through two contrived examples, with simple data distributions, to demonstrate what applying this technique looks like at the most basic level.

Identifying Incorrect Training Data

Outlier Detection

Broadly speaking, the task of detecting incorrect training data is similar to, but distinct from, outlier detection. Outlier detection on the dependent variable usually involves assumptions about the underlying data distribution. For example, in a simple case, one might fit a linear regression model \( f(X) \to Y \).

The notion of an outlier in this case might be a data point whose response value is "far away" from the line of best fit, with "far away" defined however we choose. However, with a linear regression model, there is an implicit assumption that \( Y \) is a linear function of \( X \). For any model, the definition of an outlier is dictated by the model itself.
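As a rough illustration of this model-dependent notion of an outlier, here is a minimal sketch that flags points by their residual from a least-squares line. The function name and the `n_sigma` cutoff are hypothetical choices made for this example, not something prescribed by the method described in this post.

```python
import numpy as np


def linear_residual_outliers(x, y, n_sigma=3.0):
    """Flag points whose residual from the least-squares line is large.

    "Far away" is defined here (an assumption for illustration) as more than
    n_sigma standard deviations of the residuals.
    """
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return np.abs(residuals) > n_sigma * residuals.std()


x = np.arange(50, dtype=float)
y = 2.0 * x + 1.0
y[10] += 40.0  # push a single point off the line

flags = linear_residual_outliers(x, y)
```

The flagged set depends entirely on the linear assumption: data that is accurate but non-linear would also be marked, which is the limitation the rest of this post tries to avoid.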

Our hypothesis is that some portion of the data is simply inaccurate - we can even think of it as corrupted. The errors are present independent of the specific model that is being fit to the data. We want to impose as few limitations as possible on the data distribution.

Ensemble Filters

The technique that we use to mitigate model-specific bias is largely based on a paper by Brodley and Friedl (Ref. 1). We extend the intuition they used for identifying mislabeled classification data points to noisy regression labels.

First, we make a distinction between "filtering" models and "learning" models.

Filtering models are used exclusively to determine whether or not a given data sample should be included in the training data.

Learning models are used to fit the training data, for the purpose of predicting labels (outputs) for held-out validation and test data.

The overall idea is to train many independent filtering models (both linear and non-linear), and make each model "vote" to determine whether or not a given data point has an accurate label. Two reasonable choices for a voting scheme are:

Majority filter: if over half of the filtering models "incorrectly" predict the output, we exclude the data point.
Consensus filter: if all of the filtering models "incorrectly" predict the output, we exclude the data point.
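The two voting schemes can be sketched as follows, assuming we already have a boolean matrix recording whether each filtering model's prediction for each data point counts as correct. The function name and the array layout (models along rows, data points along columns) are assumptions made for this example.

```python
import numpy as np


def exclusion_mask(correct, scheme="majority"):
    """Return a boolean mask that is True for data points to exclude.

    correct: bool array of shape (n_models, n_points), True where a
    filtering model's prediction counts as correct.
    """
    incorrect = ~np.asarray(correct, dtype=bool)
    n_models = incorrect.shape[0]
    n_incorrect = incorrect.sum(axis=0)
    if scheme == "majority":
        # Exclude when over half of the models are incorrect.
        return n_incorrect > n_models / 2
    if scheme == "consensus":
        # Exclude only when every model is incorrect.
        return n_incorrect == n_models
    raise ValueError(f"unknown scheme: {scheme!r}")


# Three models (rows) voting on three data points (columns).
correct = np.array([
    [True, False, False],
    [True, True, False],
    [True, False, False],
])

majority = exclusion_mask(correct, "majority")
consensus = exclusion_mask(correct, "consensus")
```

The consensus filter is the more conservative of the two: it discards a point only when no model can predict it, so it removes fewer points than the majority filter on the same votes.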

What does "incorrect" mean in this case? In the context of regression, we can transform all of our dataset's outputs into \( z_\text{score} \) space, and then choose a \( \Delta z_\text{score} \) threshold. If a filtering model's prediction is within \( \Delta z_\text{score} \) of the label, we consider the prediction "correct"; otherwise, we consider it "incorrect." Data points with correct predictions receive positive votes, and those with incorrect predictions receive no votes.
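A minimal sketch of this correctness check is shown below. It assumes that predictions are standardized with the mean and standard deviation of the labels, and that "within" is inclusive of the threshold; both are choices made for this example, and the function name is hypothetical.

```python
import numpy as np


def cast_votes(y_true, y_pred, delta_z):
    """Return True (a positive vote) where a prediction is within delta_z
    of its label after both are mapped into z-score space."""
    mean, std = y_true.mean(), y_true.std()
    z_true = (y_true - mean) / std
    z_pred = (y_pred - mean) / std  # assumption: reuse the label statistics
    return np.abs(z_pred - z_true) <= delta_z


y_true = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
y_pred = np.array([1.0, 10.0, 35.0, 30.0, 40.0])

votes = cast_votes(y_true, y_pred, delta_z=0.14)
```

Working in z-score space makes the threshold unitless, so the same range of \( \Delta z_\text{score} \) values can be tried across datasets whose labels have very different scales.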

Schematic of the basic idea: we train filtering models to decide which data points to keep, and then learning models only on the "approved" data.

The hope is that by collecting votes from different models, we are independently capturing many of the relevant features necessary for correctly predicting accurately labeled data points.

In order to train the filtering models, we use a \( k \)-fold cross-validation scheme, where votes are cast on the held-out fold. If we have \( n \) models, we will have \( n \) votes for each data point. To make the voting outcome less sensitive to the sampled \( k \)-folds, we can run this procedure \( m \) times, such that we randomly select the folds multiple times and there are \( m \times n \) votes for each data point at the end of the filtering process. Ideally, the filtering models should demand little overhead in terms of compute, since they will need to be trained from scratch for each split of the data.
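The repeated cross-validation voting procedure might look roughly like the sketch below. It uses scikit-learn's `RepeatedKFold` to redraw the folds \( m \) times and the three regressors named later in this post as filtering models. The function name `count_votes`, the choice to fit the models directly on z-scored labels, and the default hyperparameters are assumptions made for this example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import BayesianRidge, ElasticNet
from sklearn.model_selection import RepeatedKFold
from sklearn.tree import DecisionTreeRegressor


def count_votes(X, y, models, delta_z, k=10, m=10, seed=0):
    """Count positive votes per data point over m rounds of k-fold CV.

    Each model votes once per round on every held-out point, so the
    maximum count for a point is m * len(models).
    """
    z = (y - y.mean()) / y.std()  # labels in z-score space
    votes = np.zeros(len(y), dtype=int)
    splitter = RepeatedKFold(n_splits=k, n_repeats=m, random_state=seed)
    for train_idx, test_idx in splitter.split(X):
        for model in models:
            # Train a fresh copy of the model for every split.
            fitted = clone(model).fit(X[train_idx], z[train_idx])
            pred = fitted.predict(X[test_idx])
            votes[test_idx] += np.abs(pred - z[test_idx]) <= delta_z
    return votes


X = np.linspace(0, 200, 200).reshape(-1, 1)
y = X.ravel().copy()
models = [ElasticNet(), BayesianRidge(), DecisionTreeRegressor(random_state=0)]

votes = count_votes(X, y, models, delta_z=0.14, k=10, m=2)
total_votes = 2 * len(models)
# Majority filter: exclude when over half of the votes are missing.
exclude = (total_votes - votes) > total_votes / 2
```

Because every model is refit for each of the \( m \times k \) splits, cheap filtering models keep the total cost manageable.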

Lastly, we keep only those training data points that are acceptable according to the voting procedure, and use those to train our learning models.

Examples

Simple Experiment

One of the first and most basic experiments we can do to verify whether this method can select noisy data points is by taking \( y = x \) and randomly adding noise.

Here, a single linear outlier detection method would work well, but the ensemble of filtering models should be able to do so as well!

In this example, we take \( y = x \) with \( x \in [0, 200] \) and randomly select 20% of the points to "corrupt," or add noise to. The new set of outputs is y_corrupted.
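One way to build such a dataset is sketched below. The number of points (200), the evenly spaced grid, and the Gaussian noise with a standard deviation of 50 are assumptions made for this example; the post does not specify how the noise was drawn.

```python
import numpy as np

rng = np.random.default_rng(0)

n_points = 200  # assumption: sample size
x = np.linspace(0, 200, n_points)
y = x.copy()

# Pick 20% of the indices, without replacement, to corrupt.
n_corrupt = n_points // 5
corrupt_idx = rng.choice(n_points, size=n_corrupt, replace=False)

y_corrupted = y.copy()
# Assumption: additive zero-mean Gaussian noise on the selected points.
y_corrupted[corrupt_idx] += rng.normal(loc=0.0, scale=50.0, size=n_corrupt)
```

Keeping `corrupt_idx` around gives us the ground truth needed to score the filter afterwards.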

Next, we can select our filtering models. For illustrative purposes, we choose ElasticNet, BayesianRidge, and DecisionTreeRegressor from sklearn. We sample for \( m = 10 \) rounds with \( k = 10 \) folds at different \( \Delta z_\text{score} \) thresholds, and calculate the Jaccard Index between the set of data points excluded by majority voting and the set of data points that were corrupted.
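For reference, the Jaccard Index of two sets is the size of their intersection divided by the size of their union. A minimal sketch over sets of data point indices follows; the function name and the convention of returning 1.0 when both sets are empty are assumptions made for this example.

```python
def jaccard_index(a, b):
    """Jaccard Index |A ∩ B| / |A ∪ B| for two collections of indices."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)


excluded = [3, 7, 12, 40]       # indices rejected by the majority filter
corrupted = [3, 7, 12, 41, 50]  # indices that were actually corrupted

score = jaccard_index(excluded, corrupted)
```

A score of 1 means the filter excluded exactly the corrupted points; both missed corruptions and wrongly excluded clean points pull the score down.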

As expected, at an appropriate threshold (around 0.14) our method is almost perfectly able to detect the corrupted data points! Having a look at the vote counts at this threshold confirms that the right data points were selected. Recall that a positive vote is given if a data point should be included; this means that data points with low vote counts are suspected to be incorrect.

Vote counts for each data point at \( \Delta z_\text{score} = 0.14 \). Darker data points are further away from the majority-vote decision boundary, whereas lighter points are closer.

Slightly Harder Experiment

Rather than choosing a linear function with noise, we can also choose a more challenging data distribution - for example, the commonly used "S-curve" distribution, or the "Friedman" (Ref. 2) regression problems, which have explicit non-linearities baked in. sklearn has handy methods for generating data from these distributions.

Here, we'll focus on the S-curve dataset. Again, we generate 200 data points and add noise to 20% of the data points at random.

Distribution of \( y \) relative to each dimension of \( x \) for the generated S-curve dataset.
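A sketch of how such data could be generated with scikit-learn is shown below. `make_s_curve` returns three-dimensional points together with a univariate position along the curve; treating that position as the regression target \( y \) is an assumption made for this example, as is the Gaussian noise scale. `make_friedman1` is included only to show the analogous call for one of the Friedman problems.

```python
import numpy as np
from sklearn.datasets import make_friedman1, make_s_curve

rng = np.random.default_rng(0)

# X has shape (200, 3); the second return value is the position along the curve.
X, y = make_s_curve(n_samples=200, random_state=0)

# Corrupt 20% of the labels, chosen at random without replacement.
n_corrupt = len(y) // 5
corrupt_idx = rng.choice(len(y), size=n_corrupt, replace=False)
y_corrupted = y.copy()
# Assumption: additive zero-mean Gaussian noise with unit scale.
y_corrupted[corrupt_idx] += rng.normal(loc=0.0, scale=1.0, size=n_corrupt)

# The Friedman #1 problem, with its default 10 input features.
X_f, y_f = make_friedman1(n_samples=200, random_state=0)
```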

Using the same three filtering models as in the linear case, we obtain a new performance curve and associated votes.

You'll notice that the optimal threshold is different with this dataset (approximately 0.28), but once again the filtering method is able to vote correctly to cast out many of the noisy data points. Many of the errors seem to occur in the tail-ends of the y_train distribution, where there are pronounced non-linearities with respect to the third dimension of \( x \).

In this example, we chose our filtering models naively, but performance could be improved by more sophisticated selection, e.g. by adding more models and exploring different hyperparameters for each.

Nonetheless, even these three filtering models are notably better than using a single linear model. If we only cast votes using ElasticNet, we toss out anything that can't be approximated by a linear kernel (approximately anything outside of \( -2 \leq y \leq 2 \) ).

Votes from using only ElasticNet as a filtering model, at the associated optimal \( \Delta z_\text{score} \) threshold. Non-linearities are not effectively handled.

In Practice

At Reverie, we use this type of filtering method extensively to help clean the datasets that we compile. In some cases, we've been able to boost model performance significantly after cleaning the training set. We measure the improvement on a reliable external test set by comparing a performance metric of choice (e.g. \( R^2 \)) for learning models trained on all of the data against the same metric for learning models trained on the reduced (cleaned) training set. Those models have gone on to be key factors in determining which compounds we synthesize and pursue as therapeutic candidates.

When using this technique with real-world datasets, we do not know beforehand which data points are corrupted, and therefore cannot compute a score such as the Jaccard Index between the set of suspected corrupt data points and the set of actual corrupt data points. Instead, to find a good \( \Delta z_\text{score} \) threshold, we search over reasonable values for \( \Delta z_\text{score} \) and train learning models each time. Then, we can use the threshold which yields the best performance on the held-out data. Analyzing the data distribution can also help inform what a reasonable noise level might be, and what range to search over.

This method is also not necessarily a substitute for domain-specific knowledge when it is available (e.g. understanding the feasible range of assay outputs); in fact, we use both in conjunction. We can often filter out incorrect data points by inspection before passing the dataset into the filtering models.
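The threshold search described above could be sketched as follows. Everything here is illustrative: the helper names, the use of `LinearRegression` as the learning model, the candidate thresholds, and the synthetic data standing in for a real training set and a reliable external test set are all assumptions made for this example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import BayesianRidge, ElasticNet, LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import RepeatedKFold
from sklearn.tree import DecisionTreeRegressor


def majority_keep_mask(X, y, filters, delta_z, k=5, m=2, seed=0):
    """True for points that survive a majority filter at threshold delta_z."""
    z = (y - y.mean()) / y.std()
    votes = np.zeros(len(y), dtype=int)
    splitter = RepeatedKFold(n_splits=k, n_repeats=m, random_state=seed)
    for train_idx, test_idx in splitter.split(X):
        for model in filters:
            fitted = clone(model).fit(X[train_idx], z[train_idx])
            pred = fitted.predict(X[test_idx])
            votes[test_idx] += np.abs(pred - z[test_idx]) <= delta_z
    total_votes = m * len(filters)
    # Keep a point unless over half of its votes are missing.
    return (total_votes - votes) <= total_votes / 2


def search_threshold(X_train, y_train, X_test, y_test, filters, learner, thresholds):
    """Return the threshold with the best held-out R^2, plus all scores."""
    scores = {}
    for delta_z in thresholds:
        keep = majority_keep_mask(X_train, y_train, filters, delta_z)
        if keep.sum() < 2:
            continue  # too few surviving points to train on
        fitted = clone(learner).fit(X_train[keep], y_train[keep])
        scores[delta_z] = r2_score(y_test, fitted.predict(X_test))
    if not scores:
        raise ValueError("no threshold left enough data to train on")
    return max(scores, key=scores.get), scores


# Synthetic stand-ins: a noisy training set and a clean external test set.
rng = np.random.default_rng(0)
X_train = np.linspace(0, 200, 120).reshape(-1, 1)
y_train = X_train.ravel().copy()
noisy = rng.choice(120, size=24, replace=False)
y_train[noisy] += rng.normal(scale=50.0, size=24)
X_test = np.linspace(0, 200, 30).reshape(-1, 1)
y_test = X_test.ravel().copy()

filters = [ElasticNet(), BayesianRidge(), DecisionTreeRegressor(random_state=0)]
thresholds = [0.1, 0.3, 1.0]
best, scores = search_threshold(
    X_train, y_train, X_test, y_test, filters, LinearRegression(), thresholds
)
```

Comparing the best score against a learning model trained on the unfiltered data indicates whether filtering helped at all for a given dataset.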

Try For Yourself

The code that was used to generate the figures in this post is available in a Python notebook at https://colab.research.google.com/drive/1GG4zXwByqmhnVA3Hcoxe3rwVPj5Ok40c.

Join Us

We’re actively hiring Full-Stack, DevOps, Front-End, and Machine Learning Engineers. If you're interested in developing computational tools for drug discovery, take a look at our current openings at https://www.reverielabs.com/careers.

References

1. Brodley, C. E., & Friedl, M. A. (1999). Identifying mislabeled training data. Journal of Artificial Intelligence Research, 11, 131-167.
2. Friedman, J. H. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19(1), 1-67.