avoiding bias by blinding

July 3, 2020 at 11:15 am | | literature, scientific integrity

Key components of the scientific process are: controls, avoiding bias, and replication. Most scientists are great at controls, but without the other two, we’re simply not doing science.

The lack of independent samples (and thus the improper inflation of n) and the failure to blind experiments are too common. These mistakes, especially when combined in one study, mean that many published cell biology results are likely artifacts. Generating large datasets, even with a slight bias, can quickly yield “significant” results out of noise. For example, see the “NHST is unsuitable for large datasets” section of:

Szucs D, Ioannidis JPA. When Null Hypothesis Significance Testing Is Unsuitable for Research: A Reassessment. Front Hum Neurosci. 2017;11:390. https://pubmed.ncbi.nlm.nih.gov/28824397/
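To see the problem in action, here is a quick simulation (my own illustration, not from the paper): with enough measurements, even a tiny, biologically meaningless bias produces an impressively small P value. The effect size and sample sizes below are arbitrary.

# Tiny bias + huge n = "significant" noise.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

control = rng.normal(loc=10.0, scale=2.0, size=20000)   # no real effect
treated = rng.normal(loc=10.1, scale=2.0, size=20000)   # a 1% systematic bias

t, p = stats.ttest_ind(control, treated)
print(f"P = {p:.2g}")   # prints a tiny P value even though the effect is trivial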

Now combine these common false positives with the inclination to publish flashy results, and we’ve made a recipe for unreliable scientific literature.

I do not condemn authors for these problems. Most of us have made one or all of these mistakes. I have. And I probably will again in the future. Science is hard. There is no shame in making honest mistakes. But we can all strive to be better (see my last section).

Failing to perform the data collection blinded

Blinding is just basic scientific rigor, and skipping this should be considered almost as bad as skipping controls.

Blinding samples during data collection and analysis is ideal. For data collection, it is usually as simple as having a labmate put tape over labels on your samples and label them with a dummy index. Insist that your coworker writes down the key, so later you can decode the dummy index back to the true sample information.

In cases where true blinding is impractical, the selection of cells to image/collect should be randomized (e.g. set random coordinates for the microscope stage) or otherwise designed to avoid bias (e.g. selecting cells using transmitted light if fluorescence is the readout).
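If your stage software accepts a list of coordinates, you can generate the random positions before you ever look at the sample. Here is a minimal sketch in Python; the well dimensions, number of fields, and units are placeholders, not values for any particular microscope.

# Pre-generate random stage positions so field selection can't be biased
# by what the cells look like.
import random

well_size_um = 9000   # assumed imageable width/height of the well, in micrometers
n_fields = 20         # how many fields to image per well

random.seed(42)       # fix the seed if you want to reuse the same positions
positions = [(random.uniform(0, well_size_um), random.uniform(0, well_size_um))
             for _ in range(n_fields)]

for i, (x, y) in enumerate(positions, start=1):
    print(f"field {i}: x = {x:.0f} um, y = {y:.0f} um")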

Failing to perform the data analysis blinded

Blinding during data analysis is generally very practical, even when the original data was not collected in a bias-free fashion. Ideally, image analysis would be done entirely by algorithms and computers, but often the most practical and effective approach is the old-fashioned human eye. Ensuring your manual analysis isn’t biased is usually as simple as scrambling the image filenames.

I stumbled upon these ImageJ macros for randomizing and derandomizing image filenames, written by Martin Höhne: http://imagej.1557.x6.nabble.com/Macro-for-Blind-Analyses-td3687632.html

More recently, Christophe Leterrier directed me to Steve Royle’s macro, which works very well: https://github.com/quantixed/imagej-macros#blind-analysis

There are probably some excellent solutions using Python (see the sketch below). Regardless of the approach you take, I would highly recommend copying all the data to a new folder before you perform any filename changes. Then test the program forwards and backwards to confirm everything works as expected. Maybe perform the analysis in batches, so in case something goes awry, you don’t lose all that work.
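For example, here is a rough sketch of what a Python version might look like (my own illustration, not the macros linked above; the folder names and the .tif extension are placeholders). It copies the images rather than renaming them in place, and writes a key file so the names can be decoded after the analysis is done.

# Copy images to a new folder under random dummy names, and save a key
# so the dummy names can be mapped back to the originals later.
import csv
import random
import shutil
from pathlib import Path

src = Path("images")            # original data, left untouched
dst = Path("images_blinded")    # blinded copies go here
dst.mkdir(exist_ok=True)

files = sorted(src.glob("*.tif"))
random.shuffle(files)

with open(dst / "key.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["blinded_name", "original_name"])
    for i, f in enumerate(files, start=1):
        blinded = f"blind_{i:04d}.tif"
        shutil.copy2(f, dst / blinded)      # copy, never move, the originals
        writer.writerow([blinded, f.name])  # the key for derandomizing

Have a labmate hold on to key.csv (or at least resist opening it) until the counting is finished.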

My “blind and dumb” experiment

There are many stories about unintended bias leading to false conclusions. Here’s mine: I was testing to see whether a drug treatment inhibited cells from crawling through a porous barrier by counting the number of cells that made it through the barrier to an adjacent well.

My partner in crime had labeled the samples with dummy indices, so I didn’t know which wells were treated and which were control. But I immediately could tell that there were more cells in one set of wells, so I presumed those were the control set. Fortunately, I had taken the extra precaution of randomizing the stage positions, so I didn’t let my bias alter the data collection. We then blinded the analysis by relabeling the microscopy images. I manually counted all the cells in each image.

We then unblinded the samples. At first, we were disappointed that the wells I had assumed were control turned out to be treated. Then we looked at the results. SURPRISE! My snap judgement at the beginning of the experiment had been precisely backwards: the wells I thought looked like they had sparser cells actually had significantly more on average. So it turned out that the drug treatment had indeed worked. Thankfully, I didn’t rely on my snap judgement nor allow that bias to influence the results.

Treating each cell as an n in the statistical analysis

This error plagues the majority of cell biology papers. Go scan a recent issue of your favorite journal and count the number of papers that have minuscule P values; invariably, the authors aggregated all the cell measurements from multiple experiments and calculated the t-test or ANOVA based on those dozens or hundreds of measurements. This is a fatal error.

It is patently absurd to consider neighboring cells in the same dish all treated simultaneously with the same drug as independent tests of a hypothesis.

If your neighbor told you that he ate a banana peel and it reversed his balding, you might be a little skeptical. If he further explained that he measured 1000 of his hair follicles before and after eating the banana peel and measured a P < 0.05 difference in growth rate, would you be convinced? Maybe it was just a fluke or noise that his hairs started growing faster. You would want him to repeat the experiment a few times (maybe even with different people) before you started believing.

Similarly, there are many reasons two dishes of cells might be different. To start believing that a treatment is truly effective, we all understand that we should repeat the experiment a few times and get similar results. Counting each cell measurement as the sample size n all but guarantees a small—but meaningless—P value.

Observe how dramatically different scenarios (on the right) yield the same plot and P value when you assume each cell is a separate n (on the left):

Elegant solutions include “hierarchical,” “nested,” or “mixed effect” statistics. A simple approach is to separately pool the cell-level data from each experiment, then compare experiment-level means (in the case above, the n for each condition would be 3, not 300). For more details, please read my previous blog post or our paper:

Lord SJ, Velle KB, Mullins RD, Fritz-Laylin LK. SuperPlots: Communicating reproducibility and variability in cell biology. J Cell Biol. 2020;219(6):e202001064.
https://pubmed.ncbi.nlm.nih.gov/32346721
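If your cell-level measurements are in a table, that simple approach takes only a few lines of Python. This is a sketch under assumed column names ("condition", "experiment", "value"), not the code behind the paper.

# Collapse cell-level data to one mean per biological replicate, then compare
# the experiment-level means, so n = number of experiments, not number of cells.
import pandas as pd
from scipy import stats

cells = pd.read_csv("cell_measurements.csv")   # one row per cell

per_experiment = (cells.groupby(["condition", "experiment"])["value"]
                       .mean()
                       .reset_index())

control = per_experiment.query("condition == 'control'")["value"]
treated = per_experiment.query("condition == 'treated'")["value"]

t, p = stats.ttest_ind(control, treated)
print(f"n = {len(control)} vs {len(treated)} experiments, P = {p:.2g}")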

How do we fix this?

Professors need to teach their trainees the very basics of how to design experiments (see Stan Lazic’s book: Experimental Design for Laboratory Biologists) and perform analysis (see Mike Whitlock and Dolph Schluter’s book: The Analysis of Biological Data). PIs need to provide researchers with the tools to blind their experiments or otherwise remove bias. They need to ask for multiple biological replicates and correctly calculated P values. This does not require an advanced understanding of statistics, just a basic understanding of the importance of repeating an experiment multiple times to ensure an observation is real.

Editors and referees need to demand correct data analysis. While asking researchers to redo an experiment isn’t really acceptable, requiring a reanalysis of the data after blinding or recalculating P values based on biological replicates seems fair. Editors should not even send manuscripts to referees if the above errors are not corrected or at least addressed in some fashion. Editors can offer the simple solutions listed above.

UPDATE: Also read my proposal to replace peer review with peer “replication.”
