The reproducibility crisis has had [its own Wikipedia page](https://en.wikipedia.org/wiki/Replication_crisis) since early 2015. A quick search of the term “replication crisis” in “[Google Trends](https://trends.google.com/trends/explore?date=all&q=%2Fm%2F012mc030)” shows that this expression has been gaining in popularity since the first half of the 2010s. It has been [a common subject in editorials](https://www.nature.com/collections/prbfkwmwvz/) in the journal “Nature” in recent years. [A survey in the form of a questionnaire](https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970) issued to more than 1,500 scientists from all fields was published in 2016 and has since been widely quoted in articles dealing with the subject: Yes, according to the majority of scientists questioned, there is a problem replicating published scientific experiments; and yes, they believe this is a crisis, implying not only that this is a serious matter, but that it is a new one as well ([Baker, 2016](https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970)).
Reproducibility is often seen as a bare minimum in science, and yet, if you were to exaggerate slightly, you could argue that practically nothing is ever reproduced: nobody wants to try, and when someone does try, it doesn't work. What is new is that this now appears unacceptable within the framework of sound scientific practice, in the name of scientific credibility, and [initiatives](https://rescience.github.io/) aimed at remedying the issue are starting to crop up.
In this section, we will look at the reproducibility crisis from the point of view of the history of science (as well as the philosophy and sociology of science).
* The first chapter takes a look at the concept of reproducibility as raised in the history of science, introducing concepts such as tacit knowledge, virtual witness, literary technology and experimenter’s regress.
* The second chapter addresses the six categories of reproducibility (which differ depending on the scientific field) proposed by the philosopher Sabina Leonelli, reaching a conclusion on the problem of reproducibility being seen as a global gold standard.
* The third chapter deals with the narrative of the reproducibility crisis, attempting to place it within a historical context.
* The fourth chapter deals specifically with “computational reproducibility” in this context.
Chapter 1: Reproducibility in the history of science
-----------------------------------------------------
First and foremost, reproducibility poses a semantic problem. Not only are there various terms for it (replicability? repeatability? checking? robustness?), but there are also various meanings associated with each of these terms: Indeed, the exact meaning of reproducibility is rarely reproducible. This absence of consensus can probably be attributed to the diversity of scientific communities tackling this issue. A typology of the different terms and their uses has already been drafted ([Baker, 2016b](https://www.nature.com/news/muddled-meanings-hamper-efforts-to-fix-reproducibility-crisis-1.20076)), but a detailed analysis of where these words come from, the reasons for the differences between them and the cultures they correspond to would be an interesting research subject in itself. For the remainder of this text, we will try to remain agnostic, while attempting to highlight diversity.
Generally speaking, the literature in the history of science indicates that, although reproducibility does lead to more reliability in science, it is just one method among others, and is not always either sufficient or necessary. Steinle’s historical survey ([Steinle, 2016](https://onlinelibrary.wiley.com/doi/10.1002/9781118865064.ch3)) covers a range of such situations.
Reproducibility is also complex because, although to all appearances it might seem simple, the concept raises a whole host of questions:
* Reproducible by whom? By yourself? By a colleague? A competitor? A reviewer? A verification body?
* Reproducible for what? To validate? To refute? To interpret?
* Reproducible how? Using the same instrumentation? The same protocol? The same conclusion using other means?
* Indeed, what is the “same”? Strict measurements? Consistent patterns of results? Generalising conclusions?
* And, lastly, when do we need this “sameness”? To verify? To demonstrate? To refute or contradict? To generalise?
Furthermore, should “reproduction” be hypothetical or actual? Generally speaking, scientists have no inclination to reproduce experiments carried out by others: Given that the reward for a scientific contribution is academic publication and given that the value of a publication lies in its originality, the question is often hypothetical. On the rare occasions when reproduction is actually attempted, it tends to be in the context of a controversy.
Paradoxically, despite the confusion surrounding it and the little reproduction that actually takes place, reproducibility is often viewed as a bare minimum in good scientific practice, or even as a gold standard in science. It is important, for example, in criticism of publications, particularly in peer reviewing. The need for reproducibility is thus often held up as an unquestionable moral principle, while the conditions for keeping to this principle remain largely hypothetical. Criticism of reviewing has become more and more common, and proposals have been considered ([Ross-Hellauer, 2017](https://f1000research.com/articles/6-588/v2)), but this criticism, sometimes acted upon in avant-garde scientific journals, has run into resistance to change from the most prestigious and most influential journals.
In the following four examples, taken from research carried out by historians of science, we will focus on a few concepts that have become standard and which illustrate how reproducibility can serve a number of functions, in what ways it is complicated to implement and what techniques have been used in order to make it more legitimate.
### Epistemic virtues: the viper’s venom
First and foremost, reproducing experiments can have different epistemic virtues depending on the context. In the case of research carried out on the effects of viper bites by scientists from the Accademia del Cimento in Tuscany in the 17th century, the historian Jutta Schickore compares the differing uses of the replication of experiments by Redi and his successor Fontana ([Schickore, 2017](https://www.press.uchicago.edu/ucp/books/book/chicago/A/bo25793826.html)). While the former boasted of having repeated his experiments with viper bites on frogs hundreds of times, this was both to deal with the uncertainty of working with different vipers and frogs and to discredit the results of his competitors. Fontana later attempted to understand the variability of the results by isolating cases which “didn't stick” and interpreting them. In doing so, he came up with a theory as to how the poison worked. Schickore thus shows that reproducibility can serve different epistemic functions in a scientific context.
### Reproducing experiments in the history of science: Joule’s link to breweries
In the history of science, attempting to reproduce experiments from the past is also a way of revealing the context in which these experiments were carried out, moving beyond mere publication. The historian Otto Sibum, for example, revisited the experiments carried out by James Prescott Joule on the mechanical equivalent of heat ([Sibum, 1995](https://www.sciencedirect.com/science/article/abs/pii/0039368194000369)). Joule’s results were not quantitatively reproducible (measurements of the tiny increases in the temperature of water stirred by a paddle wheel driven by a falling weight were highly sensitive to atmospheric conditions inside the laboratory) and nothing was found in the relevant publications, notebooks or Joule’s private correspondence to indicate how to deal with the uncertainty surrounding this variability.
Sibum came to the conclusion that Joule’s connection with the world of beer brewing (a field in which industrialisation required expert knowledge of temperature regulation during fermentation) explained the tacit knowledge which enabled him to handle temperature variability during his experiments. Attempts at reproducing these experiments made it possible to clarify the concept of tacit knowledge: what the experimenter is unable to explain with regard to the success of their experiment, not through negligence, but because some knowledge cannot, in essence, be made explicit. Accordingly, reproducibility is an unattainable ideal.
### Gaining support: Leviathan and the Air-Pump
Depending on the situation, reproducibility also means being able to convince other people, which strongly shapes both what is accepted as being identical in a reproduction and the techniques employed in order to gain support.
The advent of the air pump in the 17th century is a canonical example. The credibility of the experiments carried out by Otto von Guericke on vacuums relied on the spectacle. The exceptional nature of his public demonstrations and the complex and unique apparatus of the vacuum pump rendered reproduction infeasible for anyone other than Guericke (thus ensuring the success of his shows).
The experiments carried out by Robert Boyle on vacuums, based on the same experimental principle, but performed using different apparatus, needed something else in order to gain credibility. The success of his experiments was critically dependent on Hooke’s experimental expertise. In order to gain support for the results of his experiments, Boyle called upon carefully selected witnesses (the gentlemen behind the foundation of the Royal Society, deemed trustworthy because of their aristocratic backgrounds). On top of this, the experiments and apparatus were described in writing as precisely as possible in an official report certified by the gentlemen (the ancestor of the academic publication as we know it today): this is what the historians Shapin and Schaffer ([Shapin and Schaffer, 1989](https://press.princeton.edu/books/paperback/9780691178165/leviathan-and-the-air-pump)) termed virtual witnessing, a witness delivered in writing. It is worth pointing out that nobody, despite the best efforts of Huygens most notably, ever succeeded in reproducing a functional air pump from these publications: In this case, reproduction was overly dependent on tacit knowledge. The purpose of this literary technology was not (and still is not in modern publications) strictly reproducibility, but rather legitimacy.
### Studying controversies: Gravitational waves
This concept of tacit knowledge, initially described by Polanyi, was developed and categorised by the sociologist of science Harry Collins. Collins is the most frequently cited figure when it comes to theorising about reproducibility. His case studies (the TEA laser, the Q factor of sapphire, the 40-year-long attempts to prove the existence of gravitational waves) show the influence this tacit knowledge has over reproduction issues.
The SSK school (Sociology of Scientific Knowledge) has taken a particular interest in scientific controversies, arguing that they reveal more about what is really happening in science (knowledge in the making) than situations where everything goes seamlessly. Studying controversies increases the likelihood of understanding the forms of tacit knowledge which affect reproduction, which don't appear in publications and which come to prominence as a result of researchers contesting them.
In the case of the experimental tests aimed at proving the existence of gravitational waves (predicted by Einstein’s theory of relativity), Collins demonstrated that the experimental equipment (specifically the equipment used to process the signal-to-noise ratio produced by the experiment) created a whole host of problems (identified by the researchers attempting to reproduce the results), thus preventing a consensus from being reached. The controversy fizzled out without either side being able to convince the other ([Collins, 1985](https://www.press.uchicago.edu/ucp/books/book/chicago/C/bo3623576.html)). With the notion of the experimenter’s regress (an allusion to the [regress argument](https://en.wikipedia.org/wiki/Regress_argument)), Collins identified a form of uncertainty which he found to be unsolvable: “To know whether an experiment has been well conducted, one needs to know whether it gives rise to the correct outcome. But to know what the correct outcome is, one needs to do a well-conducted experiment. But to know whether the experiment has been well conducted ...ad infinitum. Experimenter’s regress shows that experiment alone cannot force a scientist to accept a view that they are determined to resist” ([Collins, 2016](https://onlinelibrary.wiley.com/doi/10.1002/9781118865064.ch4)).
Collins’ “sociology of calibration” stresses the need for the credibility of the instruments and protocols used in experiments to be established in order for conclusions based on the experimental results to be accepted by the relevant community of scientists. In the case of gravitational waves, proof of their existence was initially rejected in the 1970s, and was only finally accepted 40 years later. Shapin and Schaffer drew on this contemporary example in their description of the “literary technology” used by Boyle for his air pumps in his quest for legitimacy.
Chapter 2: Six categories of reproducibility
-----------------------------------------------
Sabina Leonelli is a philosopher of science whose focus is on “data-centric biology”, which she describes as scientific activity in the life sciences in the age of big data. She studies what she calls data journeys. Data are never “raw”: they carry all of the theories, conditions, protocols, biases and cultures which shaped their production, yet they are always reused under other conditions, by researchers belonging to other cultures ([Leonelli, 2016](https://www.press.uchicago.edu/ucp/books/book/chicago/D/bo24957334.html)). With regard to reproducibility, in a spirit similar to the approaches which take into account the different epistemic cultures of different scientific fields, Leonelli has proposed six categories of scientific activity for which “reproducibility” does not necessarily have the same meaning or the same importance ([Leonelli 2018](https://www.emerald.com/insight/content/doi/10.1108/S0743-41542018000036B009/full/html)).
### 1 Computational Reproducibility
Computational reproducibility is the first form Leonelli considers, feeling it to be the one that lends itself most easily to a straightforward definition. In Leonelli’s own words, “a research project is computationally reproducible if a second investigator [...] can recreate the final reported results of the project, including key quantitative findings, tables, and figures, given only a set of files and written instructions”. In this view, the computational seems restricted to data processing, with the statistical question implicitly embedded into it (see chapter 4, “Computational reproducibility”, for a discussion of the problems this poses). For Leonelli, this is the only field in which “absolute” reproducibility is both feasible and desirable.
### 2 Direct Experimental Reproducibility: Standardised Experiments
The second category concerns direct experimental reproducibility. This concerns experiments that are easier to control (she cites **clinical trials in medicine** and **high energy physics**) and where, as a result, reproducibility is an ideal in scientific practice: it is both desirable and essential. In this category, contrary (in Leonelli’s opinion) to the first, “the circumstances of data production are, by contrast, a primary concern for experimentalists”. In these fields, there is typically a great deal of control over conditions and an expectation of reproducibility with regard to the patterns of data produced by the experiment, if not the exact output data itself. Statistics also tend to be used in order to validate the reproduction of these patterns.
### 3 Scoping, Indirect and Hypothetical Reproducibility: Semi-Standardised Experiments
In her third category, “methods, set-up and materials used have been construed with ingenuity in order to yield very specific outcomes, and yet some significant parts of the set-up necessarily elude the controls set up by experimenters”. This concerns research using **model organisms** (lab rats), **social psychology** or **neuroscience**: what all of these have in common is that they cannot be fully standardised. The interesting conclusions come precisely from the non-standardised aspect of these experiments, similar to the repetition seen in Fontana’s experiments aimed at studying variability in the effects of viper bites, in contrast to Redi, for whom repetition had no purpose other than to deal with randomness ([Schickore 2017](https://www.press.uchicago.edu/ucp/books/book/chicago/A/bo25793826.html)). In this slightly catch-all category, Leonelli proposes that the most significant reproducibility can be found in the “convergence across multiple lines of evidence, even when they are produced in different ways”, which historians of science have termed triangulation or robustness ([Cartwright, 1991](https://econpapers.repec.org/RePEc:hop:hopeec:v:23:y:1991:i:1:p:143-155)): arriving at consistent conclusions using experiments that have no link to each other.
### 4 Reproducible Expertise: Non-Standard Experiments and Research on Rare Materials
The following category relates to the exceptional: “cases where experimenters are studying new objects or phenomena (new organisms for instance) and/or employing newly devised, unique instruments that are precisely tailored to the inquiry at hand”. In such situations, significance is linked less to control over experimental conditions than it is to expertise in dealing with exceptional conditions: researchers “focus less on controls and more on developing robust ways of evaluating the effects of their interventions and the relation between those effects and the experimental circumstances at the time in which data were collected”. This concerns sciences which deal with the rare: in **archaeology**, for example, repetition is neither possible nor applicable, and the “uniqueness and irreproducibility of the materials is arguably what makes the resulting data particularly useful as evidence”. In this category, epistemic virtue lies in expertise: “reproducible expertise [...] as the expectation that any skilled experimenter working with the same methods and the same type of materials at that particular time and place would produce similar results.”
### 5 Reproducible Observation: Non-Experimental Case Description
The last two categories concern observational sciences: “surveys, descriptions and case reports documenting unique circumstances”. Once again, expertise is key to “reproducible observation”. “Reproducibility of observation [is] the expectation that any skilled researcher placed in the same time and place would pick out, if not the same data, at least similar patterns”. Here, Leonelli cites both **sociology** and **radiology**: “structured interviewing, where researchers devise a relatively rigid framing for their interactions with informants; and diagnosis based on radiographies, resonance scans and other medical imaging techniques.”
### 6 Non-Reproducible Research: Embracing Context-Dependence
The final category deals with scientific practices where “the idea of reproducibility has been rejected in favour of an embrace of the subjectivity and unavoidable context-dependence of research outcomes”. In **anthropology**, reproducibility is irrelevant: “Anthropologists cannot rely on reproducibility as an epistemic criterion for data quality and validity. They therefore devote considerable care to documenting data production processes”.
When reproducibility is irrelevant, scientific communities base their credibility on other epistemic virtues, including reflexivity (from which many scientific fields could draw inspiration). “Ethnographic work in anthropology, for instance, has developed methods to account for the fact that data are likely to change depending on time, place, subjects as well as researchers’ moods, experiences and interests. Key among such methods is the principle of reflexivity, which requires researchers to give as comprehensive a view of their personal circumstances”.
The conclusion Leonelli draws based on these six categories is that the demand for reproducibility (as a means of ensuring reliability) creates problems, particularly when given a narrow definition, based on precepts which only have any meaning within a particular field. It threatens the vitality of various different scientific fields where such a demand might be irrelevant or even counter-productive. Some go further, seeing this “one size fits all” approach to reproducibility as an attempt to ghettoise those scientific domains which do not correspond to a standard too readily accepted as being universal ([Penders et al. 2019](https://www.mdpi.com/2304-6775/7/3/52)).
Chapter 3: Crisis and crisis discourse
---------------------------------------------
The narrative of the reproducibility crisis raises a question: Why is the crisis happening now, and why did it start in the 2010s? Why is this crisis affecting so many separate fields (psychology, epidemiology, computational science, etc.)? How is this linked to the narrative of open science? Or to the crisis of open access to publications and the issue surrounding access to data ([Hocquet,2018](https://theconversation.com/debat-l-open-science-une-expression-floue-et-ambigue-108187))? What is at stake? How and why have these aspects of the discourse been taken up/launched/amplified by institutions (learned societies, national institutions, etc.)? What prescriptive vision of what “good science” should be does this portray? How is this narrative linked to a crisis of confidence (among both citizens and institutions) with regard to science and how are institutions able to manage that? Although it does not answer all of these questions, this brief memo strives to offer some hints.
### The crisis in the media
At first glance, the warning signs and the channels through which they were relayed would suggest a link or, at least, some form of simultaneity, between the reproducibility crisis and the Open Access movement, the rebellion against the oligopolies of scientific publishers and their habit of making scientific literature inaccessible to mere mortals (or even to mere researchers): One of the rallying cries of this movement is for more transparency in the sciences. By extension, the Open Science movement claims that scientific experiments should be reproducible as part of this demand for transparency.
Reproducibility is seen as the gold standard, a yardstick for gauging trust in science - not only on the part of researchers, but also on the part of scientific funding institutions and citizens. The link between academic publishing, transparency and reproducibility is particularly significant in the criticism of peer-reviewing, which has accompanied the Open Access movement.
An in-depth understanding of the phenomenon would require a study based on a proper corpus of publications, not just in scientific journals, in the press or in publications at the interface between these two worlds (such as Nature), but also in press releases from institutions (learned societies, funding organisations, etc.). Such a study would also make it possible to trace more specific genealogies within different scientific fields, each of which is affected by the crisis in a different way.
There are also those sceptical of the narrative of crisis ([Fanelli, 2018](https://www.pnas.org/content/115/11/2628)), particularly those questioning how widespread irreproducible results really are and whether they have actually been on the rise recently. Science is evolving, and with it its own criteria for reliability, particularly in terms of what is considered statistically significant.
Based on a quick (and admittedly sketchy) analysis of the crisis as it has been narrated, three model dynamics can be proposed for the media coverage of the crisis, all linked, but rooted in three different fields. That said, many other scientific fields (economics, computational science, etc.) are subject to this crisis narrative.
### In psychology
On one hand, in psychology, the reproducibility of scientific studies is often contested, primarily because these studies sometimes appear in the mainstream press: they are media-friendly, and so they get exposure. A well-known example of media hype around this question was the “[feeling the future experiment](https://en.wikipedia.org/wiki/Daryl_Bem#%22Feeling_the_Future%22_controversy)” in 2011. On the other hand, psychology is a scientific field that is often on the defensive, constantly asked to justify its scientific legitimacy (see, for example, the controversy surrounding the results of the Milgram experiment or the Stanford prison experiment).
The Reproducibility Project, an effort to reproduce psychology experiments involving an entire scientific community, was launched in 2011, followed in 2013 by the creation of the Center for Open Science. Many papers have been published on this subject in journals in the field, summarised in a manifesto entitled “The seven deadly sins of psychology: a manifesto for reforming the culture of scientific practice” ([Chambers, 2017](https://press.princeton.edu/books/hardcover/9780691158907/the-seven-deadly-sins-of-psychology)). In psychology, the reproducibility crisis corresponds to a form of introspection on the part of this scientific field, the goal being to collectively define good scientific practice. An example of this can be found in the requirement to preregister upcoming studies in order to reduce the risk of p-hacking ([Adam, 2019](https://www.sciencemag.org/news/2019/05/solution-psychology-s-reproducibility-problem-just-failed-its-first-test)). Uljana Feest, a philosopher of the social sciences, has proposed that this scientific field redefine its practices as exploration, as opposed to futile attempts at “reproduction”, in order to break the deadlock, in a bluntly-titled article: “Why replication is overrated” ([Feest, 2019](https://www.journals.uchicago.edu/doi/abs/10.1086/705451)).
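To make concrete what preregistration protects against, here is a minimal sketch (in Python, with invented numbers) of the statistical mechanism behind p-hacking: if a study with no real effect tests many outcome measures and reports whichever turns out “significant”, the effective false-positive rate is far higher than the nominal 5% of a single pre-specified test.

```python
import random

random.seed(42)

N_STUDIES = 10_000   # simulated studies, all with NO real effect
N_OUTCOMES = 20      # outcome measures tested per (un-preregistered) study
ALPHA = 0.05

# Under the null hypothesis, the p-value of a well-calibrated test is
# uniformly distributed on [0, 1]; we simulate it directly.
def fake_pvalue():
    return random.random()

preregistered_hits = 0   # one pre-specified test per study
phacked_hits = 0         # "significant if ANY of the 20 tests is significant"

for _ in range(N_STUDIES):
    pvalues = [fake_pvalue() for _ in range(N_OUTCOMES)]
    if pvalues[0] < ALPHA:
        preregistered_hits += 1
    if min(pvalues) < ALPHA:
        phacked_hits += 1

print(f"False-positive rate, single preregistered test: {preregistered_hits / N_STUDIES:.3f}")
print(f"False-positive rate, best of {N_OUTCOMES} tests:  {phacked_hits / N_STUDIES:.3f}")
# Expected: about 0.05 versus about 1 - 0.95**20, i.e. roughly 0.64
```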
### Clinical trials
The other field concerns clinical trials in medicine. The historian of science Nicole Nelson ([Nelson, 2019](https://www.radcliffe.harvard.edu/event/2019-nicole-c-nelson-fellow-presentation)) has proposed a genealogy of the crisis in the world of clinical trials. In 2012, the media coverage of a paper claiming to have attempted to reproduce dozens of preclinical cancer studies, with a failure rate close to 90%, caused an outcry ([Begley & Ellis, 2012](https://www.nature.com/articles/483531a)). What was peculiar about this paper is that its authors were affiliated with a private biomedical company. It is widely known that researchers rarely have any interest in spending their time trying to reproduce others’ results (except in competitions or disputes surrounding specific controversies), since this brings no benefit in terms of originality or publications. Coming from researchers in the private sector, the effort is even more surprising, since they do not stand to gain anything financially either.
The history of clinical trials in medicine helps us to understand the reasons for this. “[Evidence-based medicine](https://en.wikipedia.org/wiki/Evidence-based_medicine)” is a policy (particularly in the USA) which emerged towards the end of the 20th century and which is based on the hope of rationalising the decision-making process in medicine. It shone a new spotlight on meta-analyses of clinical trials (primarily via forest plots). One consequence of this was a growing doubt surrounding the validity of clinical studies, faced with results which sometimes appeared contradictory. In the early 2000s, much of the focus was on bias linked to privately-funded research, culminating in the book “The Truth About the Drug Companies: How They Deceive Us and What to Do About It” ([Angell, 2004](https://www.penguinrandomhouse.com/books/3901/the-truth-about-the-drug-companies-by-marcia-angell-md/)), which came from the academic medical community. Recent evidence, from privately-funded research, that reproducibility is as much of an issue in public research as it is in private research would suggest that this lack of reproducibility is linked to factors other than financial interests. Nelson sees this as a sort of counter-attack from the pharmaceutical industry in an attempt to move away from the role of villain in which it has been cast.
### Metascience
The other challenge comes from statisticians. The paper published by Ioannidis in 2005, “Why Most Published Research Findings Are False” (a particularly nuanced title), is by far the most cited ([Ioannidis, 2005](https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124)). So much so that, since then, a self-proclaimed scientific field called “[metascience](https://en.wikipedia.org/wiki/Metascience)”, dreamt up by statisticians with the goal of analysing reproducibility issues in science, has focused on the issues raised by statistics, analysing the question exclusively from the perspective of good statistical practices and putting them on a pedestal as THE scientific method, as though it were unique.
Indeed, it is striking that the practically unanimous media coverage of the crisis has concerned the statistical processing of results from experiments, not just in the two fields discussed here, but in the majority of others as well. The omnipresence of statistical processing, chiefly as a scientific practice that is now widely used outside of expert circles, but also as a subject for debate in the media, is the third dynamic in question. Epistemic problems surrounding reproducibility are far more diverse than this question alone, but statistical reproducibility gets by far the most media coverage.
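To give an idea of the kind of statistical argument driving this coverage, here is a minimal sketch of the arithmetic behind Ioannidis’ provocative title, under deliberately simplified assumptions (a single team, no bias; the prior probabilities, power and significance threshold are chosen purely for illustration).

```python
# Positive predictive value (PPV) of a "significant" finding, in the
# simplified setting of Ioannidis (2005): ignore bias and multiple teams.
def ppv(prior, power=0.8, alpha=0.05):
    """Probability that a statistically significant result is actually true.

    prior: pre-study probability that the tested hypothesis is true.
    power: probability of detecting a true effect (1 - beta).
    alpha: significance threshold (false-positive rate under the null).
    """
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)

for prior in (0.5, 0.1, 0.01):
    print(f"prior={prior}: PPV = {ppv(prior):.2f}")
# With a prior of 1%, only ~14% of "significant" findings are true,
# even with decent power -- the arithmetic behind the provocative title.
```

The point is not the exact figures, but that when few of the hypotheses being tested are true to begin with, a “statistically significant” result is more often false than true.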
Chapter 4: Computational reproducibility
-------------------------------------------------
Computational reproducibility, i.e. the aspect in which the computer has a role to play, is what is of interest to us here. In the narrative of the crisis (as it has featured in the media), there is a tendency for it to get lost amid the statistical crisis. Let's consider three different domains in which reproducibility is an issue (even if they intermingle): the experimental, the statistical and the computational. The statistical is without any doubt the star of the show. Furthermore, in her categorisation, Leonelli has a tendency to conflate the computational and the statistical, while reducing the computational to data processing.
The statistical tends to be more prominent in the most media-friendly fields (clinical trials and psychology). It is more widely used by non-experts and is easier to dramatise, which is why it gets more media coverage. The appearance in the media of “metascience” experts, such as Ioannidis, surfing on the wave of this reproducibility crisis, only confirms this trend.
### Experimental reproducibility
From a historical perspective, experimental reproducibility was the first to appear (see chapter 1). The different aspects characterising it depend on each scientific field, meaning it is difficult to sketch out general principles. The philosopher John Norton argues that there is no general theory for inductive reasoning ([Norton, 2010](https://www.journals.uchicago.edu/doi/abs/10.1086/656542)), and that each scientific field has its own methodological criteria, particularly with regard to reproducibility (see chapter 2).
By observing particle physicists and immunologists, the sociologist Karin Knorr-Cetina pointed out that different scientific fields have different “**epistemic cultures**”, which runs contrary to the idea of there being just one scientific method ([Knorr-Cetina, 1999](https://www.hup.harvard.edu/catalog.php?isbn=9780674258945)). This is what philosophers of science have termed “the disunity of science” ([Galison & Stump, 1996](https://www.sup.org/books/title/?id=2121)). In much the same way, Collins, by invoking the experimenter’s regress, states that the question of reproducibility must first and foremost solve the problem of reaching consensus on what constitutes the same “experimental space”, defining what a group of researchers might agree on with regard to experimental validity. Within this space, consensus on reproducibility may emerge, but it will never be universal.
In “How Experiments End”, Galison pits two categories of scientists working in the same field, particle physics, against each other: those basing their trust on the trajectories actually observed in bubble chambers; and those who have more faith in Monte Carlo statistical processing and the repetition of calculations (given that anything can happen once). Here we have two different approaches to legitimacy cohabiting, the experimental and the statistical (or even computational), with different strategies for defining what can be considered reliable within the same scientific field ([Galison, 1987](https://www.press.uchicago.edu/ucp/books/book/chicago/H/bo5969426.html)).
### The genealogy of computational reproducibility
That said, reproducibility as a standard or as a requirement is often linked rhetorically to experimentation and to experimental discourse. The techniques employed by Boyle (see chapter 1) in order to convince people and to gain legitimacy were behind the foundation of the Royal Society and the concept of academic publishing, and have become the archetype for universal reproducibility through publication, proclaimed as the gold standard. Chronologically, computational reproducibility obviously only came later, yet computerisation helped to transform scientific practice in a wide variety of scientific fields. This was in some ways the “beginning of a new era” ([Nordmann et al., 2011](https://upittpress.org/books/9780822961635/)), with its own epistemic and technical aspects (computerisation, the transformation of instrumentation, the increase in the use of statistical processing in experimentation), as well as industrial and economic ones (the emergence of scientific entrepreneurship, technology transfer, etc.) ([Berman, 2011](https://press.princeton.edu/books/hardcover/9780691147086/creating-the-market-university)).
While statistical reproducibility is the topic of much discussion (see chapter 3), there is also a part of reproducibility that is computational without being statistical, and which is often rendered invisible within, or even confused with, statistics. Whether in data processing, modelling, computer science, or in practically all the electronic devices used in instrumentation, computation is omnipresent in science, often without scientists being aware of it.
Furthermore, computation is often reduced, even by its practitioners, to the processing of data by calculation, leading to confusion between the computational and the statistical. Indeed, the computational is often reduced to “that which processes data”: According to Goodman, Fanelli and Ioannidis, “Scientist Jon Claerbout coined the term and associated it with a software platform and set of procedures that permit the reader of a paper to see the entire processing trail from the raw data and code to figures and tables” ([Goodman et al., 2016](https://stm.sciencemag.org/content/8/341/341ps12)). However, even if they are entangled, the statistical and the computational do not raise the same issues regarding reproducibility.
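As an illustration of Claerbout’s idea of a “processing trail”, here is a minimal, hypothetical sketch in Python: the file names, numbers and processing steps are invented, and the only point is that every reported number can be regenerated from the raw data by rerunning one script.

```python
# raw data -> processed table -> reported numbers, in one scripted trail.
import csv
import random
import statistics

random.seed(2024)                      # fixed seed: the only source of randomness

# 1. "Raw data" (stand-in for a measurement file shipped with the paper).
raw = [(i, random.gauss(37.0, 0.5)) for i in range(100)]
with open("raw_measurements.csv", "w", newline="") as f:
    csv.writer(f).writerows([("sample_id", "temperature")] + raw)

# 2. Processing step, reading back ONLY what was written to disk.
with open("raw_measurements.csv", newline="") as f:
    rows = list(csv.reader(f))[1:]
temperatures = [float(t) for _, t in rows]

# 3. Reported results: the table numbers a reader should be able to recreate.
summary = {
    "n": len(temperatures),
    "mean": round(statistics.mean(temperatures), 3),
    "stdev": round(statistics.stdev(temperatures), 3),
}
with open("table1.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(summary.keys())
    w.writerow(summary.values())

print(summary)   # rerunning the script reproduces these numbers exactly
```

In practice, of course, the trail also depends on the software environment that runs the script, which is where the difficulties discussed below begin.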
### The characteristics of computational reproducibility
Computational reproducibility suffers from a lack of recognition on the part of the wider public: many have a superficial image of it as being infallible (2+2=4 is always true), and yet it too had its own “public crisis” moment during the “[climategate affair](https://fr.wikipedia.org/wiki/Incident_des_courriels_du_Climatic_Research_Unit)”. Leaked emails from climate science researchers spread panic throughout a community where credibility is a major political issue. The most controversial aspect of climategate was the use of “tricks” in the programming of models (the emails in question mentioned lines of code including subroutines known as “tricks”), which raised awareness of the problems posed by scientific programming ([Barnes, 2010](https://www.nature.com/articles/467753a)) and culminated in the [Science Code Manifesto](http://sciencecodemanifesto.org/) in 2011.
The majority of the characteristics of computational reproducibility concern software. The mutual dependence of computing libraries can prove a nightmare when it comes to reproducing a calculation ([Hinsen, 2018](https://www.practicereproducibleresearch.org/case-studies/khinsen.html)), as is shown by the recent example in [this paper](https://pubs.acs.org/doi/10.1021/acs.orglett.9b03216) of calculations carried out in chemistry, where the results differed depending on whether a Mac or a PC was used ([Gallagher, 2019](https://arstechnica.com/information-technology/2019/10/chemists-discover-cross-platform-python-scripts-not-so-cross-platform/)). Like data curation, the programming work (along with compilation, distribution, licensing, etc.) carried out by researchers is not rewarded by publication (unless it is in relation to their own research) ([Hocquet and Wieber, 2017](https://ieeexplore.ieee.org/document/8268025)). Indeed, the bulk of computational activity is carried out by scientists whose trade is neither coding nor software management, distribution or licensing.
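To make this kind of pitfall concrete, here is a minimal Python sketch with a hypothetical file pattern: in the chemistry case cited above, the discrepancy reportedly came from relying on the order in which files are listed, which is not guaranteed to be the same across operating systems and filesystems.

```python
import glob

def pick_reference(pattern="conformers/*.out"):
    """Pick a 'reference' file from a set of per-conformer output files.

    glob.glob() makes no promise about ordering: it reflects the underlying
    filesystem, so "the first file" can differ between a Mac and a PC.
    Sorting the list makes the choice explicit and platform-independent.
    """
    paths = sorted(glob.glob(pattern))      # deterministic on every platform
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern}")
    return paths[0]
```

Pinning down such implicit assumptions (file ordering, library versions, floating-point behaviour) is a large part of what computational reproducibility demands in practice.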
This is the difference between computing science (which produces and publishes programs) and computational science (in which programs are used). Furthermore, the software industry in general has been “in crisis” since the sixties ([Ensmenger, 2011](https://mitpress.mit.edu/books/computer-boys-take-over)): Unlike hardware, which has become increasingly powerful (see Moore’s law), software is still lagging behind and always costs more than expected. On top of that, interoperability is ever more complicated, and stability and consistency have never been achieved: as Ensmenger puts it, instead of being a turning point, the “[software crisis](https://en.wikipedia.org/wiki/Software_crisis)” has become a way of life, from the sixties right up to the present day.
Researchers are preoccupied with defining and organising good practice ([Bénureau and Rougier, 2018](https://www.frontiersin.org/articles/10.3389/fninf.2017.00069/full); [Stodden, 2016](https://onlinelibrary.wiley.com/doi/10.1002/9781118865064.ch9)), and those good practices are often inspired by “the essential freedoms of free software”, establishing a direct link between the principles of open source software and “open science”. Yet, in practice, scientific software can be the subject of tensions between academic and commercial standards, linked to this absence of any reward.
Those tensions materialise in the sale of packages (potentially encouraged by universities’ “technology transfer” policies) ([Hocquet and Wieber, 2017](https://ieeexplore.ieee.org/document/8268025)) or, conversely, in the enormous amounts of time and energy spent producing open source software, and in the inability to imagine a “business model” capable of satisfying epistemic demands for transparency while also protecting the code out of fear of the competition, or due to concerns over software stability. Those tensions can also be observed in licensing policies that vary depending on whether the users come from academia or industry, or in the tinkering of model parameters depending on the targeted users ([Wieber and Hocquet, 2018](https://arxiv.org/abs/1812.00995)).
We saw earlier that we can describe two visions of experimentation with differing relationships to the reliability of scientific instruments, one based on the transparency of the “home-made” instrument as a guarantee of reliability, the other based on trust in a commodified instrument whose use has become standard through widespread distribution. We can similarly describe two different visions of computational scientific reliability. The first sees software as being “user-oriented”. In such cases, programs are seen as tools whose hood we can lift to see how they work: transparency is the epistemic virtue used to ensure reproducibility, as opposed to a “black box”. The other sees software as being “market-oriented”: it is produced in an industrial context and trust is based on the robustness of an industrial product with a standard form, even if it is proprietary. In this second case, reproducibility is based on the assurance of reliability through an imposed industrial standard (Hocquet and Wieber, 2020). In particular, the proliferation of different versions of the same program is seen as a problem for reproducibility in that it has a negative impact on the stability of these versions and, as a result, on standardisation. The paradox here is that the possibility of reusing and creating other versions is one of the principles of open source software on which “open science” is based. Open and reproducible are not necessarily synonymous.
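One modest practice that tries to reconcile the openness of reuse with the stability prized by the “market-oriented” vision is to record, alongside every result, exactly which environment produced it. The sketch below is only an illustration under assumed names (the package list and the environment.json file are invented), not a prescription.

```python
import json
import platform
import sys
from importlib import metadata

def record_environment(packages=("numpy", "scipy", "pandas")):
    """Write a small provenance file alongside the results.

    The package list is only an example; the point is to pin down which
    versions of which libraries actually produced a given figure or table.
    """
    env = {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": {},
    }
    for name in packages:
        try:
            env["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            env["packages"][name] = "not installed"
    with open("environment.json", "w") as f:
        json.dump(env, f, indent=2)
    return env

if __name__ == "__main__":
    print(record_environment())
```

Whether such provenance records are enough, or whether entire environments need to be archived and redistributed, is precisely where the tension between openness and standardisation resurfaces.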
In conclusion, this overview of reproducibility should not prevent anyone from trying to come up with ways of building trust in science. On the contrary, questioning practices, standards and perceptions within scientific activity, as well as in computation, should help in achieving this goal.