# Biostatistics

Biostatistics (or biometry) is the application of statistics to a wide range of topics in biology. The science of biostatistics encompasses the design of biological experiments, especially in medicine, pharmacy, agriculture and fishery; the collection, summarization, and analysis of data from those experiments; and the interpretation of, and inference from, the results. A major branch of this is medical biostatistics,[1] which is exclusively concerned with medicine and health.

## History

Biostatistical reasoning and modeling were of critical importance to the foundational theories of modern biology. In the early 1900s, after the rediscovery of Gregor Mendel's work on inheritance, the gaps in understanding between genetics and evolutionary Darwinism led to vigorous debate between biometricians, such as Walter Weldon and Karl Pearson, and Mendelians, such as Charles Davenport, William Bateson and Wilhelm Johannsen. By the 1930s, statisticians and models built on statistical reasoning had helped to resolve these differences and to produce the neo-Darwinian modern evolutionary synthesis.

The leading figures in the establishment of population genetics and this synthesis all relied on statistics and developed its use in biology.

These individuals and the work of other biostatisticians, mathematical biologists, and statistically inclined geneticists helped bring together evolutionary biology and genetics into a consistent, coherent whole that could begin to be quantitatively modeled.

In parallel to this overall development, the pioneering work of D'Arcy Thompson in On Growth and Form also helped to add quantitative discipline to biological study.

Despite the fundamental importance and frequent necessity of statistical reasoning, there may nonetheless have been a tendency among biologists to distrust or deprecate results which are not qualitatively apparent. One anecdote describes Thomas Hunt Morgan banning the Friden calculator from his department at Caltech, saying "Well, I am like a guy who is prospecting for gold along the banks of the Sacramento River in 1849. With a little intelligence, I can reach down and pick up big nuggets of gold. And as long as I can do that, I'm not going to let any people in my department waste scarce resources in placer mining."[2]

## Scope and training programs

Almost all educational programs in biostatistics are at postgraduate level. They are most often found in schools of public health, affiliated with schools of medicine, forestry, or agriculture, or as a focus of application in departments of statistics.

In the United States, where several universities have dedicated biostatistics departments, many other top-tier universities integrate biostatistics faculty into statistics or other departments, such as epidemiology. Thus, departments carrying the name "biostatistics" may exist under quite different structures. For instance, relatively new biostatistics departments have been founded with a focus on bioinformatics and computational biology, whereas older departments, typically affiliated with schools of public health, will have more traditional lines of research involving epidemiological studies and clinical trials as well as bioinformatics. In larger universities where both a statistics and a biostatistics department exist, the degree of integration between the two departments may range from the bare minimum to very close collaboration. In general, the difference between a statistics program and a biostatistics program is twofold: (i) statistics departments will often host theoretical and methodological research, which is less common in biostatistics programs, and (ii) statistics departments have lines of research that may include biomedical applications but also other areas, such as industry (quality control), business and economics, and biological areas other than medicine.

## Recent developments in modern biostatistics

The advent of modern computer technology and relatively cheap computing resources has enabled computer-intensive biostatistical methods such as bootstrapping and other resampling methods. Furthermore, new biomedical technologies such as microarrays, next-generation sequencers (for genomics) and mass spectrometry (for proteomics) generate enormous amounts of (often redundant) data that require biostatistical methods to analyze. For example, a microarray can measure the expression of all the genes of the human genome simultaneously, but only a fraction of them will be differentially expressed in diseased vs. non-diseased states.

One might encounter the problem of multicollinearity: because of high intercorrelation between the predictors (in this case, genes), the information carried by one predictor might be contained in another one. It could be that only 5% of the predictors are responsible for 90% of the variability of the response. In such a case, one would apply a dimension reduction technique (for example, principal component analysis).

Classical statistical techniques such as linear or logistic regression and linear discriminant analysis do not work well for high-dimensional data (i.e. when the number of observations n is smaller than the number of features or predictors p: n < p). In fact, one can obtain quite high R² values despite very low predictive power of the statistical model. These classical techniques (especially least squares linear regression) were developed for low-dimensional data (i.e. where the number of observations n is much larger than the number of predictors p: n >> p). In cases of high dimensionality, one should always consider an independent validation test set and report the residual sum of squares (RSS) and R² of the validation test set, not those of the training set.
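The warning about training-set R² when n < p can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not a recommended analysis pipeline: the data are synthetic, the response is pure noise that is unrelated to any predictor, and the function names (`solve`, `r_squared`, `simulate`) and sample sizes are hypothetical choices made for this example. It fits the minimum-norm least squares solution using only the Python standard library and compares R² on the training set with R² on an independent validation set.

```python
import random


def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def r_squared(y, y_hat):
    """R^2 = 1 - RSS/TSS; it can be negative on data not used for fitting."""
    mean_y = sum(y) / len(y)
    rss = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    tss = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - rss / tss


def simulate(n_train=20, n_val=200, p=100, seed=0):
    """Fit least squares with n_train < p on pure-noise data; return (train R^2, validation R^2)."""
    rng = random.Random(seed)

    def draw(n):
        X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
        y = [rng.gauss(0, 1) for _ in range(n)]  # response independent of X
        return X, y

    X_tr, y_tr = draw(n_train)
    X_va, y_va = draw(n_val)

    # Minimum-norm least squares solution for n < p: beta = X^T (X X^T)^{-1} y
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in X_tr] for ri in X_tr]
    alpha = solve(gram, y_tr)
    beta = [sum(alpha[i] * X_tr[i][j] for i in range(n_train)) for j in range(p)]

    def predict(X):
        return [sum(b * x for b, x in zip(beta, row)) for row in X]

    return r_squared(y_tr, predict(X_tr)), r_squared(y_va, predict(X_va))


if __name__ == "__main__":
    r2_train, r2_val = simulate()
    print(f"training R^2:   {r2_train:.3f}")
    print(f"validation R^2: {r2_val:.3f}")
```

Because there are more predictors than training observations, the fit reproduces the training responses essentially exactly, so the training R² is close to 1 even though the predictors carry no information about the response. Only the validation R² reveals the lack of predictive power.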

In recent times, random forests have gained popularity. This technique, invented by the statistician Leo Breiman, builds a large number of decision trees, each grown on a bootstrap sample of the data with randomly selected predictors, and combines their predictions. It can be used for classification (where the response is on a nominal or ordinal scale) as well as for regression (where the response is on an interval or ratio scale). An individual decision tree has the advantage that it can be drawn and interpreted, even with a very basic understanding of mathematics and statistics. Random forests have thus been used for clinical decision support systems.
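The idea of combining many randomized trees can be sketched in a few lines. The example below is a deliberately simplified illustration, not Breiman's full algorithm: each "tree" is only a one-split decision stump, the data are synthetic, and all names (`fit_stump`, `fit_forest`, `predict`, `make_data`) and parameter values are hypothetical. It keeps the two sources of randomness, a bootstrap sample of the observations and a random subset of the predictors, and classifies by majority vote.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent label (ties broken by first occurrence)."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X, y, features):
    """Fit a one-split tree: choose the (feature, threshold) with the fewest training errors."""
    best = None
    for j in features:
        for t in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = (sum(lab != l_lab for lab in left)
                      + sum(lab != r_lab for lab in right))
            if best is None or errors < best[0]:
                best = (errors, j, t, l_lab, r_lab)
    if best is None:  # no valid split: always predict the majority class
        m = majority(y)
        return (features[0], float("inf"), m, m)
    return best[1:]


def fit_forest(X, y, n_trees=50, n_features=2, seed=0):
    """Grow n_trees stumps, each on a bootstrap sample and a random predictor subset."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of rows
        feats = rng.sample(range(p), n_features)    # random subset of predictors
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Classify one observation by majority vote over all stumps."""
    votes = [l_lab if row[j] <= t else r_lab for j, t, l_lab, r_lab in forest]
    return majority(votes)


def make_data(n, rng):
    """Synthetic two-class data: predictors 0 and 1 are informative, 2 and 3 are noise."""
    X, y = [], []
    for _ in range(n):
        label = rng.randrange(2)
        X.append([rng.gauss(3 * label, 1), rng.gauss(3 * label, 1),
                  rng.gauss(0, 1), rng.gauss(0, 1)])
        y.append(label)
    return X, y


if __name__ == "__main__":
    rng = random.Random(1)
    X_train, y_train = make_data(100, rng)
    X_test, y_test = make_data(200, rng)
    forest = fit_forest(X_train, y_train)
    hits = sum(predict(forest, r) == lab for r, lab in zip(X_test, y_test))
    print(f"test accuracy: {hits / len(y_test):.2f}")
```

Each fitted stump is a tuple (feature index, threshold, left label, right label), so any single tree can still be read off directly, while the vote across trees reduces the influence of stumps that happened to be grown on uninformative predictors.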

Gene Set Enrichment Analysis (GSEA) is a method for analyzing biological high-throughput experiments. With this method, one does not consider the perturbation of single genes but of whole (functionally related) gene sets. These gene sets might be known biochemical pathways or otherwise functionally related genes. The advantage of this approach is that it is more robust: it is more likely that a single gene is falsely found to be perturbed than that a whole pathway is falsely found to be perturbed. Furthermore, one can integrate the accumulated knowledge about biochemical pathways (like the JAK-STAT signaling pathway) using this approach.
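One simplified way to score a whole gene set, in the spirit of the running-sum statistic used in GSEA, is sketched below. This is an illustration under stated assumptions rather than the published procedure: it uses an unweighted running sum (equivalent to a Kolmogorov-Smirnov-type statistic), it assesses significance by drawing random gene sets of the same size rather than by permuting sample labels, and the gene identifiers and function names (`enrichment_score`, `permutation_p_value`) are hypothetical.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum score: walk down the ranked list, stepping up at
    genes in the set and down otherwise; return the largest deviation from zero."""
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Compare the observed score with scores of random gene sets of equal size."""
    rng = random.Random(seed)
    gene_set = set(gene_set)
    observed = enrichment_score(ranked_genes, gene_set)
    k = len(gene_set & set(ranked_genes))
    extreme = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(ranked_genes, k))
        if abs(enrichment_score(ranked_genes, random_set)) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero
    return observed, (extreme + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical ranking of 100 genes, most up-regulated first
    ranked = [f"g{i}" for i in range(100)]
    pathway = {f"g{i}" for i in range(10)}  # a set clustered at the top
    es, p = permutation_p_value(ranked, pathway)
    print(f"enrichment score: {es:.3f}, permutation p-value: {p:.4f}")
```

A set whose members cluster at one end of the ranking yields a score near +1 or -1, whereas a set scattered throughout the list yields a score near 0. This reflects the robustness argument above: a single misranked gene changes the running sum only slightly.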