by David L. Duffy
I have been reading some papers from the philosophy of biology on “causal selection”. That is, the identification of those entities or variables that are the most important explanation of the behaviour of some feature of a complex biological system, which might be a single organism or a population of organisms. For example, geneticists single out (variation in) genes as the most important causative factor in biological evolution and development. The thesis of “causal parity” or “causal democracy” is that such a focus is, roughly, an outcome of scientific short-sightedness rather than a recognition of a natural kind.
Scientists and statisticians are as interested in the nature of causation as philosophers are. A key paper by Richard Lewontin is “The Analysis of Variance and the Analysis of Causes” (with the brief title “Analysis of Variance in Human Genetics”), which argues that the statistical machinery used by quantitative and population geneticists, outside of experimental data, gives little information about genes as causes. It was written in the midst of the controversies about the genetics of intelligence and the interpretation of the mean differences in IQ between racial groups arising from Jensen’s 1969 paper. The statistical approach he discusses is, as the title says, analysis of variance (ANOVA).
ANOVA and related approaches have been the workhorses of statistical analysis of scientific data since 1918, when the method was introduced by Sir Ronald Fisher to reconcile Mendelian and Biometrical genetics. If I plug “ANOVA” into Google Scholar I obtain 2.5 million hits, though if I do the same in PhilPapers, I get 89, of which only nine discuss the relationship between the analysis of variance worldview (for that is what it is) and causation. Northcott says “[t]he philosophical literature specifically on causal efficacy is relatively sparse…weighing up the relative importance of genes and environment is a more complex matter than it was for gravity and electricity…However, all is not completely lost since in response to these conceptual difficulties biologists have developed instead a slightly different understanding of causal efficacy.” Northcott specifically follows the counterfactual interpretation of ANOVA, which statisticians have generally embraced, starting from Rubin and Holland. As Clark Glymour’s invited commentary has it, “Holland’s paper is as much philosophical as it is statistics”. In the following, I will try to sketch these methods very simply, enough that you can follow Lewontin’s objections.
So what is different about analysis of variance, and how is it connected with causal inference? The first thing to say is that it is both the name of a particular set of numerical techniques that preceded access to modern computers and the more general likelihood-based approach that incorporates correlation and various types of linear regression. The variance is the average squared difference in a measured variable between all possible pairs of members of a given collection of individuals (a population). So it is common to state that any causal conclusions derived from an analysis of variance are conclusions about the causes of individual differences, rather than the causes of the measured variable itself. This then means such conclusions are restricted to the population under study, or a family of populations related to the observed population – a lot of statistical theory assumes we observe a sample or subset of a population, and we wish to make inferences about the rest of the population.
If one has multiple samples or populations, one compares the average squared distances within each population to the overall average squared distances across the entire collection of individuals. In the original numerical methods of (one-way) ANOVA, one calculates the Within-Group Mean Squares and Total Mean Squares, and the statistical tests are based on the (appropriately weighted) ratios of these types of summary statistics. In many settings, the different populations are defined by their value at a second variable, the putative causative factor. For assessment of causal efficacy or importance of the factors that define these groups, we standardize the magnitude of these ratios to give measures such as the R2, intraclass correlation, or in genetic settings, the heritability (h2).
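As a toy illustration of these quantities (all numbers invented), the within-group and total mean squares and the R2 ratio for a one-way layout can be computed directly:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical heights (cm) in three subpopulations ("groups"),
# defined by some putative causative factor; means and spread are made up.
groups = [rng.normal(loc=mu, scale=6.0, size=50) for mu in (168.0, 172.0, 176.0)]

all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()

# Within-Group Sum of Squares: squared deviations from each group's own mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
df_within = sum(len(g) - 1 for g in groups)

# Total Sum of Squares: squared deviations from the grand mean
ss_total = ((all_obs - grand_mean) ** 2).sum()
df_total = len(all_obs) - 1

ms_within = ss_within / df_within
ms_total = ss_total / df_total

# Between-group component, and R^2 as the standardized measure of
# how much of the total variation the grouping factor "explains"
ss_between = ss_total - ss_within
r_squared = ss_between / ss_total
print(f"MS within = {ms_within:.2f}, MS total = {ms_total:.2f}, R^2 = {r_squared:.3f}")
```

With these made-up group means, the grouping factor accounts for roughly a fifth of the total variation; shrink the gap between group means and R2 falls toward zero.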
If you are at all familiar with linear regression and ANOVA, you might be thinking I have completely skipped a part of this type of mathematical modelling. After all, these models are usually presented as applicable to an individual, providing an estimate of a systematic effect of the putative causative factor, usually as a “fixed effect” (say, 4 cm taller if one’s father comes from a high UK social class rather than a low social class), along with an environmental (or “error”) variance representing how much the unmeasured factors could push one further up or down around that average effect (environmental noise). But this numerical estimate assumes everything else stays the same. The homogeneity of variance assumption is that within-subpopulation variances are all equal, so that any differences are driven by differences in the means between the groups, eg people from high SES subpopulations are taller on average than those from low SES subpopulations. The additivity assumption is that a short person gains 4 cm in height as they move from low SES to high SES (counterfactually, or possibly by social mobility at the right age), in the same way a tall person would – all other things being equal. The no-interaction assumption is that if there are multiple causes acting, then, if you did measure them, they would just “add onto” the effects of your variable of interest, so the same increase in protein intake would have a constant effect in short and tall people alike, and this would be the same in low SES and high SES groups.
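These assumptions can be made concrete in a small simulation (the 4 cm SES effect, the baseline, and the noise scale are purely illustrative numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Putative causative factor: 0 = low SES, 1 = high SES (hypothetical)
ses = rng.integers(0, 2, size=n)

beta_ses = 4.0   # illustrative fixed effect: +4 cm for high SES
baseline = 172.0 # made-up population baseline height (cm)
noise_sd = 7.0   # "environmental" (error) standard deviation, also made up

# Additivity + homogeneity of variance: the same effect is added for
# everyone, and the same noise distribution applies in both groups
height = baseline + beta_ses * ses + rng.normal(0.0, noise_sd, size=n)

# Group means differ by ~beta_ses; within-group variances are ~equal
print(height[ses == 1].mean() - height[ses == 0].mean())  # ≈ 4
print(height[ses == 0].var(), height[ses == 1].var())     # both ≈ 49
```

Violating the assumptions would mean, for instance, making `beta_ses` depend on a person's other characteristics (interaction) or giving the two groups different `noise_sd` (heterogeneity of variance).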
Now one can generalize models to allow all these assumptions to be discarded, but you then need more information and things become messy to understand. Or, in many areas of application, you can hold all the other relevant factors constant eg feed all the animals the same food. Now back to inference about variances rather than means.
A genetic example
In a simple-minded genetic ANOVA of a collection of monozygotic (MZ, that is, genetically identical) twins who have been separated at birth and adopted into different families, we measure the average differences within each pair for a quantitative trait (the geneticist’s term for a metrical variable that is a property of an organism) and compare it to the average differences between all members of the collection. If the average within-pair difference is zero (so every MZ twin is exactly the same height as their co-twin), and the total average difference is greater than zero, the heritability, (VT − VW)/VT (where VW is the within-pair and VT the total variance), is 1; while if the average within-pair difference is the same as the total average difference, the heritability will be zero. As per the additivity assumption above, the point of this emphasis on differences is that it is invariant to causative factors that affect all population members equally. If everybody’s diet improves equally, then population mean height increases (quite rapidly per generation, eg in Eastern Europe), but the causes of differences in height between individuals might remain completely unaffected.
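A sketch of this calculation, simulating hypothetical reared-apart MZ pairs where I have built in a genetic variance of 0.7 out of a total of 1.0, so the estimate should land near h2 = 0.7:

```python
import numpy as np

rng = np.random.default_rng(1)
n_pairs = 500

# Each pair shares one "genetic value"; each twin gets independent
# environmental noise (variances 0.7 and 0.3 are assumptions).
genetic = rng.normal(0.0, np.sqrt(0.7), size=n_pairs)
env = rng.normal(0.0, np.sqrt(0.3), size=(n_pairs, 2))
twins = genetic[:, None] + env  # shape (n_pairs, 2)

# Within-pair variance: half the mean squared within-pair difference
v_within = ((twins[:, 0] - twins[:, 1]) ** 2).mean() / 2.0
v_total = twins.var()

h2 = (v_total - v_within) / v_total
print(f"h2 ≈ {h2:.2f}")
```

Setting the environmental noise to zero makes every within-pair difference vanish and drives the estimate to 1; making the shared genetic value constant across pairs drives it to 0.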
To infer anything about causality in this observational natural experiment, we have to make various assumptions. For example, we assume that members of a twin pair are not more likely to be adopted into families similar in terms of environmental factors affecting height, say diet. If we assume that the fact that monozygotic twins are genetically identical is the sole causative pathway, then partitioning an existing population into twin families (as opposed to adoptive families) allows us to infer that genes are the important cause of differences between individuals in that population. If genes are not important, then this partitioning does not give rise to a significant test statistic. The counterfactuals here are all the other ways we could have partitioned the population (permutations or randomizations), and (one type of) statistical testing looks at the proportion of times we would have observed our outcome among all these other alternatives (a set of possible worlds differing only in which family the same set of individuals – with respect to height – dropped into).
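One such permutation (randomization) test can be sketched as follows: re-deal the same simulated individuals into random “families” and ask how often a random partition looks as twin-like as the observed one (all variance figures are again invented):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated twin pairs: shared genetic value plus individual noise
n_pairs = 100
genetic = rng.normal(0.0, 1.0, size=n_pairs)
twins = genetic[:, None] + rng.normal(0.0, 0.5, size=(n_pairs, 2))

def within_pair_ms(data):
    """Mean squared within-pair difference under a given pairing."""
    return ((data[:, 0] - data[:, 1]) ** 2).mean()

observed = within_pair_ms(twins)

# Counterfactual partitions: shuffle the same individuals into new pairs
flat = twins.ravel()
n_perm = 2000
count = 0
for _ in range(n_perm):
    shuffled = rng.permutation(flat).reshape(n_pairs, 2)
    if within_pair_ms(shuffled) <= observed:
        count += 1

p_value = count / n_perm  # proportion of random pairings at least as "twin-like"
print(f"observed within-pair MS = {observed:.3f}, permutation p ≈ {p_value:.4f}")
```

Because the true pairing really does capture a shared cause, essentially no random re-partitioning produces pairs as similar as the genuine twin pairs, and the p-value is near zero.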
One property of this type of analysis is that we don’t actually measure the proximate cause of changes in height, which is the whole suite of thousands of particular relevant genetic variants present in any one individual, but also all the external causal factors (diet etc). The more common classical twin study uses monozygotic (MZ) and dizygotic (DZ) twin pairs who have been reared together to calculate ratios of all four mean squares in order to estimate effects of both genes and family environment.
So what are Lewontin’s and others’ criticisms here? Firstly, that the relative size of genetic effects depends on the environment the organism is in. So the heritability, being a ratio, will fall or rise if genetic effects remain constant while non-genetic factors increase or decrease in size, or the proportion of the population affected by any single cause alters. Similarly, it will remain constant when genetic and non-genetic factors move together. Therefore, the criticism is that this is not a consistent measure of causal efficacy. Further, genes do interact with environmental causes in a nonlinear fashion – we know this is the case for some individual genes (the norm of reaction). In the reared-apart MZ twin design, we know nothing about the environmental causes acting in any one individual, let alone how they might interact with the specific genotype of that individual. We are essentially averaging over all those different interactions. If MZ and DZ twins are available, these kinds of effects do lead to different total variances in the MZ and DZ groups, but one might need very large sample sizes to reliably detect this difference.
A third point is that the variation in genetic or environmental effects is limited to what is actually present in the population. That is, it gives you no idea about what a single causative factor like a particular allele (genetic variant) might do if it became much more common in the population.
A second genetic example
When autosomal genes are transmitted from parents to offspring (in sexual populations like humans), only one allele (of two possibilities) is transmitted from each parent. This segregation appears to be well randomised – there are good reasons why segregation distorting genes will be generally selected against. This means that for a genetically controlled phenotype like height, the standardised ratio of mean squares in a partitioning of the population into parent-child families can never exceed half that for MZ twins.
In terms of causation, where does the difference between one parent and the child come from? Some of it is coming from the reduction of the parental two-allele (diploid) genotype to a one-allele (haploid) genotype in the transmitted gamete (sperm or egg). So the difference is (partly) arising from an absence. Is this just an artefact of our bookkeeping?
Before discussing this, I will mention the case of genes on the sex chromosomes. For humans, females are X/X (received one X chromosome from each parent), and males are X/Y (a Y was inherited from the father). The Y chromosome contains a very few genes that match up to those on the X (the pseudo-autosomal genes). Therefore, for almost all genes on the X for, say, height, the male genotype is haploid, and that in a female is diploid. To match up the gene dose in each sex, one X chromosome in the nuclei of cells in the tissues of a female is randomly inactivated, so females too end up haploid (but in a patchwork fashion, eg calico cats, colour blindness across the retina in carrier women). Again, the parent-child difference is due to absence of a gene.
So segregation of alleles in meiosis and X-inactivation are physical processes that cause a reduction in parent-offspring similarity by injecting randomness. In the former case this randomness can be removed by carrying out completely inbred matings, as in animal breeding experiments. Other animals can swap between sexual and asexual reproduction, switching between bearing an ordinary offspring or an “MZ twin” of themselves (which leads into the question of what the value of randomization in reproduction is). When I think of randomness, I think of a source of extra variance in a population – I can usually only know of it by the distribution of its effect over multiple observations.
In the figure below, one of the most famous in the history of statistics, Sir Francis Galton plots the stature of men versus that of their adult sons. The red dot is the mean height for each group, the black diagonal line is where the points would fall for MZ twins with h2=1, the red line is the “line of best fit” predicting son from father, and the angle between the red and black lines is proportional to half the heritability, where the “other half” is the random segregation variance.

Father and son heights from Galton 1886
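The regression behind the figure can be mimicked in a toy parent-offspring model (h2 = 0.8 is assumed purely for illustration; heights are in standardized units), where the slope of son on father comes out at roughly h2/2:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
h2 = 0.8  # assumed heritability, for illustration only

# Father: additive genetic value plus environmental deviation
father_g = rng.normal(0.0, np.sqrt(h2), size=n)
father = father_g + rng.normal(0.0, np.sqrt(1 - h2), size=n)

# Son inherits half the father's genetic value; the rest of his genetic
# variance comes from the mother's contribution plus random segregation
son_g = father_g / 2 + rng.normal(0.0, np.sqrt(3 * h2 / 4), size=n)
son = son_g + rng.normal(0.0, np.sqrt(1 - h2), size=n)

# Regression slope of son on father estimates h2/2
slope = np.cov(father, son)[0, 1] / father.var()
print(f"slope ≈ {slope:.2f} (expected h2/2 = {h2 / 2})")
```

This is the sense in which the red line in Galton's plot sits at half the angle it would for MZ twins: only half of each parent's genotype is transmitted.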
Back to the ANOVA viewpoint
The machinery of ANOVA underlies population and evolutionary genetics, with its emphasis on populations of organisms and of genotypes. For example, in the univariate Breeder’s Equation, the heritability of the trait, h2, determines the (short term) response to artificial or natural selection:
R = h2 S
The S is the selection differential: the difference in the mean of the selected trait between the current total population and the proportion of it that will be mated to produce the next generation. The R is the difference in mean trait value between the resulting next generation and the current total population. You can see from the Figure why it is so simple. There is no mention of the identity and number of genes actually involved in the causation of the trait – in fact, two standard mathematical models assume either one gene, or an infinite number of genes each of tiny effect. One does need a sample or population to estimate the heritability, but once estimated it can be applied to a single mating or parent-offspring pair from that population to predict the most likely offspring trait value. In the completely inbred case, there will be no response to selection, because any differences between individual trait values in the parental generation are completely environmental, that is, h2=0.
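A quick simulated check of the Breeder’s Equation under truncation selection (the heritability and the selection proportion are arbitrary choices for the sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
h2 = 0.5  # assumed heritability

# Toy population: trait = additive "breeding value" + environment
breeding = rng.normal(0.0, np.sqrt(h2), size=n)
trait = breeding + rng.normal(0.0, np.sqrt(1 - h2), size=n)

# Truncation selection: only the top 20% on the trait are bred
cutoff = np.quantile(trait, 0.8)
selected = trait > cutoff

S = trait[selected].mean() - trait.mean()  # selection differential
R_predicted = h2 * S                       # breeder's equation R = h2 S

# The realised response is the shift in mean breeding value among the
# selected parents, since offspring inherit breeding values on average
R_realised = breeding[selected].mean() - breeding.mean()
print(f"S = {S:.3f}, predicted R = {R_predicted:.3f}, realised R ≈ {R_realised:.3f}")
```

Setting `h2 = 0` (the completely inbred case, where all trait differences are environmental) makes the breeding values constant and the realised response vanish, however hard we select.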
How does this fit in with Lewontin’s points? In artificial selection, the goal-directed choice of inputs that drives the change in the population trait mean is the selection process itself, ie the partitioning of the population into those allowed to breed. However, this has zero effect unless the heritability is non-zero, that is, unless the mechanics of meiosis transmit the parental phenotype to the offspring in the correct fashion. Famously, there are lots of modes of parent-offspring transmission that will not lead to successful selection.
So in one sense, in the absence of a selective pressure, the heritability is a potentiality of that trait in that population, with this particular array of environmental causes and distribution of genotypes. This sounds fairly abstract, and perhaps not a good candidate for being an entity directly relevant to causation. And since h2 is a ratio, the input of other causes – again, say, diet for height – can lead to shifts in the population’s response to selection.
But in another sense, for any one individual, the heritability is a measure over the counterfactuals of how you could have developed conditional on having exactly the same set of genotypes you have, but in a different environment selected from the range of those represented in your current population. So if heritability is high, then the current range of environments doesn’t buffet this trait around much, again as a ratio comparing it to the total variance.
The interventionist theory of causation (eg Woodward 2003 ) has it that an entity is a direct cause if manipulation of it leads to alteration in downstream traits. In a third sense, my description of a high heritability trait above is that the effects of manipulation of the relevant genotypes are larger than those of manipulation of the environment, provided – again! – that those manipulations are of the same size as those previously seen in the relevant population. At the molecular genetic level, causation is a low bar, in that there are suggestions that every gene can affect every trait, and so one measure of causal importance is to scale this relative to the population trait variance.
What do we think of Lewontin’s criticisms overall? A “naive” hereditarianism holds that a high heritability trait is preordained from birth, that is, that modifications of the environment will not lead to great changes in the value of that trait. This was the conclusion that Jensen had come to regarding investment in education  that Lewontin was, in part, reacting to.
Now the first caveat we have seen is that environment-based interventions could be much larger than those seen on average in the population. One point of view in the “IQ wars” was that education being offered was grossly suboptimal, especially for disadvantaged kids. A simple ANOVA model might predict that decreasing the variation in quality of education will actually increase heritability, as the contribution from genes to the ratio will be relatively larger.
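The arithmetic behind that prediction is simple, since h2 is just a ratio of variance components (the numbers below are, of course, invented):

```python
# Minimal sketch of the ratio effect: hold the genetic variance fixed
# and shrink the environmental variance (eg by equalising schooling).
v_genetic = 0.5     # genetic variance component, held constant
v_env_before = 0.5  # environmental variance before the intervention
v_env_after = 0.2   # environmental variance after equalising environments

h2_before = v_genetic / (v_genetic + v_env_before)
h2_after = v_genetic / (v_genetic + v_env_after)

print(h2_before)           # 0.5
print(round(h2_after, 3))  # 0.714
```

Nothing about the genes has changed, yet the heritability rises, which is exactly Lewontin's point about ratios as measures of causal efficacy.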
But our models need to be as complex as reality, precisely because high IQ parents will offer better education to their offspring, so our simple assumption of “no interaction” is incorrect. In Kong and coworkers’ (2018) study, which measured genotypes at 120 genes to predict a person’s education level, one-quarter of the correlation between parent and child educational attainment was mediated by the alleles not transmitted during segregation. That is, parental educational achievement is predictable from genes (to some extent), and higher parental educational achievement leads to higher parental social position, so a proportion of the apparent heritability of educational achievement in the children is indirectly mediated by parental social position, or rearing practices that track social position.
Is it incorrect to call the combined contributions of parental genes to the child’s educational outcome (direct and indirect) “genetic”? It is if one is trying to predict the size of response to an environmental intervention applied to the children (such as Head Start). Or if one is making a political point about differences in attainment by school when thinking about social class and teacher pay rates. But the label may be less important than the causal pathway the Kong et al genetic analysis reveals – it suggests that environmental interventions improving parental education should lead to improvements for the children in the same way a far less feasible genetic change would. Like many claims in the social sciences, this might seem pretty obvious, but in fact we don’t see such indirect effects for many other similar traits.
So do nonexperimental studies tell us anything about causation? I think definitely. Nobody doubts what Galton’s graph of parent and offspring heights is telling us about the thousands of causative factors controlling human stature. Are such physical traits different in terms of complexity of causation, from say human behaviour, or weather? The ANOVA view of the world is that such a multiplicity is not a difference in kind.
David L. Duffy is a research scientist who works on the statistical genetics and genetic epidemiology of traits ranging from cancer to personality. As a result, he feels qualified to have an opinion on everything. He practiced medicine sometime back in the previous millennium, and has read far too much science fiction. You can see lists of publications and other stuff (even a couple of pastels) at https://genepi.qimr.edu.au/Staff/davidD/, and some of what he’s been reading (or doing) lately at http://users.tpg.com.au/davidd02/
 Weber M (2005). Genes, causation and intentionality. History and philosophy of the life sciences. Jan 1:407-20.
Kitcher P (2001). Battling the Undead. How (and How Not) to Resist Genetic Determinism. In: Singh R., Krimbas C., Paul D.B. and Beatty J. (ed.), Thinking About Evolution: Historical, Philosophical and Political Perspectives, Cambridge: Cambridge University Press, 396-414.
 Lewontin RC (1974). The analysis of variance and the analysis of causes. Am J Hum Genet 26: 400–411. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1762622/
 Jensen AR (1969). How much can we boost IQ and scholastic achievement? Harvard Educ Rev 39:1-123.
 Bourget D, Chalmers D (editors). PhilPapers. https://philpapers.org/s/%22analysis%20of%20variance%22. Accessed 2019-May-29.
 Northcott R (2006). Causal efficacy and the analysis of variance. Biology and Philosophy 21:253–276. http://philsci-archive.pitt.edu/15410/1/BioPhil2006.pdf
 Rubin DB (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. J Educational Psych 66:688-701. http://www.fsb.muohio.edu/lij14/420_paper_Rubin74.pdf
 Holland PW (1986). Statistics and causal inference. J Am Statist Assoc 81: 945-60.
 Holland PW, Glymour C, Granger C (1985). Statistics and causal inference. ETS Research Report Series 1985(2):i-72.
 Likelihood based methods are those most closely related to information theory and Bayesianism about knowledge. Least-squares and moment-based approaches make fewer assumptions about distribution, but overlap with so-called pseudolikelihood methods.
 Usually, we use ordinary Euclidean distances, but all kinds of other ways of measuring distance can be applied, eg in kernel Hilbert spaces. The ambitious figure from Székely & Rizzo (2017) actually refers to an approach that utilizes distances between distances. Székely GJ, Rizzo ML (2017). The Energy of Data. Annu Rev Stat Appl 4:447-79.
 Bann D, et al (2018). Socioeconomic inequalities in childhood and adolescent body-mass index, weight, and height from 1953 to 2015: an analysis of four longitudinal, observational, British birth cohort studies. The Lancet Public Health 3:e194-e203. https://doi.org/10.1016/S2468-2667(18)30045-8
 Because of the absence of specification of the genes involved (other than broadly), the breeder’s equation is so general that it applies not only to simulations of biology, but also to solving abstract mathematical problems where one has the goal of maximising a figure of merit – genetic algorithms. In these, the “genes” are rules or partial solutions to a problem, which we wish to combine into a single best solution. Interested readers can also check out the very abstract information-geometrical interpretations of Fisher’s Fundamental Theorem of Natural Selection (which explains why there is always more genetic variance to select upon). https://en.wikipedia.org/wiki/Fisher%27s_fundamental_theorem_of_natural_selection
 For example transgenerational epigenetic inheritance probably only lasts a few generations in the absence of persisting environmental selection. As it happens, presence and extent of epigenetic inheritance is under genetic control.
 Woodward, J. (2003), Making Things Happen. New York: Oxford University Press.
 Jensen AR (1969). How much can we boost IQ and scholastic achievement? Harvard Educ Rev 39:1-123.
 Kong A, Thorleifsson G, Frigge ML, Vilhjalmsson BJ, Young AI, Thorgeirsson TE, Benonisdottir S, Oddsson A, Halldorsson BV, Masson G, Gudbjartsson DF (2018). The nature of nurture: Effects of parental genotypes. Science 359:424-8.
This is a genome-wide association analysis, and does not rely on some of the assumptions that one has to make when making inferences from say a twin study or descriptive family study.
 cf. https://en.wikipedia.org/wiki/Attributable_fraction_for_the_population