Analysis of Variance and Causation

by David L. Duffy

Energy, matter and mind, after Szekely & Rizzo 2017

I have been reading some papers from the philosophy of biology on “causal selection”, that is, the identification of those entities or variables that are the most important explanation of the behaviour of some feature of a complex biological system, which might be a single organism or a population of organisms. For example, geneticists single out (variation in) genes as the most important causative factor in biological evolution and development. The thesis of “causal parity” or “causal democracy” [1] is that such a focus is, roughly, an outcome of scientific short-sightedness rather than a recognition of a natural kind.

Scientists and statisticians are as interested in the nature of causation as philosophers are. A key paper by Richard Lewontin is “The Analysis of Variance and the Analysis of Causes” (with the brief title “Analysis of Variance in Human Genetics”) [2] that argues the statistical machinery used by quantitative and population geneticists, outside of experimental data, gives little information about genes as causes. It was written in the midst of the controversies about the genetics of intelligence and the interpretation of the mean differences in IQ between racial groups arising from Jensen’s 1969 paper [3]. The statistical approach he discusses is, like it says in the title, analysis of variance (ANOVA).

ANOVA and related approaches have been the workhorses of statistical analysis of scientific data since 1918, when it was introduced by Sir Ronald Fisher to reconcile Mendelian and Biometrical genetics. If I plug “ANOVA” into Google Scholar I obtain 2.5 million hits, though if I do the same in PhilPapers [4], I get 89, of which only nine discuss the relationship between the analysis of variance worldview (for that is what it is) and causation. Northcott [5] says “[t]he philosophical literature specifically on causal efficacy is relatively sparse…weighing up the relative importance of genes and environment is a more complex matter than it was for gravity and electricity…However, all is not completely lost since in response to these conceptual difficulties biologists have developed instead a slightly different understanding of causal efficacy.” Northcott specifically follows the counterfactual interpretation of ANOVA, which statisticians generally have embraced starting from Rubin [6] and Holland [7]. As Clark Glymour’s invited commentary [8] has it, “Holland’s paper is as much philosophical as it is statistics”. In the following, I will try to sketch these methods very simply, enough that you can follow Lewontin’s objections.

So what is different about analysis of variance, and how is it connected with causal inference? The first thing to say is that it is both the name of a particular set of numerical techniques that preceded access to modern computers and the more general likelihood-based approach [9] that incorporates correlation and various types of linear regression. The variance is (half) the average squared difference in a measured variable between all possible pairs of members of a given collection of individuals (a population) [10]. So it is common to state that any causal conclusions derived from an analysis of variance are conclusions about the causes of individual differences, rather than the causes of the measured variable itself. This then means such conclusions are restricted to the population under study, or a family of populations related to the observed population – a lot of statistical theory assumes we observe a sample or subset of a population, and we wish to make inferences about the rest of the population.
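The pairwise-difference view of variance can be checked numerically. The following is a minimal sketch, not anything from the cited papers: the heights are made-up values, and the function name is hypothetical. It shows that averaging the squared differences over all ordered pairs (self-pairs included) and halving the result reproduces the usual population variance.

```python
from itertools import product
from statistics import pvariance


def half_mean_squared_pair_difference(values):
    """Half the average squared difference over all ordered pairs of values."""
    n = len(values)
    total = sum((a - b) ** 2 for a, b in product(values, repeat=2))
    return total / (2 * n * n)


# Hypothetical heights in cm, chosen only for illustration.
heights = [160.0, 165.0, 170.0, 175.0, 180.0]

pairwise = half_mean_squared_pair_difference(heights)
classical = pvariance(heights)  # mean squared deviation from the population mean
print(pairwise, classical)
```

The factor of one half appears because each pairwise difference contains the deviation of two individuals from the mean, not one.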

If one has multiple samples or populations, one compares the average squared distances within each population to the overall average squared distances across the entire collection of individuals [11]. In the original numerical methods of (one-way) ANOVA, one calculates the Within-Group Mean Squares and Total Mean Squares, and the statistical tests are based on the (appropriately weighted) ratios of these types of summary statistics. In many settings, the different populations are defined by their value at a second variable, the putative causative factor. For assessment of causal efficacy or importance of the factors that define these groups, we standardize the magnitude of these ratios to give measures such as the R², intraclass correlation, or in genetic settings, the heritability (h²).
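As a rough illustration of the one-way calculation, here is a small sketch using two hypothetical groups of heights (the group labels and values are invented). It computes the within-group and total mean squares and the standardized ratio R², taken here as the share of the total sum of squares that lies between groups.

```python
from statistics import mean


def one_way_anova(groups):
    """Within-group and total mean squares, plus R^2, for a list of groups."""
    all_values = [x for group in groups for x in group]
    n_total = len(all_values)
    n_groups = len(groups)
    grand_mean = mean(all_values)
    # Squared distances from the overall mean, and from each group's own mean.
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    ss_within = sum((x - mean(group)) ** 2 for group in groups for x in group)
    return {
        "ms_within": ss_within / (n_total - n_groups),
        "ms_total": ss_total / (n_total - 1),
        "r_squared": 1.0 - ss_within / ss_total,
    }


# Hypothetical heights (cm) for two groups defined by a putative causal factor.
low_ses = [164.0, 166.0, 168.0]
high_ses = [168.0, 170.0, 172.0]

result = one_way_anova([low_ses, high_ses])
print(result)
```

With only two groups and equal group sizes, this R² is the same quantity one would get from a simple regression of height on a 0/1 group indicator.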

If you are at all familiar with linear regression and ANOVA, you might be thinking I have completely skipped a part of this type of mathematical modelling. After all, these models are usually presented as applicable to an individual, providing an estimate of a systematic effect of the putative causative factor (usually as a “fixed effect”, say 4 cm taller if one’s father comes from a high UK social class rather than a low social class), along with an environmental (or “error”) variance representing how much the unmeasured factors could push one further up or down around that average effect (environmental noise) [12]. But this numerical estimate assumes everything else stays the same. The homogeneity of variance assumption is that within-subpopulation variances are all equal, so that any differences are driven by differences in the mean between the groups eg people from high SES subpopulations are taller on average than those from low SES subpopulations. The additivity assumption is that a short person gains 4 cm in height as they move from low SES to high SES (counterfactually, or possibly by social mobility at the right age), in the same way a tall person would – all other things being equal. The no interaction assumption is that if there are multiple causes acting, if you did measure them they would just “add onto” the effects of your variable of interest, so the same increase in protein intake in shorts and talls would have a constant effect, and this would be the same in low SES and high SES.

Now one can generalize models to allow all these assumptions to be discarded, but you then need more information and things become messy to understand. Or, in many areas of application, you can hold all the other relevant factors constant eg feed all the animals the same food. Now back to inference about variances rather than means.

A genetic example

In a simple-minded genetic ANOVA of a collection of monozygotic (MZ, that is, genetically identical) twins who have been separated at birth and adopted into different families, we measure the average differences within each pair for a quantitative trait (the geneticist’s term for a metrical variable that is a property of an organism) and compare it to the average differences between all members of the collection. If the average within-pair difference is zero (so each member of an MZ twin pair is exactly the same height as their co-twin), and the total average difference is greater than zero, the heritability, (V_T − V_W)/V_T, where V_T is the total variance and V_W the within-pair variance, is 1, while if the average within-pair difference is the same as the total average difference, the heritability will be zero. As with the additivity assumption above, one consequence of this emphasis on differences is that the analysis is invariant to causative factors that affect all population members equally. If everybody’s diet improves equally, then population mean height increases (per generation quite rapidly eg Eastern Europe) but causes of differences in height between individuals might remain completely unaffected.
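The reared-apart calculation can be sketched in a few lines. This is an illustrative simulation under assumptions that are mine rather than the essay’s: a purely additive model with unit total variance, independent environments for the two co-twins, and function names invented for the example. The within-pair variance is taken as half the mean squared within-pair difference.

```python
import random
from statistics import pvariance


def mza_heritability(pairs):
    """(V_T - V_W) / V_T for a list of (twin1, twin2) trait values."""
    everyone = [x for pair in pairs for x in pair]
    v_total = pvariance(everyone)
    # Within-pair variance: half the mean squared within-pair difference.
    v_within = sum((a - b) ** 2 for a, b in pairs) / (2 * len(pairs))
    return (v_total - v_within) / v_total


def simulate_mza_pairs(n_pairs, h2, seed=0):
    """Simulated MZ twins reared apart: shared genes, independent environments."""
    rng = random.Random(seed)
    sd_genetic = h2 ** 0.5
    sd_environment = (1.0 - h2) ** 0.5
    pairs = []
    for _ in range(n_pairs):
        g = rng.gauss(0.0, sd_genetic)  # one genetic value shared by both twins
        twin1 = g + rng.gauss(0.0, sd_environment)
        twin2 = g + rng.gauss(0.0, sd_environment)
        pairs.append((twin1, twin2))
    return pairs


pairs = simulate_mza_pairs(20000, h2=0.8)
estimate = mza_heritability(pairs)
print(round(estimate, 3))  # should land close to the simulated value of 0.8

# Adding the same amount to everyone shifts the mean but not the estimate.
shifted = [(a + 5.0, b + 5.0) for a, b in pairs]
print(round(mza_heritability(shifted), 3))
```

The last two lines illustrate the invariance point: a uniform improvement changes the population mean while leaving the variance ratio where it was.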

To infer anything about causality in this observational natural experiment, we have to make various assumptions. For example, we assume that members of a twin pair are not more likely to be adopted into families similar in terms of environmental factors affecting height, say diet. If we assume that the fact that monozygotic twins are genetically identical is the sole causative pathway, then partitioning an existing population into twin families (as opposed to adoptive families) allows us to infer that genes are the important cause of differences between individuals in that population. If genes are not important, then this partitioning does not give rise to a significant test statistic. The counterfactuals here are all the other ways we could have partitioned the population (permutations or randomizations), and (one type of) statistical testing looks at the proportion of times we would have observed our outcome among all these other alternatives (a set of possible worlds differing only in which family the same set of individuals – with respect to height – dropped into).
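The “other ways we could have partitioned the population” can be made concrete as a permutation test. The sketch below is one simple version, with hypothetical data and function names: it re-pairs the same individuals at random many times and asks how often the re-paired within-pair mean square is as small as the one actually observed.

```python
import random


def within_pair_mean_square(pairs):
    """Half the mean squared difference between the two members of each pair."""
    return sum((a - b) ** 2 for a, b in pairs) / (2 * len(pairs))


def permutation_p_value(pairs, n_permutations=2000, seed=0):
    """Proportion of random re-pairings at least as similar as the observed pairing."""
    rng = random.Random(seed)
    observed = within_pair_mean_square(pairs)
    individuals = [x for pair in pairs for x in pair]
    as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(individuals)
        # Take consecutive individuals as a new, arbitrary set of "families".
        shuffled = list(zip(individuals[0::2], individuals[1::2]))
        if within_pair_mean_square(shuffled) <= observed:
            as_extreme += 1
    # Count the observed arrangement itself as one of the possible worlds.
    return (as_extreme + 1) / (n_permutations + 1)


# Hypothetical heights (cm): co-twins differ by only 1 cm, pairs differ widely.
twin_pairs = [(150.0 + 2 * i, 151.0 + 2 * i) for i in range(20)]

p_value = permutation_p_value(twin_pairs)
print(p_value)
```

A small proportion says the observed family partition is unusual among the alternatives; it says nothing by itself about which of the assumptions listed above makes it unusual.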

One property of this type of analysis is that we don’t actually measure the proximate causes of changes in height, which are not only the whole suite of thousands of particular relevant genetic variants present in any one individual, but also all the external causal factors (diet etc). The more common classical twin study uses monozygotic (MZ) and dizygotic (DZ) twin pairs who have been reared together to calculate ratios of all four mean squares in order to estimate effects of both genes and family environment.

So what are Lewontin and others’ criticisms here? Firstly, that the relative size of genetic effects depends on what environment the organism is within. So the heritability, being a ratio, will fall or rise if genetic effects remain constant and non-genetic factors increase or decrease in size or the proportion of the population affected by any single cause alters. Similarly it will remain constant when genetic and non-genetic factors move together. Therefore, the criticism is that this is not a consistent measure of causal efficacy. Further, genes do interact with environmental causes in a nonlinear fashion – we know this is the case for some individual genes (norm of reaction). In the reared-apart MZ twin design, we know nothing about the environmental causes acting in any one individual, let alone how they might interact with the specific genotype of that individual. We are essentially averaging over all those different interactions. If MZ and DZ twins are available, these kinds of effects do lead to different total variances in the MZ and DZ groups, but one might need very large sample sizes to reliably detect this difference.
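The first criticism is simple arithmetic once heritability is written as a ratio. A small sketch, assuming the simplest additive decomposition (total variance = genetic variance + environmental variance, no interaction) and using made-up variance values:

```python
def heritability(v_genetic, v_environment):
    """Ratio of genetic to total variance under a purely additive model."""
    return v_genetic / (v_genetic + v_environment)


# Hypothetical variances (cm^2). The genetic variance is held fixed at 30.
v_genetic = 30.0
for v_environment in (10.0, 30.0, 90.0):
    print(v_environment, heritability(v_genetic, v_environment))

# Doubling both components together leaves the ratio unchanged.
print(heritability(60.0, 60.0), heritability(30.0, 30.0))
```

The genetic contribution is identical in the first three cases, yet the ratio moves from three quarters to one quarter purely because the environmental spread changes.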

A third point is that the variation in genetic or environmental effects is limited to what is actually present in the population. That is, it gives you no idea about what a single causative factor like a particular allele (genetic variant) might do if it became much more common in the population.

A second genetic example

When autosomal genes are transmitted from parents to offspring (in sexual populations like humans), only one allele (of two possibilities) is transmitted from each parent. This segregation appears to be well randomised – there are good reasons why segregation distorting genes will be generally selected against. This means that for a genetically controlled phenotype like height, the standardised ratio of mean squares in a partitioning of the population into parent-child families can never exceed half that for MZ twins.

In terms of causation, where does the difference between one parent and the child come from? Some of it is coming from the reduction of the parental two-allele (diploid) genotype to a one-allele (haploid) genotype in the transmitted gamete (sperm or egg). So the difference is (partly) arising from an absence. Is this just an artefact of our bookkeeping?

Before discussing this, I will mention the case of genes on the sex chromosomes. For humans, females are X/X (received one X chromosome from each parent), and males are X/Y (a Y was inherited from the father). The Y chromosome contains a very few genes that match up to those on the X (the pseudo-autosomal genes). Therefore, for almost all genes on the X for, say, height, the male genotype is haploid, and that in a female is diploid. To match up the gene dose in each sex, one X chromosome in the nuclei of cells in the tissues of a female is randomly inactivated, so females too end up haploid (but in a patchwork fashion, eg calico cats, colour blindness across the retina in carrier women). Again, the parent-child difference is due to absence of a gene.

So segregation of alleles in meiosis and X-inactivation are physical processes that cause a reduction in the parent-offspring similarity by injecting randomness. In the former case this randomness can be removed by carrying out completely inbred matings, as in animal breeding experiments. Other animals can swap between sexual and asexual reproduction, switching between bearing an ordinary offspring or an “MZ twin” of themselves (which leads into the question of what is the value of randomization in reproduction). When I think of randomness, I think of a source of extra variance in a population – I can usually only know of it by the distribution of its effect over multiple observations.

In the figure below, one of the most famous in the history of statistics, Sir Francis Galton plots the stature of men versus their adult sons. The red dot is the mean height for each group, the black diagonal line is where the points would fall for MZ twins with h²=1, the red line is the “line of best fit” predicting son from father, and the angle between the red and black lines is a measure proportionate to half the heritability, where the “other half” is the random segregation variance.

Father and son heights from Galton 1886
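The “half” can be seen in a simulation. This is a sketch under a textbook additive model that I am assuming for illustration (random mating, no shared family environment, unit trait variance, a normally distributed segregation term); none of the numbers come from Galton’s data. Under those assumptions the expected slope of the regression of son on father is half the heritability.

```python
import random


def simulate_father_son(n_families, h2, seed=0):
    """Standardized father and son trait values under a purely additive model."""
    rng = random.Random(seed)
    sd_additive = h2 ** 0.5               # spread of inherited (additive) values
    sd_environment = (1.0 - h2) ** 0.5    # spread of environmental deviations
    sd_segregation = (h2 / 2.0) ** 0.5    # randomness injected by segregation
    fathers, sons = [], []
    for _ in range(n_families):
        a_father = rng.gauss(0.0, sd_additive)
        a_mother = rng.gauss(0.0, sd_additive)
        # The son gets the mid-parent value plus a random segregation deviation.
        a_son = 0.5 * (a_father + a_mother) + rng.gauss(0.0, sd_segregation)
        fathers.append(a_father + rng.gauss(0.0, sd_environment))
        sons.append(a_son + rng.gauss(0.0, sd_environment))
    return fathers, sons


def regression_slope(x, y):
    """Least-squares slope for predicting y from x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    return s_xy / s_xx


fathers, sons = simulate_father_son(50000, h2=0.8)
slope = regression_slope(fathers, sons)
print(round(slope, 3))  # expected to be near 0.8 / 2 = 0.4
```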

Back to the ANOVA viewpoint

The machinery of ANOVA underlies population and evolutionary genetics, with its emphasis on populations of organisms and of genotypes. For example, in the univariate Breeder’s Equation, the heritability of the trait, h², determines the (short term) response to artificial or natural selection:

R = h² S

The S is the selection differential, which is the difference in mean of the selected trait between the current total population and the subset that will be mated to produce the next generation. The R is the difference in mean trait value between the resulting next generation and the current total population. You can see from the Figure why it is so simple [13]. There is no mention of the identity and number of genes actually involved in the causation of the trait – in fact two standard mathematical models assume either one gene, or an infinite number of genes each of tiny effect. One does need a sample or population to estimate the heritability, but once estimated it can be applied to a single mating or parent-offspring pair from that population to predict the most likely offspring trait value. In the completely inbred case, there will be no response to selection, because any differences between individual trait values in the parental generation are completely environmental, that is, h²=0.
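As a worked example of the Breeder’s Equation, here is a minimal sketch with invented numbers: a population with mean height 170 cm from which parents averaging 176 cm are chosen, so that S (the difference between the selected parents’ mean and the population mean) is 6 cm. The function name is hypothetical.

```python
def response_to_selection(h2, selection_differential):
    """Breeder's Equation: expected shift in the next generation's mean."""
    return h2 * selection_differential


# Hypothetical values, for illustration only.
population_mean = 170.0
selected_parents_mean = 176.0
S = selected_parents_mean - population_mean

for h2 in (0.0, 0.5, 0.8):
    R = response_to_selection(h2, S)
    print(f"h2={h2}: expected offspring mean {population_mean + R:.1f} cm")
```

The h2 = 0.0 row corresponds to the completely inbred case: the same parents are selected, but the next generation’s mean does not move.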

How does this fit in with Lewontin’s points? In artificial selection, the goal-directed choice of inputs that drives the change in population trait mean is the selection process itself ie the partitioning of the population into those allowed to breed. However, this leads to zero effect unless the heritability is non-zero, that is unless the mechanics of meiosis transmit the parental phenotype to the offspring in the correct fashion. Famously, there are lots of methods of parent-offspring transmission [14] that will not lead to successful selection.

So in one sense, in the absence of a selective pressure, the heritability is a potentiality of that trait in that population with this particular array of environmental causes and distribution of genotypes. This sounds fairly abstract, and perhaps not a good candidate for being an entity directly relevant to causation. And since h² is a ratio, the input of other causes, again say diet for height, can lead to shifts in response to selection of the population.

But in another sense, for any one individual, the heritability is a measure over the counterfactuals of how you could have developed conditional on having exactly the same set of genotypes you have, but in a different environment selected from the range of those represented in your current population. So if heritability is high, then the current range of environments doesn’t buffet this trait around much, again as a ratio comparing it to the total variance.

The interventionist theory of causation (eg Woodward 2003 [15]) has it that an entity is a direct cause if manipulation of it leads to alteration in downstream traits. In a third sense, my description of a high heritability trait above is that the effects of manipulation of the relevant genotypes are larger than those of manipulation of the environment, provided – again! – that those manipulations are of the same size as those previously seen in the relevant population. At the molecular genetic level, causation is a low bar, in that there are suggestions that every gene can affect every trait, and so one measure of causal importance is to scale this relative to the population trait variance.

What do we think of Lewontin’s criticisms overall? A “naive” hereditarianism holds that a high heritability trait is preordained from birth, that is, that modifications of the environment will not lead to great changes in the value of that trait. This was the conclusion that Jensen had come to regarding investment in education [16] that Lewontin was, in part, reacting to.

Now the first caveat we have seen is that environment-based interventions could be much larger than those seen on average in the population. One point of view in the “IQ wars” was that education being offered was grossly suboptimal, especially for disadvantaged kids. A simple ANOVA model might predict that decreasing the variation in quality of education will actually increase heritability, as the contribution from genes to the ratio will be relatively larger.

But our models need to be as complex as reality, precisely in that high IQ parents will offer better education to offspring, so our simple assumptions of “no interaction” are incorrect. In Kong and coworkers’ (2018) [17] study that measured genotypes at 120 genes to predict a person’s education level, one-quarter of the correlation between parent and child educational attainment was mediated by the alleles not transmitted during segregation. That is, parental educational achievement was predictable from genes (to some extent), and higher parental educational achievement leads to higher parental social position, so that a proportion of the apparent heritability of educational achievement in the children is indirectly mediated by parental social position, or rearing practices that track social position.

Is it incorrect to call the combined contributions of parental genes to the child’s educational outcome (direct and indirect) “genetic”? It is if one is trying to predict the size of response to an environmental intervention to the children (such as Head Start) [18]. Or if one is making a political point about differences in attainment by school when thinking about social class and teacher pay rates. But the label may be less important than the causal pathway the Kong et al genetic analysis reveals – it suggests environmental interventions improving parental education should lead to improvements for the children in the same way a far less possible genetic change would. Like many claims in the social sciences, this might seem pretty obvious, but in fact we don’t see such indirect effects for many other similar traits.

So do nonexperimental studies tell us anything about causation? I think definitely. Nobody doubts what Galton’s graph of parent and offspring heights is telling us about the thousands of causative factors controlling human stature. Are such physical traits different in terms of complexity of causation, from say human behaviour, or weather? The ANOVA view of the world is that such a multiplicity is not a difference in kind.

David L. Duffy is a research scientist who works on the statistical genetics and genetic epidemiology of traits ranging from cancer to personality.  As a result, he feels qualified to have an opinion on everything. He practiced medicine sometime back in the previous millennium, and has read far too much science fiction. You can see lists of publications and other stuff (even a couple of pastels) at, and some of what he’s been reading (or doing) lately at


[1] Weber M (2005). Genes, causation and intentionality. History and philosophy of the life sciences. Jan 1:407-20.

Kitcher P (2001). Battling the Undead. How (and How Not) to Resist Genetic Determinism. In: Singh R., Krimbas C., Paul D.B. and Beatty J. (ed.), Thinking About Evolution: Historical, Philosophical and Political Perspectives, Cambridge: Cambridge University Press, 396-414.

[2] Lewontin RC (1974). The analysis of variance and the analysis of causes. Am J Hum Genet 26: 400–411.

[3] Jensen AR (1969). How much can we boost IQ and scholastic achievement? Harvard Educ Rev 39:1-123.

[4] Bourget D, Chalmers D (editors). PhilPapers. Accessed 2019-May-29.

[5] Northcott R (2006). Causal efficacy and the analysis of variance. Biology and Philosophy 21:253–276.

[6] Rubin DB (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. J Educational Psych 66:688-701.

[7] Holland PW (1986). Statistics and causal inference. J Am Statist Assoc 81: 945-60.

[8] Holland PW, Glymour C, Granger C. Statistics and causal inference. ETS Research Report Series. 1985 Dec;1985(2):i-72.

[9] Likelihood based methods are those most closely related to information theory and Bayesianism about knowledge. Least-squares and moment-based approaches make fewer assumptions about distribution, but overlap with so-called pseudolikelihood methods.

[10] Wikipedia article on “Variance”.

[11] Usually, we use ordinary Euclidean distances, but all kinds of other ways of measuring distance can be applied, eg in kernel Hilbert spaces. The ambitious figure from Székely & Rizzo (2017) actually refers to an approach that utilizes distances between distances. Székely GJ, Rizzo ML (2017). The Energy of Data. Annu Rev Stat Appl 4:447–79.

[12] Bann D, et al (2018). Socioeconomic inequalities in childhood and adolescent body-mass index, weight, and height from 1953 to 2015: an analysis of four longitudinal, observational, British birth cohort studies. The Lancet Public Health 3:e194-e203.

[13] Because of the absence of specification about the genes involved (other than broadly), the breeder’s equation is so general that it applies not only to simulations of biology, but also to solving abstract mathematical models where one has the goal of maximising a figure of merit – genetic algorithms. In these, the genes are many rules or partial solutions to a problem, where we wish to combine these into a single best solution. Interested readers can also check out the very abstract information geometrical interpretations of Fisher’s Fundamental Theorem of Natural Selection (which explains why there is always more genetic variance to select upon).

[14] For example transgenerational epigenetic inheritance probably only lasts a few generations in the absence of persisting environmental selection. As it happens, presence and extent of epigenetic inheritance is under genetic control.

[15] Woodward, J. (2003), Making Things Happen. New York: Oxford University Press.

[16] Jensen AR (1969). How much can we boost IQ and scholastic achievement? Harvard Educ Rev 39:1-123.

[17] Kong A, Thorleifsson G, Frigge ML, Vilhjalmsson BJ, Young AI, Thorgeirsson TE, Benonisdottir S, Oddsson A, Halldorsson BV, Masson G, Gudbjartsson DF. The nature of nurture: Effects of parental genotypes. Science. 2018 Jan 26;359(6374):424-8.

This is a genome-wide association analysis, and does not rely on some of the assumptions that one has to make when making inferences from say a twin study or descriptive family study.

[18] cf.


5 responses to “Analysis of Variance and Causation”

  1. Peter Smith

    David, this is lovely stuff. You have prompted me to pull down from my bookshelf my ancient copy of ‘Multivariate Data Analysis‘ by Cooley and Lohnes (

    Working from this book I used factor analysis to identify the principal orthogonal components in a large set of data gathered from a nationwide survey of vehicle owners’ perception of quality.

    The questions were poorly formulated and revealed confusing answers. I was convinced that the dataset contained valuable insights. Factor analysis was the answer (thanks to Cooley and Lohnes) and revealed the principal ways (principal components) in which customers perceived the quality of their vehicles. I was then able to measure the standing of each manufacturer in each of these orthogonal components. This study proved to be transformative in the way we understood how customers understood quality and how they perceived us, the manufacturers.

    At the time, I, like everyone else, took causality for granted. It existed and ruled our world. Since then I have questioned the concept, and always came back to the same answer. It exists, is all powerful and all encompassing. Now I ask why this should be so? This has to be the most perplexing question that we face.

  2. davidlduffy

    Factor analysis is an interesting case to discuss, in that there are an “infinite” number of equivalent factor models (ie fitting the data equally well) for any one dataset. One usually rotates the factors until they “make sense” in terms of what you already know about the topic. Often the first principal component (in a PCA), the one with the greatest R² (“proportion of variance explained”) is not very illuminating until seen in the context of the next two or three, when for example, you realize it corresponds with the direction people migrated into Europe 50000 years ago. Or in your example, the first PC could be a combination of decreasing price versus increasing quality – a 45 degree rotation in this made-up example might give separate “price” and “quality” dimensions.

    To make factor analysis say something about causation, you need to expand it, as in Path Analysis aka Structural Equation Modelling. This was first invented by the great geneticist Sewall Wright, who used it to model pork belly futures in 1918, but mainly used it for genetic purposes. The factors in most of Wright’s examples are unobserved genes lying behind observed traits.

  3. Chris Stephens

    I don’t know if this says something about the PHIL PAPERS search engine, but you’ve missed some of the best philosophical discussions of ANOVA. For example, Elliott Sober’s paper “Separating Nature and Nurture” (maybe doesn’t show up because it appeared in a book on Genetics and Criminal Behavior?) explains things admirably. (He also has some introductory remarks in his Philosophy of Biology textbook, ch. 7).

    There is also Arthur Fine’s paper from 1990 Midwest Studies in Philosophy “Causes of Variability: disentangling nature and nurture”

    I didn’t see any of this show up in the phil papers list.

  4. Chris Stephens

    I forgot: Sober’s paper “Apportioning Causal Responsibility” from JPhil is also about understanding causes and ANOVA. It is from Journal of Philosophy 1988. Also didn’t see this in the Phil papers search.

  5. davidlduffy

    It is, but ISTM doesn’t advance greatly over the “standard interpretation” used by statisticians. It would not be surprising that Sober knows all about this stuff, given his coauthoring with Wilson arguing for group selection in evolution. The latter very much requires careful thought about causation and these types of models (and you would know that most evolutionary biologists think the claim there is such a thing as group selection is specious). There is an interesting article by James MacLaurin, “The Resurrection of Innateness” (the Phil Papers search also misses this) that also uses the Lewontin paper to organize these matters around – he attempts to differentiate between causal analysis and informational analysis, but I’m not sure if this is too helpful.

    This paper is quite nice – by biologists – and discusses what the authors dub “the differential view” ie what I have tried to emphasize. I have worked on several of their example phenotypes.

    I think the problem I have with the Lewontin paper is that (like similar papers written in response to “The Bell Curve”) it debunks too much. So, one is allowed to use these methods on, say, bovine milk production, human height and schizophrenia, and decide that the genes that we now know about via genome-wide genotyping are “real” causes that are consonant with family or twin based analyses, but not on, say, personality or church attendance (the latter turns out to have low heritability but strong effects of family environment, BTW ;)). The rise of Mendelian Randomization analysis by epidemiologists is uncontroversial it seems.