Randomness poses a threat to many statistical analysis projects, so it is vital to understand its impact if accurate and actionable conclusions are to be drawn. Regression to the Mean (RTM) can help to measure the effect of randomness on a dataset, which in turn helps to establish the extent to which meaningful conclusions may be drawn. In answering the two-part question in the title of this blog we will mainly consider the application of RTM to fusions, although the technique may also be used to assess other modelling solutions.
So what is Regression to the Mean (RTM)? If you Google the term or look it up on Wikipedia you will get a definition that, although academically correct, is not the easiest to interpret and does not clearly explain how it relates to assessing the performance of fused datasets.
Within RSMB, RTM is often used as a tool for evaluating the performance of data fusions. It measures how well the fused data preserves the real differentiation in a classification of interest (e.g. a specific media behaviour) across a key variable (e.g. a particular demographic group or another media behaviour) where that classification shows marked differences. The method compares results from single source data (the control group) against the fused dataset for any questions of interest, producing a measurement that can be used to validate the fusion. This is an essential part of evaluating a fusion, as it quantifies how well the fusion performs.
The following hypothetical example shows how Regression to the Mean can be used to evaluate a fusion.
In this example, indexing the percentage of 16-24 year olds who use Social Media platforms daily on the All Adults percentage gives an index of 400 for the real respondents: 16-24s are 4 times more likely to use Social Media platforms daily than All Adults, a clear difference in behaviour.
The same calculation for the fused data gives an index of 300, i.e. a multiple of 3 times; this reduction in the index shows that the fused data has a reduced level of differentiation for this category.
Regression to the Mean is calculated as the proportion of the distance between the original index and 100 that has been lost in the fused data. Were the fused Adults 16-24 usage percentage equal to the All Adults percentage (an index of 100), all differentiation would be lost and the fusion would effectively be no better than randomly matched datasets. For this example the Regression to the Mean is 33%, i.e. (400 - 300) / (400 - 100) = 100 / 300, or, looking at it with a more positive spin, 67% of the original high discrimination has been preserved.
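For readers who like to see the arithmetic spelled out, the calculation can be sketched in a few lines of Python. This is only an illustrative sketch of the formula described here, not RSMB's production method; the function names `index_on_base` and `regression_to_mean` are our own, and the 40% and 10% figures in the first example are made-up values chosen only to produce an index of 400.

```python
def index_on_base(group_pct: float, base_pct: float) -> float:
    """Index a subgroup percentage on a base percentage (e.g. All Adults).

    An index of 100 means the subgroup behaves exactly like the base.
    """
    if base_pct <= 0:
        raise ValueError("base_pct must be positive")
    return 100.0 * group_pct / base_pct


def regression_to_mean(real_index: float, fused_index: float) -> float:
    """Proportion of the real index's distance from 100 lost in the fused data.

    0.0 means the differentiation is fully preserved; 1.0 means the fused
    index has collapsed to 100 and all differentiation is lost.
    """
    if real_index == 100:
        raise ValueError("RTM is undefined when the real index is 100")
    return (real_index - fused_index) / (real_index - 100.0)


# Hypothetical percentages (not from the example): 40% of 16-24s vs 10% of All Adults.
print(index_on_base(40, 10))  # 400.0

# The indices from the example: 400 for real respondents, 300 for fused data.
rtm = regression_to_mean(real_index=400, fused_index=300)
print(f"RTM: {rtm:.0%}, preserved: {1 - rtm:.0%}")  # RTM: 33%, preserved: 67%
```

Because both the numerator and the denominator change sign together, the same function also handles categories that under-index: a real index of 50 that becomes 75 in the fused data gives an RTM of 50%.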
The media industry is able to use Regression to the Mean as a tool for evaluating the performance of a model or fusion; it provides useful evidence that the fusion/model has retained an acceptable amount of differentiation and is usable. Without the Regression to the Mean calculation it would be harder to quantify the reliability of the fused data.
Failing to take Regression to the Mean into account can lead to misconceptions and incorrect decisions. For a fusion, for example, a high level of Regression to the Mean may render the fusion of little value: the key differentiation would be lost, and the fused data would be misleading and not much better than a random fusing of two datasets.
It should be noted that a Regression to the Mean analysis may be limited, or even not possible, depending on what is being evaluated. It was possible to calculate the RTM in the example above because it analysed the preservation of the relationship between a demographic and the variable being fused. However, if it had instead been a fusion between two media surveys, and the aim was to analyse the cross-media relationship in the fusion, there may have been limited true single source data to benchmark against. Of course, obtaining this data might be the reason for undertaking the fusion in the first place!
In our view, Regression to the Mean is an important indicator for the media industry to assess the performance of fusions and models where it is possible to do so.