
### What is Regression to the Mean and why is it still important in media research?

Posted on: **January 4, 2021** | 0 Comments

by Sam Stratford.

Randomness poses a threat to many statistical analysis projects, so it is vital to understand its impact to ensure accurate and actionable conclusions are reached. Regression to the Mean (RTM) can help to measure the effect of randomness on a dataset, which in turn helps to establish the extent to which meaningful conclusions may be drawn. In answering the double-header question in the title of this blog we will mainly consider the application of RTM to fusions, although the technique may also be used to assess other modelling solutions.

### What is Regression to the Mean?

So “What is Regression to the Mean”? If you Google the term or look it up on Wikipedia you will get a definition that, although academically correct, is not the easiest to interpret and doesn’t clearly explain how it relates to assessing performance of fused datasets.

Within RSMB, RTM is often used as a tool for evaluating the performance of data fusions. It determines whether the fused data preserves the real differentiation seen for a classification of interest (e.g. a specific media behaviour) amongst a key variable (e.g. a particular demographic group, other media behaviour) where there are marked differences for that classification. The method compares the results for single source data (the control group) against the fused dataset for any questions of interest and provides a measurement for the fusion which can be used as a validation method. This is an essential part of the evaluation of the method as it demonstrates the performance of the fusion.

### Calculating Regression to the Mean

The following hypothetical example shows how Regression to the Mean can be used to evaluate a fusion.

*Do you use Social Media Platforms daily?*

| | Single source | Fusion |
| --- | --- | --- |
| All Adults | 5% | 5% |
| 16-24s | 20% | 15% |
| Index | 400 | 300 |

Here the index of the 16-24 percentage against the All Adults percentage for the real respondents shows that 16-24s are 4 times as likely to use Social Media platforms daily as All Adults - a clear difference in behaviour.

The same calculation for the fused data gives a multiple of 3 times; this reduction in the index shows that the fused data has a reduced level of differentiation for this category.

The Regression to the Mean is calculated by how much the original index has moved towards 100; were the fused Adults 16-24 usage percentage equal to the All Adults percentage then all differentiation would be lost and the fusion would effectively be no better than having randomly matched datasets. For this example the Regression to the Mean is 33% (i.e. (400-300)/(400-100) %) or, looking at it with a more positive spin, 67% of the original high discrimination has been preserved.
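The calculation above can be sketched in a few lines of Python. This is purely illustrative, using the hypothetical figures from the table, not RSMB production code:

```python
def regression_to_the_mean(single_source_index: float, fused_index: float) -> float:
    """How far the original index has moved towards 100, as a percentage.

    An index of 100 means no differentiation from the All Adults average,
    so movement is measured relative to that baseline.
    """
    return 100 * (single_source_index - fused_index) / (single_source_index - 100)

# Indices from the hypothetical social media example
single_index = 100 * 20 / 5  # 16-24s vs All Adults, single source: 400
fused_index = 100 * 15 / 5   # same calculation on the fused data: 300

rtm = regression_to_the_mean(single_index, fused_index)
print(f"RTM: {rtm:.0f}%; discrimination preserved: {100 - rtm:.0f}%")
# prints "RTM: 33%; discrimination preserved: 67%"
```

Note that if the fused index fell all the way to 100, the function would return 100% - the complete loss of differentiation described above.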

### When is Regression to the Mean used?

The media industry can use Regression to the Mean as a tool for evaluating the performance of a model or fusion; it provides useful justification that the fusion/model has retained an acceptable amount of differentiation and is usable. Without the Regression to the Mean calculation it would be much harder to quantify the reliability of the fused data.

Failing to take Regression to the Mean into account can lead to misconceptions and incorrect decisions. For a fusion, for example, if the level of Regression to the Mean is significant the fusion may be of little value - the key differentiation would be lost and the fused data would be misleading, little better than a random fusing of two datasets.

It should be noted that a regression to the mean analysis may be limited, or even not possible, depending on what is being evaluated. It was possible to calculate the RTM in the example above as it was analysing the preservation of the relationship between demographic and the variable that was being fused. However if it had instead been a fusion between two media surveys, and the aim was to analyse the cross media relationship in the fusion, there may have been limited true single source data to benchmark against. Of course, obtaining this data might be the reason for undertaking the fusion in the first place!

In our view, Regression to the Mean is an important indicator for the media industry to assess the performance of fusions and models where it is possible to do so.

### Creating a hybrid measurement model by incorporating return path data (RPD) in TAM Audience Ratings.

Posted on: **June 9, 2020** | 0 Comments

In this blog we’ll look at how return path data (for example data collected from a Virgin or Sky set top box) could be used to enhance viewing data from a TV audience measurement (TAM) panel such as BARB. For clarity we’ll call the enhanced system a hybrid model.

Why might you want to incorporate RPD? Whilst TAM meter panels do a good job of measuring viewing overall, sampling error increases as the content being analysed becomes more marginal and granular; for a small channel on a narrow target audience at the programme or spot level the sampling error on published audiences may be large. Whilst the panel may show zero ratings for a channel at a particular time, there are likely to be some people in the population as a whole watching it. The argument goes that incorporating RPD, either in its entirety or by way of a large sample of RPD homes, would increase the precision of the published data and certainly address the problem of zero ratings where audiences exist.

RSMB exists to find statistical solutions to media measurement challenges and, as you would expect, we have undertaken a lot of theoretical statistical analysis on this.

To successfully incorporate RPD into a measurement currency you need to crack three problems:

1. RPD tells you what a set top box (or connected device) was doing, but it doesn’t tell you if anyone was watching. This is relatively easily addressed with a capping algorithm to truncate long viewing sessions.
2. Assuming viewing was taking place, RPD does not tell you who was viewing, so you need to model that.
3. Having cracked problems (1) and (2), you need to be able to produce data files for analysis that are consistent. In the case of the UK, analysis produced from Database 1 (panel respondent level data to support applications like reach and frequency) needs to be consistent with Database 2 (processed data of minute by minute, programme and commercial ratings). This topic isn’t covered here.
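Problem (1) is typically handled with a simple capping rule. A minimal sketch follows; the 3-hour threshold is our assumption for illustration, not a published parameter (in practice it would be tuned against panel behaviour):

```python
from datetime import datetime, timedelta

# Hypothetical cap: any session longer than this is assumed to be a set
# left on with no one watching, and is truncated at the threshold.
MAX_SESSION = timedelta(hours=3)

def cap_sessions(sessions):
    """Truncate over-long (start, end) viewing sessions from the RPD stream."""
    capped = []
    for start, end in sessions:
        if end - start > MAX_SESSION:
            end = start + MAX_SESSION
        capped.append((start, end))
    return capped

sessions = [
    (datetime(2020, 6, 1, 19, 0), datetime(2020, 6, 1, 20, 30)),  # kept as-is
    (datetime(2020, 6, 1, 21, 0), datetime(2020, 6, 2, 4, 0)),    # capped at midnight
]
capped = cap_sessions(sessions)
```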

There are various ways of tackling problem (2) and our work shows that the approach taken is the biggest determinant of the extent to which a hybrid model reduces sampling error compared to a meter panel alone.

It goes without saying that there will be assumptions of statistical independence somewhere in the model which may compromise accuracy; this must be traded off against the gains in precision required for a usable currency. Everyone would agree that a hybrid model is not worth doing if we don’t actually get any gains in precision from incorporating RPD data.

Until it is ubiquitously available, RPD covers a subset of all viewing: one or more platforms (e.g. Sky) and/or devices (e.g. Samsung TVs). For this subset there are two components to the variance: the sampling variance in homes viewing and the sampling variance in people per home viewing – and the two effects may counter each other.

The simplest approach to converting STB data to demographic data is to use the meter panel to create “Viewers per View” for each reporting demographic for a viewing event and apply that to the set top box (STB) homes audience. However, this may result in greater sampling variability than the meter data alone. Whilst a large RPD sample will reduce the variance in homes viewing, the absence of person demographics actually increases the variance in people per home and, overall, sampling error can increase. The more targeted the demographic, the larger the sampling error.
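As a sketch of that simplest approach, with made-up numbers (the function names and figures are ours, purely for illustration):

```python
def viewers_per_view(panel_persons_viewing: int, panel_homes_viewing: int) -> float:
    """Average viewers in a demographic per viewing home, taken from the meter panel."""
    return panel_persons_viewing / panel_homes_viewing

def hybrid_audience(stb_homes_viewing: int, vpv: float) -> float:
    """Project the STB homes audience to a person audience using the panel VpV."""
    return stb_homes_viewing * vpv

# Hypothetical viewing event: 120 panel members aged 16-24 viewing across
# 300 panel homes, and 5,000 STB homes tuned to the same event.
vpv = viewers_per_view(120, 300)       # 0.4 viewers per home
audience = hybrid_audience(5000, vpv)  # 2,000.0 estimated 16-24 viewers
```

The large STB sample tightens the homes estimate, but the Viewers per View factor still carries the panel’s sampling error, which is why the overall error can grow for tightly targeted demographics.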

That variability may be reduced by stratifying the sample. For example, if the reporting demographic is 16-24s, sampling error may be reduced by excluding STB homes that don’t include 16-24 year olds. To do that, access to household composition data is needed. If using all available STB (census) data then this could be from sign-up data, but more likely the STB data will be from a large recruited sample of STB homes and demographics will be collected as part of recruitment.
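A sketch of that stratification step, using hypothetical household records (the field names and ages are our invention):

```python
# Each recruited STB home carries the ages of its residents, collected at
# recruitment (or from sign-up data if census STB data is being used).
homes = [
    {"home_id": 1, "ages": [45, 17]},
    {"home_id": 2, "ages": [52, 50]},
    {"home_id": 3, "ages": [22]},
]

def contains_demo(home: dict, lo: int, hi: int) -> bool:
    """True if anyone in the home falls inside the target age band."""
    return any(lo <= age <= hi for age in home["ages"])

# Stratify: keep only homes that could contribute any 16-24 viewing,
# removing a source of pure noise from the audience estimate.
stratum = [h for h in homes if contains_demo(h, 16, 24)]
# stratum contains homes 1 and 3
```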

Another factor to consider is the effect of STB data, which covers only a subset of platforms and/or devices, on the variability of measured audiences as a whole. Perhaps counterintuitively, it’s not the case that measuring some platforms more accurately necessarily reduces sampling error across the whole hybrid model: it can actually make it worse. For the single source TAM measurement panel, the sum of viewing across all platforms is more robust than the separate platforms. At an extreme, people can have similar overall viewing levels but very different platform shares. Then if we correct platform A up or down but don’t make a compensating correction for platform B, the sum of platform A plus platform B is destabilised.

Steve Wilcox presented our findings from a hybrid model experiment using BARB data at the 2018 ASI conference. This compared the benefits of a 25,000-home RPD “boost” with person demographics against the benefits of a census RPD boost with no person demographics. For each there were alternative models: one where the RPD included all platforms and one where it was for a single platform covering 40% of all viewing. Three types of channel were considered: a “large” channel with an average TVR of 2.81, a “medium” sized channel with an average TVR of 0.17 and a “small” channel with an average TVR of 0.03. The large and medium channels were cross-platform; the small channel was predominantly viewed on the single platform. There is too much detail to go into here, but in summary:

- The 25,000-home boost with demographics covering all platforms significantly increased the effective sample size for All Adults for small, medium and large channels. The effective sample size was at least 3 times higher than without the RPD boost. As we know the demographics of people living in RPD homes, these increases in effective sample size are reasonably consistent across all demographic target audiences.

- However, where the 25,000-home boost with demographics covered only the single platform, there were only marginal improvements to the effective sample sizes of the large and medium sized channels. This is because the sampling error for these channels is still dominated by the component of the audience on platforms only measured by the TAM panel. The small channel did see a significant increase in its effective sample size (by nearly 4 times, which again would be similar for subdemographic audiences) because its viewing is almost exclusive to the RPD-boosted platform.

- The model that included census RPD data for all platforms (i.e. not a sample of RPD homes, but the entire dataset across all platforms) saw substantial increases in effective sample sizes for the All Adults demographic: multiples of 9 times for the large and medium sized channels and 11 times for the small one. However, when you look at subdemographics a very different picture emerges. For ABC1s, effective sample sizes didn’t improve for the large and medium sized channels, although there was a slight improvement for the small channel. For Men 16-34 the effective sample size actually fell compared to the TAM panel alone, because of the loss of the ability to stratify the hybrid model by household demographic composition.

- Finally, the model for census RPD data for the single platform worsened the effective sample size for All Adults for the large and medium sized channels. It did improve it for the small channel by a multiple of 3, again because the channel is predominantly distributed on this platform.

So, perhaps against initial expectations, a large sample boost with demographics actually produces more precise results than a boost using census RPD data where demographics aren’t available.

It’s also worth noting that RPD data would not necessarily improve demographic precision for very small channels that the TAM panel struggles to measure with precision.

So does RSMB think RPD has a place in a hybrid model? Yes we do, but the model needs to be carefully constructed. We believe there are strong arguments for a well-managed sample boost of RPD homes with person demographics, certainly were there ever to be RPD data that covers all platforms. Such a boost may enhance sample sizes for all channels and subdemographics, with benefits more limited to small channels if the boost is limited to a single platform. That is not to say that census data can’t have a value: demographic data may be collected on sign-up or alternatively it may be used for small share platforms in a hybrid cross-media measurement model where its impact on overall sampling error will be limited.

As more and more RPD becomes available, not just from set top boxes but also from connected televisions and viewing on mobile devices, it is likely to become an integral part of future measurement solutions.

If you would like to find out more, please contact us at contact@rsmb.co.uk.