What if data is unreliable?


Out of everything I learnt in my first year at university, the most useful thing wasn't a fact or a theory. It was a mantra: "Keep it complex."

Those are the words of Andy Stirling, an academic whose 2010 Nature article of the same name touts the importance of accepting unknowns and uncertainty in research - especially when under pressure to offer a single, definitive interpretation.

When it comes to understanding an issue, the stock response is to "look at the data". But this assumes that all data is reliable and 100% accurate. What if it isn't?

This is a problem I've had to think about during my research project on whether the price and country of origin of UK honey imports have any effect on British beekeepers and their bees.

The bulk of my statistical analysis has involved comparing data on UK honey imports with data on British bee colonies and honey production - yet both sets have flaws. For starters, they come from several different sources that sometimes contradict each other. There are also gaps in some sets, anomalies in others, and a suspicious lack of change in some periods.

All of this can make analysing data seem pointless - how can we ever hope to draw conclusions from data if we have reservations about whether it actually reflects reality?

But during my research project I've learnt some handy methods for dealing with unreliable data - methods that enable useful analysis while still keeping things complex.

Averaging

It may sound obvious, but when your data comes from different overlapping sources, averaging is a huge help. If you have, say, three different values for one year, you can average them into a single figure for your analysis, while keeping the highest and lowest values as a range. This meant that when I ran correlation tests later in my analysis, I could use the average, low and high values to assess the full range of possibilities.
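To make this concrete, here's a minimal sketch in Python - the figures are invented purely for illustration:

```python
# Three overlapping sources reporting the same year's figure
# (hypothetical values, e.g. tonnes of imported honey).
sources = {"source_a": 1200, "source_b": 1350, "source_c": 1280}

values = list(sources.values())
average = sum(values) / len(values)   # central estimate for the year
low, high = min(values), max(values)  # bounds for later correlation tests

print(f"average: {average:.0f}, range: {low}-{high}")
```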

If you're analysing data over a period of time, you can take averaging further by creating three- and five-year averages for use in your analysis. Not only does this tend to reduce the risk of anomaly years drawing you into misleading conclusions, it also gives you a more consistent picture of trends over time.
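Rolling averages are a one-liner if you have pandas available - again, a sketch with made-up numbers:

```python
import pandas as pd

# Hypothetical annual figures; the point is the smoothing, not the values.
annual = pd.Series([1200, 1310, 980, 1250, 1400, 1150, 1330],
                   index=range(2015, 2022))

# Centred three- and five-year rolling means soften anomaly years.
smoothed = pd.DataFrame({
    "annual": annual,
    "3yr": annual.rolling(window=3, center=True).mean(),
    "5yr": annual.rolling(window=5, center=True).mean(),
})
print(smoothed)
```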

Interpolation

Interpolation sounds complicated, but you're probably more familiar with it than you think.

During the Covid press conferences you probably heard scientists use the word "extrapolate" when predicting how infection numbers might change over time. Extrapolation is using historical data to predict how that data might look in the future - provided it runs according to historical trends.

Interpolation is the opposite - using data we already have to fill in gaps in past data. That could mean breaking annual data down into monthly data, or (in the case of my research) addressing a period where no data exists.
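Here's what a simple linear interpolation looks like in pandas - a sketch, assuming the gap years are recorded as missing values:

```python
import pandas as pd

# Hypothetical series with a two-year gap where no data exists.
series = pd.Series([1200, None, None, 1380, 1420],
                   index=range(2016, 2021))

# Linear interpolation fills the gap along the trend between known points.
print(series.interpolate(method="linear"))
```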

Though it's an extremely helpful tool, it's important to remember that interpolation is only as accurate as your data, and it only produces results in line with existing trends. Luckily, there are a couple of ways to soften the impact of this.

Sensitivity testing

Again, this sounds complicated but is really just playing with numbers. Sensitivity testing essentially involves changing your data and seeing whether that affects the results.

Let's say you've found a correlation between two sets of data and you want to test how strong your findings are. What happens if you up the numbers in one dataset by five per cent? What happens if you decrease them by the same amount?

It's all in the name - sensitivity testing tests how "sensitive" your findings are. If your findings barely change, the risk of drawing misleading conclusions from inaccurate data is smaller than if they shift substantially.
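One wrinkle worth knowing: Pearson's correlation is unchanged if you scale every value in a dataset by the same flat percentage, so a useful variant is to jitter each point independently within that margin instead. Here's a sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical paired datasets (e.g. import prices vs. colony counts).
x = np.array([10.0, 12.0, 9.5, 14.0, 13.0, 11.5])
y = np.array([200.0, 180.0, 210.0, 150.0, 160.0, 190.0])

baseline_r = np.corrcoef(x, y)[0, 1]

# Jitter each point by up to +/-5% and recompute the correlation
# many times to see how far the result can move.
rs = []
for _ in range(1000):
    jitter = rng.uniform(0.95, 1.05, size=x.shape)
    rs.append(np.corrcoef(x * jitter, y)[0, 1])

print(f"baseline r = {baseline_r:.3f}")
print(f"jittered r spans {min(rs):.3f} to {max(rs):.3f}")
```

If the correlation barely moves across the trials, the finding is robust to that level of measurement error.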

It's easy to go down an endless rabbit hole with statistical analysis, which is why this final tool is vital for tackling unreliable data...

Look at the real world

Remember, quantitative analysis - drawing conclusions from statistics - is just one side of the research coin. The other side, qualitative analysis - looking into things that can't be measured mathematically - is just as important because it contextualises your data. Knowing the numbers is one thing, but knowing why the numbers are as they are makes your findings much more useful.

Luckily there are countless ways to do this: journal articles, archives, and interviews with those involved in your research topic.

"Keeping things complex" is often about giving everything its proper context.
