A brief introduction to robust statistics

Hi! This video is a 10-minute introduction to
the basics of robust statistics. We will start by looking at estimating the
location and correlation parameters to see how robust estimators behave relative to their
classical non-robust counterparts. I will then discuss some of my own research
into estimating a sparse precision matrix in the presence of cellwise contamination. At the end are some references you can use
to find out more about the various techniques presented. So, let’s get started. The aim of robust statistics is to develop
estimators that model the bulk of the data and are not unduly influenced by outlying
observations or observations that are not representative of the true underlying data
generating process. To explore this idea we consider the simple
case of location estimation. We will look at three estimators, the mean, the median
and the Hodges-Lehmann estimator of location, which is just the median of the pairwise means. In this example we have 10 observations drawn
from a uniform distribution over the range 0 to 10. All three estimates of location start
off close to the true parameter value, which is 5 in this case. What we will do is observe
how they behave when we artificially corrupt some of the observations. We start by taking the largest observation
and moving it to the right. When we do this the mean also starts to increase which is
what you would expect from a non-robust estimator. In contrast, the Hodges-Lehmann estimator
and the median remain unchanged. It is this resilience to contamination that makes them
what we call robust estimators. If we take the two largest observations and
move them to the right, the median stays the same; the Hodges-Lehmann estimate jumps from
5 to 5.5 but then stays constant, and the mean reacts as before by increasing as the contaminated
observations increase. We observe similar behaviour when the three
largest observations are contaminated. When the four largest observations are contaminated
the Hodges-Lehmann estimate now behaves just like the mean, in that both estimators are
in breakdown, that is to say they are no longer representative of the bulk of the data. And when five observations are contaminated,
even the median is no longer sure which observations represent the original data and which are
the contaminated observations. So it also returns an estimate that is no longer representative
of the location parameter of the original data generating process. We can observe similar behaviour in multivariate
data sets. In this example we are looking to estimate the correlation between variables. In the lower half of the matrix we have scatter
plots of 30 observations. For example this scatter plot shows variable 2 on the y-axis
and variable 1 on the x-axis. Whereas this one has variable 2 on the x-axis and variable
3 on the y-axis. In the upper half of the matrix we have the
true parameter values that I used to generate the data, the classical correlation estimates
and the robust correlation estimates between each of the pairs of variables. The robust estimator that I have used is the
MCD estimator and you can find links to further details at the end of the presentation. We are going to observe what happens to our
estimates when we take three observations and move them away from the main data cloud.
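The experiment just described can be sketched in a few lines of Python. Here I use scikit-learn's MinCovDet as one implementation of the MCD estimator; the simulated data, the contamination values and the random seeds are my own illustrative assumptions, not the exact settings used in the video.

```python
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)

# 30 observations from a bivariate normal with true correlation 0.8
true_corr = 0.8
cov = np.array([[1.0, true_corr], [true_corr, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], cov, size=30)

# Contaminate: move three observations far away from the main data cloud
X[:3] = [10.0, -10.0]

def to_corr(c):
    """Convert a covariance matrix to a correlation matrix."""
    d = np.sqrt(np.diag(c))
    return c / np.outer(d, d)

# Classical (non-robust) correlation estimate
classical = to_corr(np.cov(X, rowvar=False))[0, 1]

# Robust correlation estimate via the Minimum Covariance Determinant
robust = to_corr(MinCovDet(random_state=0).fit(X).covariance_)[0, 1]

print(f"classical: {classical:.2f}, robust: {robust:.2f}")
```

With contamination this extreme the classical estimate should be dragged negative, while the MCD-based estimate should stay close to the true value, mirroring the behaviour in the animation.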
Let’s focus on the relationship between variable 1 and variable 2. As the observations move
further away from the data cloud, the classical correlation estimate decreases and eventually
becomes zero, suggesting that there is no correlation between variable 1 and variable
2. As the contamination moves further away, the classical estimate is now giving a negative
value, suggesting a negative relationship between variable 1 and variable 2. In contrast, the robust estimates have all
stayed reasonably close to the true parameter values as they are modelling the core of the
data and are not as influenced by these outlying values. Let’s look at a real example. Some work I
do for the industry body Meat and Livestock Australia is concerned with predicting the
eating quality of beef. They run a large number of consumer tests,
where consumers are asked to rate pieces of meat on their tenderness, juiciness, flavour
and give an overall score. The aim is to develop a predictive model that
rates each cut of meat as 3, 4 or 5 star based on what we know about consumer preferences. Here are scatter plots of ratings for almost
3000 pieces of meat. There’s obvious structure in the data set,
with tenderness, juiciness, flavour and overall being highly positively related. However,
there is also a lot of noise, with some consumers giving very high scores for some variables,
and low scores for other variables. For example up here you have some observations
where consumers have given very high tenderness scores but very low overall scores. If you calculate classical correlations, you
can see that there are reasonably strong positive linear relationships between the variables.
However, robust techniques such as the Minimum Covariance Determinant can be used to highlight
the tightest core of the data, which represents the relationship between the variables for
the majority of consumers. This suggests that the true underlying correlations between
variables are much stronger than the classical method would otherwise indicate. For example, the relationship between flavour
and overall goes from 0.86 to 0.99 when you restrict attention to the tightest half
of the data. Something that I’ve looked at in my research
is estimation in the presence of cellwise contamination. Traditional robust techniques, such as the
Minimum Covariance Determinant we used earlier, assume that contamination happens within the
rows of a data set. Furthermore, even the most robust estimators
can only cope with at most 50% of the rows being contaminated before they no longer work
effectively (as we saw earlier with the median breaking down when there were 5 contaminated
and 5 uncontaminated observations). This assumption of row-wise contamination
may not be appropriate for large data sets. What we have here is a heat map of a data
matrix. The white rectangles represent corrupted cells. In this situation with a small amount
of cellwise corruption that affects less than half of the rows, traditional robust methods
will still perform adequately. However, as the proportion of scattered contamination
increases, or the contamination is allowed to spread over all the rows, then you might
end up in a situation where all observations have at least one variable that is contaminated. There is still a lot of “good” data in this
data set, the challenge is extracting information about the core of the data without your estimates
being overwhelmed by the contaminating cells. In particular, I have looked at estimating
precision matrices in the presence of cellwise contamination. A precision matrix is just
the inverse of the covariance matrix, however, often in large data sets, you also want to
assume that the precision matrix is sparse, that is, there are a number of zero elements
in the precision matrix. This is particularly useful, when modelling Gaussian Markov random
fields where the zeros correspond to conditional independence between the variables. We’ll briefly apply these ideas to a real
data set to finish off. We have 452 stocks or shares over 1258 trading days. We’ll observe
the closing price, which we convert to daily return series, and we want to estimate a sparse
precision matrix, to identify clusters of stocks. That is, groups of stocks that behave
similarly. To do this we use a robust covariance matrix
as an input to the graphical lasso, details of which can be found in the references at
the end of the presentation. If we look at the return series for the first
6 stocks in the data set, we can see that there are a number of unusual observations
scattered throughout the data. For example, you have this negative return for 3M Co, another
for Adobe, and AMD also has a number of unusual observations. This suggests that perhaps there
is a need for robust estimators when analysing this data set. We’re going to visualise the results as a
network of stocks. If there is a non-zero entry in the estimated precision matrix between
two stocks, then they will both appear in the graph with a line linking them. Furthermore,
I have coloured the stocks by industry. We can see here in the classical approach
that we have some clusters of stocks. However, if we use the robust approach, these clusters
are much more densely populated. So it has identified linkages between more stocks than
the classical approach did, reflecting the fact that the robust method is not as influenced
by those unusual outlying observations. So we’ve got a financials cluster over here.
We’ve got an information technology cluster down here. A utilities cluster and a energy
cluster. If we add some additional contamination, the
classical approach is no longer able to identify those clusters of stocks, it gives you just
this soup of linkages with no clear structure. The robust approach, on the other hand, gives
you essentially the same result as we had before the extra contamination was
added. So you’ve still got an information technology cluster. You’ve still got a utilities cluster
over here and a financials cluster down there. So what this tells you is that the robust
methods are modelling the core of the data and are relatively unaffected by unusual or
extreme observations. So, that’s been an extremely brief introduction
to the idea of robust statistics. The take-home message is that robust methods
are designed to model the core of the data without being unduly influenced by outlying
observations that are not representative of the true data generating process. If you’d like to know more about any of the
methods presented here you can check out these links. For the Hodges-Lehmann estimator, the
Minimum Covariance Determinant, this is a nice recent review paper, or the graphical
lasso. Here’s some of my work. There’s a paper that’s
to appear in Computational Statistics and Data Analysis on the robust estimation of
precision matrices in cellwise contamination. To understand that, it would perhaps help
to go back to an earlier paper on robust scale estimation. Or you could check out my PhD
thesis which is available at that link there. If you would like to get in touch, there are
my details and you can find a copy of these slides at this link.
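Finally, for anyone who wants to experiment, the pipeline from the stocks example (a robust covariance estimate fed into the graphical lasso to produce a sparse precision matrix) can be sketched as follows. The synthetic "returns", the two-block structure, the contamination and the regularisation parameter alpha are all illustrative assumptions of mine, with scikit-learn's MinCovDet and graphical_lasso standing in for the particular estimators discussed in the video.

```python
import numpy as np
from sklearn.covariance import MinCovDet, graphical_lasso

rng = np.random.default_rng(1)

# Synthetic "return" series for 5 stocks with two independent clusters:
# stocks 0-2 load on one common factor, stocks 3-4 on another.
n = 2000
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)
X = np.column_stack([
    f1 + 0.5 * rng.normal(size=n),
    f1 + 0.5 * rng.normal(size=n),
    f1 + 0.5 * rng.normal(size=n),
    f2 + 0.5 * rng.normal(size=n),
    f2 + 0.5 * rng.normal(size=n),
])

# Add some rowwise contamination, as in the corrupted version of the example
X[:50] = rng.normal(scale=20.0, size=(50, 5))

# Step 1: robust covariance estimate (here the MCD)
robust_cov = MinCovDet(random_state=0).fit(X).covariance_

# Step 2: graphical lasso on that covariance gives a sparse precision matrix
_, precision = graphical_lasso(robust_cov, alpha=0.15)

# Non-zero off-diagonal entries of the precision matrix define the edges
# of the network of stocks
edges = [(i, j) for i in range(5) for j in range(i + 1, 5)
         if abs(precision[i, j]) > 1e-4]
print(edges)
```

Plotting these edges with a graph library then gives the kind of network shown in the video; with this set-up the two factor blocks should come out as separate clusters, linked within but not between blocks.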
