One approach I've been toying with for making these kinds of estimates is quantile regression. Quantile regression is something of a cousin to the more familiar least-squares regression, but it is computationally more tedious, so it was not much used until the advent of modern computing. Nowadays, it's trivially easy to apply to the kinds of climate datasets I mostly work with, that is, point-based time series.

So the first question you ask: what is a quantile? A quantile is, to quote Wikipedia, "…cutpoints dividing the range of a probability distribution into contiguous intervals…". Quantiles can take any value between zero and one. The 0.5 quantile divides a distribution into two equal parts: half the values are above it and half are below. You've heard of this: it's better known as the median. Similarly, the 0.843 quantile is the value below which 84.3% of the distribution falls, with 15.7% above. Quantile regression is a method to estimate the quantiles of a dataset when one variable is (possibly) dependent on one or more other variables.

The second question you ask: why would you want to use quantile regression? There are a couple of reasons. First, quantile regression is not nearly as sensitive to outliers as ordinary linear regression, which in effect models the mean. Secondly, and most significantly for my purposes here, quantile regression allows us to estimate not only the central values of a distribution, e.g. the mean or median, but also how other parts of the distribution are (possibly) changing.
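To make this concrete, here's a minimal sketch in Python using statsmodels (one common implementation; I'm not claiming it's what produced the plots below, and the data here are made up), fitting the same straight-line model at several quantiles of a synthetic time series:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic stand-in for a point-based climate time series:
# a warming trend plus noise whose spread also grows with time.
rng = np.random.default_rng(42)
year = np.arange(1920, 2020)
temp = -10.0 + 0.03 * (year - 1920) + rng.normal(0, 1 + 0.01 * (year - 1920), year.size)

X = sm.add_constant(year)  # design matrix: intercept + year

# Fit the same linear model at several quantiles. The 0.5 quantile
# is the median; 0.333 and 0.666 bound the "near normal" middle third.
for q in (0.333, 0.5, 0.666):
    res = sm.QuantReg(temp, X).fit(q=q)
    print(f"q={q:.3f}: slope = {res.params[1]:+.4f} deg/yr")
```

The only change from ordinary regression is the `q` argument: each fit minimizes an asymmetrically weighted absolute error rather than a squared error, which is what makes the estimates robust to outliers.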
As an example of this approach, below is a plot of some climate data you are probably familiar with: spring breakup dates of the Tanana River at Nenana (for this version I've used "fractional dates," which incorporate the time of day of breakup, though that does not matter for this analysis). There is no statistically significant trend into the 1960s, so I constructed the quantile regression to have zero slope in that period. The purple line is the segmented median (0.50 quantile) regression of breakup date on year, i.e. the trend. The green-shaded area is the region between the 0.333 and 0.666 quantiles. This plot should therefore partition the breakup dates into three (roughly) equal categories: one-third below the green shading (significantly early break-ups), one-third inside it (near normal), and one-third above it (significantly later than normal). From this, it's easy to see that break-up dates in the first days of May were solidly in the "significantly earlier than normal" category in the mid-20th century, but those same dates are now in the "significantly later than normal" category.
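The "segmented" part can be handled with a hinge term in the design matrix: the slope is forced to zero before a breakpoint and is linear after it. A minimal sketch, where the 1965 breakpoint and the synthetic data are my assumptions for illustration, not the fitted values behind the plot:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical stand-in for the Nenana breakup series: flat until an
# assumed breakpoint near 1965, then trending earlier, plus noise.
rng = np.random.default_rng(1)
year = np.arange(1917, 2018)
doy = 126 - 0.08 * np.clip(year - 1965, 0, None) + rng.normal(0, 5, year.size)

t0 = 1965
hinge = np.maximum(year - t0, 0)  # zero before t0, linear after
X = np.column_stack([np.ones_like(year), hinge])

# Median plus the quantiles bounding the "near normal" middle third.
for q in (0.333, 0.5, 0.666):
    res = sm.QuantReg(doy, X).fit(q=q)
    print(f"q={q:.3f}: level = {res.params[0]:6.1f}, "
          f"post-{t0} slope = {res.params[1]:+.3f} days/yr")
```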
The quantile regression I've presented here allows us to make reasonable estimates of the current distribution of some climate variables in the face of change. This simple linear approach is not likely to be sufficient in the future, though. For instance, in looking at the Tanana at Nenana breakup dates, I suspect that we are starting to (or soon will) butt up against astronomical constraints on how early breakup can occur, given the expected terrestrial climate forcing in the next century; e.g. a solar noon sun angle of 30° above the horizon (Nenana on April 1) can only do so much heating. In that scenario, we'll need to employ non-linear techniques. But that's a topic for another day.
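As a back-of-the-envelope check on that 30° figure (using Cooper's declination approximation and a latitude of roughly 64.6°N for Nenana, both my own inputs):

```python
import math

# Solar noon altitude is about 90 - latitude + declination.
lat = 64.56  # approximate latitude of Nenana, deg N
# Cooper's approximation for solar declination, day 91 (April 1):
decl = 23.45 * math.sin(math.radians(360 * (284 + 91) / 365))
print(f"declination ~ {decl:+.1f} deg, noon sun angle ~ {90 - lat + decl:.1f} deg")
# -> roughly +4 deg declination, so a noon sun angle near 30 deg
```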
___________________________
Updated to respond to Richard's comments and questions of Aug 21.
Here's a plot of the quantile regression slope at 0.05 increments and the associated confidence intervals (90% level) for the Alaska statewide late winter (JFM) temperatures (data plotted above). In this case both tails show more spread in the confidence intervals than most of the middle, which I would expect. One wonders, though, what's going on at the 0.60 and 0.65 quantiles.
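This kind of slope-versus-quantile scan is straightforward to script. Here's a minimal sketch in Python/statsmodels, with synthetic data standing in for the JFM series (the actual data and settings behind the plot are assumptions on my part):

```python
import numpy as np
import statsmodels.api as sm

# Synthetic stand-in for a ~90-year JFM temperature series.
rng = np.random.default_rng(7)
year = np.arange(1925, 2018)
jfm = -12.0 + 0.04 * (year - 1925) + rng.normal(0, 3, year.size)

X = sm.add_constant(year)

# Scan the trend slope across quantiles 0.05..0.95 with 90% CIs,
# mirroring the plot above.
for q in np.arange(0.05, 0.96, 0.05):
    res = sm.QuantReg(jfm, X).fit(q=q)
    lo, hi = res.conf_int(alpha=0.10)[1]  # 90% CI on the slope
    print(f"q={q:.2f}: slope {res.params[1]:+.4f}  CI [{lo:+.4f}, {hi:+.4f}]")
```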
Here is some data with a more problematic structure. This is over a century of first autumn freeze dates at the Experiment Farm at UAF. I've included the segmented median and the "near normal" category (0.333 to 0.666 quantiles):
If we push it out even further and make it even more fine-grained (quantiles 0.02 to 0.98 at every 0.01), more artifacts emerge, such as occasional spikes in the bounds and then the impossibly small confidence interval above the 95th percentile. For me, the moral of this story is that it's important to do this exploratory review first, especially if the focus is on the far extremes of the distributions, where other tools are potentially better suited.
Very interesting, Rick. I've used quantile regression in a few contexts but this discussion is helpful.
I'm curious as to how quickly the confidence intervals widen as you run the regression farther out in the tails; for example, presumably tercile regression estimates are quite stable but decile estimates much less so (e.g. if you remove one or more outliers). I wonder if any general statements can be made about this for a typical ~100 year climate record.
Have you tried this approach on any highly non-Gaussian data?
Richard, I've added a section with some more info on the quantile regression and the confidence intervals. Based on my limited experience, analyses of the tails are strongly dependent on the detailed structure of the distribution.
Rick, the additional plots are interesting and revealing - thanks. It's curious that both of these examples have higher uncertainty on the low end than on the high end; I don't suppose that can be generally true.
I imagine that one would need to see a "significant" slope in these quantile-dependent regression slopes to be able to claim that asymmetric changes have occurred in the opposite tails. In the JFM example, while the confidence interval of the lower quantiles crosses zero, it is also wider than at higher quantiles, so the slope could also plausibly be greater than the median, no?
Richard, I believe you're quite right. More basically, I should have tested for changing variance (e.g. an F-test on independent subsets of the data, or a Breusch-Pagan test on the full data). Either way, there's no evidence to support the notion of changing variance. Thanks for the corrective to my thinking.
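For anyone wanting to try those checks, here's roughly what I mean, sketched in Python/statsmodels on synthetic data (the series and the half-way split point are placeholders, not the actual statewide data):

```python
import numpy as np
import scipy.stats as st
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Placeholder series: linear trend plus constant-variance noise.
rng = np.random.default_rng(3)
year = np.arange(1918, 2018)
y = 0.02 * (year - 1918) + rng.normal(0, 2, year.size)

# Detrend first so the variance tests aren't fooled by the trend.
X = sm.add_constant(year)
resid = sm.OLS(y, X).fit().resid

# 1) F-test: compare residual variance in the first vs second half.
a, b = resid[:50], resid[50:]
F = a.var(ddof=1) / b.var(ddof=1)
p = 2 * min(st.f.cdf(F, a.size - 1, b.size - 1),
            st.f.sf(F, a.size - 1, b.size - 1))  # two-sided
print(f"F = {F:.2f}, p = {p:.3f}")

# 2) Breusch-Pagan: do squared residuals depend on year?
lm, lm_p, fval, f_p = het_breuschpagan(resid, X)
print(f"Breusch-Pagan LM p = {lm_p:.3f}")
```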
"This was underlying the many comments I heard about how cold the winter of 2016-17 was in Alaska. Of course, through the multi-decade lens, it wasn't notably cold for the winter (though parts of the state were, by any measure, cold in March). So that got me to thinking: given that many climate variables in Alaska are changing, how can we provide estimates of "normal" and associated variability that take into account the ongoing changes?"
I suggest a new measure that devolves from statistics...BTUs of fuel consumed to heat a dwelling per winter, by location. Ours was up on the valley floor. Perhaps at elevation, above the valley inversion, it was not?
Stats are fun, but $ measures.
Gary
You bet, Gary: dollars are where the rubber meets the road, so to speak. But BTUs of fuel consumed depend strongly on all kinds of things besides temperature: how well insulated the building is, what the target indoor temperature is, and how often and for how long doors are kept open. Degree days are one common way to capture the outside-temperature part; otherwise, your mileage may vary.
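For reference, the usual heating-degree-day tally is simple to compute; a minimal sketch (the daily temperatures are hypothetical, with the conventional 65°F base):

```python
import numpy as np

# Heating degree days: each day contributes max(0, base - daily mean).
daily_mean_f = np.array([10.0, -5.0, 32.0, 58.0, 70.0])  # made-up temps
base = 65.0
hdd = np.maximum(base - daily_mean_f, 0).sum()
print(f"Heating degree days: {hdd:.0f}")
```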
Thanks for this analysis Rick. Good stuff.
I'll dig out our heating bills for the last few years. Same house, same Toyo stoves, same internal temps. I'm just curious what the changes were in annual (May-May) gallons consumed. But yes, the degree days are best.
I suspect it wasn't deep cold so much as prolonged moderately low temps last winter, but...?
Gary
Winter 2016-17 was certainly colder than the previous three in Fairbanks-land (yeah). But not notably cold in a longer (multi-decade) view. March, though, was definitely one to write home about.