Before looking at any numbers, it's important to bear in mind that the forecasts are probabilistic, with the predictand being the probability that the temperature (or precipitation) will fall within the lower, middle, or upper tercile of the reference distribution. The reference period is currently 1981-2010, but prior to 2011 it was 1971-2000. Probability forecasts are a bit less intuitive to the average user, and measuring their skill is a subtler problem, but probabilities are far more useful than deterministic forecasts for decision-making and risk management.

One simple method to examine the skill of the forecasts is to find the highest of the three tercile probabilities in each forecast, and then compare that to the observed outcome. For example, a forecast of 20%-30%-50% for the below-, near-, and above-normal categories indicates that the above-normal category is more likely than either of the other two. Note that CPC forecasts nearly always assign 33% to the near-normal category, so a more realistic example would be 17%-33%-50%. After working through 10 years of forecasts, issued once per month for the closest 3-month season (i.e. "0.5 month lead" in CPC terminology), I calculated the following contingency tables:

The Fairbanks table, for example, shows that the forecast below-normal probability was highest on 4+5=9 occasions from 2005-2014; 4 of these verified as below-normal and 5 as near-normal. Considerably more forecasts (36) had the above-normal category as most likely, and 20 of these verified as above-normal, while just 4 turned out to be below-normal. On the whole we can see that the forecasts have a good ability to discriminate between outcomes: only 4 of the 45 forecasts were completely on the wrong side in Fairbanks, and according to a chi-squared test, the skill is significant at better than the 99.9% level. Note that there is no row for instances where the near-normal probability was highest, because this rarely or never happens in the CPC forecasts. Also note that 75 of the 120 forecasts showed equal probabilities for all three terciles and were therefore excluded from the analysis; in other words, the CPC did not have an opinion one way or the other more than 60% of the time.
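The tallying procedure can be sketched as follows. This is a minimal illustration, not the code behind the actual tables: the forecast and observation values here are invented, and the function names are my own.

```python
# Sketch of the verification method: find the tercile with the highest
# forecast probability and tally it against the observed outcome.
# All forecast/observation values below are hypothetical.
import numpy as np

CATEGORIES = ["below", "near", "above"]

def most_likely(probs):
    """Return the tercile with the highest forecast probability, or
    None when all three probabilities are equal (no forecast issued)."""
    if len(set(probs)) == 1:
        return None
    return CATEGORIES[int(np.argmax(probs))]

def contingency_table(forecasts, observations):
    """Tally forecast category (rows) against observed tercile (columns).
    Near-normal is rarely the most likely category in CPC forecasts, so
    only below- and above-normal rows are kept."""
    table = {f: {o: 0 for o in CATEGORIES} for f in ("below", "above")}
    for probs, obs in zip(forecasts, observations):
        fcst = most_likely(probs)
        if fcst in table:  # skip 33-33-33 and (rare) near-normal calls
            table[fcst][obs] += 1
    return table

# Three hypothetical monthly forecasts: above-favored, no-opinion, below-favored
forecasts = [(17, 33, 50), (33.3, 33.3, 33.3), (50, 33, 17)]
observations = ["above", "near", "below"]
print(contingency_table(forecasts, observations))
```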

Based on the contingency tables, the performance of the Anchorage forecasts looks similar to the Fairbanks forecasts and the level of success is fairly impressive (the significance level is again very high). However, we can see that more forecasts were made for Anchorage (72 out of 120 months), especially on the cold side; this is because confidence is usually higher for seasonal forecasts in southern Alaska.
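The chi-squared significance tests can be reproduced along these lines. The Fairbanks counts are taken from the text above, except that the near-normal outcomes under above-normal forecasts are inferred as 36 − 20 − 4 = 12, so the resulting p-value is a rough check rather than the author's exact figure.

```python
# Chi-squared test of association for a forecast/outcome contingency
# table, using the Fairbanks counts quoted in the text. The 12 in the
# second row is inferred as 36 - 20 - 4, not stated directly.
from scipy.stats import chi2_contingency

#             obs below  obs near  obs above
fairbanks = [[4,         5,        0],     # below-normal most likely
             [4,         12,       20]]    # above-normal most likely

chi2, p, dof, expected = chi2_contingency(fairbanks)
print(f"chi-squared = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```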

The Barrow forecasts are remarkable because not once did CPC indicate that below-normal was the most likely tercile, and out of 67 forecasts for above-normal, only one was decidedly wrong (summer 2014). From this standpoint the forecasts are very successful.

Another way of looking at the forecast performance is to create a scatterplot of forecast probability versus observed outcome. See below for these charts. I chose to plot the above-normal probability on the x-axis, and the y-axis shows the observed temperature anomaly (in standard deviations so that different times of the year are directly comparable). I've added a least-squares regression line (red) for all the non-33% forecasts, and the thick gray horizontal lines show the theoretical tercile boundaries for a Gaussian distribution.
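The plot elements described above can be computed as a short sketch; the Gaussian tercile boundaries are exact, but the probability/anomaly pairs below are invented for illustration.

```python
# Tercile boundaries for a standard Gaussian, plus the least-squares fit
# of observed anomaly against forecast above-normal probability. The
# data points are hypothetical; only the boundary values are exact.
import numpy as np
from scipy.stats import norm

# The middle tercile of a Gaussian spans roughly -0.43 to +0.43 sd:
lower, upper = norm.ppf(1 / 3), norm.ppf(2 / 3)

# Fit the regression line, excluding the 33% "no opinion" forecasts:
prob_above = np.array([40.0, 50.0, 60.0, 45.0, 55.0])  # percent (hypothetical)
anomaly = np.array([0.2, 0.5, 1.1, -0.1, 0.8])         # standard deviations
slope, intercept = np.polyfit(prob_above, anomaly, 1)
print(f"terciles: ({lower:.3f}, {upper:.3f}), slope: {slope:.4f}")
```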

We can see a nice positive slope for the regression line in Fairbanks and Anchorage, which means that the higher the forecast probability of above-normal, the warmer the outcome is found to be on average - as we expect. We can also see that for Anchorage the regression line intersects the x-axis right at 33%, which is what it should do in theory. However, if the forecasts were perfectly calibrated then the line would cross into the upper tercile at 50%, whereas in fact it's more like 42%. This suggests to me that the Anchorage forecasts are actually *under-confident*: the CPC should be saying 60% or 70% probability of above-normal when it's only saying 50%. I'll have to explore this idea more in future with a larger set of stations.
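This calibration check amounts to solving the fitted line for the probability at which the expected anomaly reaches the upper tercile boundary. The slope and intercept below are hypothetical values chosen to mimic the roughly 42% crossing just described, not the actual fitted coefficients.

```python
# Probability at which a fitted regression line crosses into the upper
# tercile. For a well-calibrated forecast this happens at 50%; a lower
# crossing point indicates under-confidence. Slope and intercept are
# hypothetical, chosen to reproduce the ~42% crossing in the text.
from scipy.stats import norm

slope, intercept = 0.031, -0.87  # hypothetical fit: anomaly = slope*prob + intercept
upper_tercile = norm.ppf(2 / 3)  # ~ +0.43 standard deviations

crossing = (upper_tercile - intercept) / slope
print(f"line enters the upper tercile at {crossing:.0f}% probability")
```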

The result for Barrow is again very interesting as it appears the forecasts have little or no ability to discriminate between small and large deviations from normal. Strictly speaking, probability forecasts are not designed to do that, but successful forecasts usually do. In this case, we find that the temperature ended up in the above-normal tercile 76% of the time whenever CPC predicted above-normal; but when CPC made no forecast, the temperature was still above-normal 72% of the time. I conclude that although the Barrow forecasts are very successful from a probability standpoint, the majority of the skill is coming from the long-term warming trend. In other words, if CPC predicted above-normal every single time, the success rate would be nearly as good.
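The comparison against a trend-following baseline reduces to two conditional hit rates; the percentages below are the ones quoted in the text.

```python
# Hit rates for Barrow, from the text: above-normal verified 76% of the
# time when CPC forecast above-normal, but 72% of the time even when CPC
# issued no forecast, so most of the apparent skill comes from the
# long-term warming trend rather than the forecasts themselves.
hit_rate_with_forecast = 0.76  # P(obs above-normal | forecast above-normal)
hit_rate_no_forecast = 0.72    # P(obs above-normal | equal-odds forecast)

marginal_skill = hit_rate_with_forecast - hit_rate_no_forecast
print(f"improvement over trend-following baseline: {marginal_skill:.0%}")
```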
