Last fall I drafted a post that presented a measure of temperature extremeness that I developed using the Chi-Square Goodness-of-Fit test statistic. The test compares an observed distribution with an expected distribution. This is a lengthy follow-up to that post. I have submitted a journal article describing the methodology; it is still in peer review.

If we assume that temperatures are normally distributed, we can compare the standardized temperature anomalies with widely published z-score tables. For example, 68% of observations will be within 1 standard deviation of the mean if the data are normally distributed. The first chart below shows the distribution of daily temperatures in Fairbanks by standard deviation categories since 1930 (note: all standard deviation values are based on an adjustment to the published NCDC values that Richard developed last year). The year 1930 was chosen because data quality before that year is often poor. The categories in the first chart below are 1/2 standard deviation wide and take into account the sign of the anomaly. The categories on the ends (over/under 1.5 standard deviations) are grouped together so that those categories are not too small. Notice the bias toward temperatures being above normal versus below normal. This is most likely a reflection of the upward trend in low temperatures over the years. The second chart groups the data in whole standard deviation increments without regard to sign.

**On both charts, the blue line is the expected value based on the normal distribution.**
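As a rough illustration of where expected values like the blue line come from, here is a minimal Python sketch that derives the expected percent of days in each signed half-standard-deviation category from the normal cumulative distribution. The function and variable names are my own for illustration, and this is not the code used to draw the charts.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Cumulative probability of the standard normal distribution at z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Signed half-standard-deviation bin edges; the open-ended tails
# collect everything below -1.5 and above +1.5 standard deviations.
edges = [float("-inf"), -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, float("inf")]

def expected_percentages(bin_edges):
    """Expected percent of days in each bin if anomalies are normal."""
    return [100.0 * (normal_cdf(hi) - normal_cdf(lo))
            for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]

signed = expected_percentages(edges)
for lo, hi, pct in zip(edges[:-1], edges[1:], signed):
    print(f"{lo:+5.1f} to {hi:+5.1f} SD: {pct:5.2f}%")

# The four bins between -1 and +1 together give the familiar 68%.
within_one_sd = sum(signed[2:6])
print(f"Within 1 SD: {within_one_sd:.1f}%")
```

The eight bins sum to 100%, and each grouped tail holds a little under 7% of days.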

**2013:**

So how did the distribution of temperatures in 2013 look? Using the same groupings as the two charts above, what clearly stands out is the large number of values over +/- 1.5 standard deviations from the mean in the first chart below and over +/- 2.0 standard deviations from the mean in the second chart below.

In fact, there were a remarkable 54 days where the daily average temperature was < -2 standard deviations or > +2 standard deviations from the daily mean. The year with the second greatest count of such days is 1992 with a total of 40. Using the normal distribution, the expected number of days in any given year to reach that threshold is 16.6. (Note: the y-axis in the charts above uses 'Percent of Days' and the table below uses a raw count.)
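The expected count of 16.6 days can be checked with a few lines of Python. This sketch assumes a 365-day year; the names are illustrative only.

```python
from math import erf, sqrt

def two_sided_tail(z):
    """Probability that a standard normal value is more than z SDs from the mean."""
    return 1.0 - erf(z / sqrt(2.0))

DAYS_PER_YEAR = 365  # assumption: a non-leap year

expected_days = DAYS_PER_YEAR * two_sided_tail(2.0)
print(f"Expected days beyond 2 SD: {expected_days:.1f}")

# Counts quoted in the post for the two most extreme years.
for year, observed in [(2013, 54), (1992, 40)]:
    ratio = observed / expected_days
    print(f"{year}: {observed} days, {ratio:.1f} times the expected count")
```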

**Chi-Square Goodness of Fit:**

There are many goodness-of-fit tests used in statistics. The benefit of using chi-square is that you can use ratio, interval, or ordinal data, and the significance threshold is a function of the number of categories more so than the size of the sample. For each category, the value is calculated by the formula (O-E)^2 / E, where O is the observed category frequency and E is the expected category frequency. You then add up the values for all categories to get the test statistic. Ideally, no category will have an expected frequency of less than 5%. That is why I grouped categories at the tails of the distribution. When the observed frequency within a category is close to the expected value, the chi-square value for that category is low. The table below shows a sample calculation for 2013 using 3 categories.
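To make the formula concrete, here is a minimal Python sketch of the calculation. The expected percentages are the normal-distribution values for the three unsigned categories (0 to 1, 1 to 2, and more than 2 standard deviations). The observed percentages are made up purely for illustration and are not the 2013 values.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Expected percent of days within 1 SD, between 1 and 2 SD, and beyond 2 SD.
expected_pct = [68.27, 27.18, 4.55]

# Hypothetical observed percentages (illustrative only, not real data).
observed_pct = [60.0, 30.0, 10.0]

stat = chi_square_statistic(observed_pct, expected_pct)
print(f"Chi-square statistic: {stat:.2f}")

# A year that matches the expected distribution exactly scores zero.
perfect = chi_square_statistic(expected_pct, expected_pct)
print(f"Perfect match: {perfect:.2f}")
```

Note how the small tail category dominates the total: a modest excess of extreme days contributes far more than a larger shortfall in the middle category.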

In this case, everything over 2 standard deviations was grouped together to ensure that the expected frequencies were at least 5%. **As it turns out, the value of 27.0 is not just the largest on record; it is the largest by a wide margin**. The next highest annual chi-square value is 10.0 in 1992. **Using this number of categories (3), any value greater than 5.99 indicates a distribution that differs significantly from normal (at the 0.05 significance level).** The next chart shows the value of the chi-square test statistic (orange line) along with the frequency distribution for the standard deviation categories between 1930 and 2013.
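The 5.99 threshold can be reproduced without a statistics library, assuming it is the 0.05 critical value with 2 degrees of freedom (number of categories minus one). With 2 degrees of freedom the chi-square tail probability has the closed form exp(-x/2). The sketch below uses that shortcut, so it applies only to the 3-category case; the function names are my own.

```python
from math import exp, log

def chi2_tail_2df(x):
    """P(chi-square > x) for 2 degrees of freedom: exp(-x / 2)."""
    return exp(-x / 2.0)

def chi2_critical_2df(alpha):
    """Value exceeded with probability alpha for 2 degrees of freedom."""
    return -2.0 * log(alpha)

critical = chi2_critical_2df(0.05)
print(f"Critical value at alpha = 0.05: {critical:.2f}")

# Chi-square values quoted in the post for the two largest years.
for year, stat in [(2013, 27.0), (1992, 10.0)]:
    p_value = chi2_tail_2df(stat)
    print(f"{year}: chi-square = {stat:.1f}, tail probability = {p_value:.2g}")
```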

**Different Grouping Strategy:**

The chart above uses groups that are a whole standard deviation in size and ignores their signs. However, sign matters. Remember that 68% of observations are expected to fall within 1 standard deviation of the mean. If a year had exactly 68% of days within 1 standard deviation, but 50% were between -1 and 0 standard deviations and 18% were between 0 and +1 standard deviations, we would start to think that the temperatures weren't quite normally distributed after all.

If we look at the frequency distributions using 0.5 standard deviation categories, the numbers look somewhat different. In fact, the three years with the largest chi-square values are different when the grouping strategy is different. The first chart below shows the standard deviation categories for six years using whole standard deviation units without regard to sign. The rightmost columns show the chi-square values for those years. The three largest chi-square years were among the six years shown.

The next chart shows the distribution of standard deviation categories using 1/2-standard deviation units **with** signs taken into account. To preserve the 5% rule described earlier, categories were grouped together at the tails. The same six years that are shown in the table above are also shown in the table below. In this case, the top three years are different. Note that the significance threshold for the chi-square statistic is larger when there are 8 categories (14.07 vs. 5.99).

As mentioned earlier, taking signs into account can reveal more of the story. Let's take 1987 as an example. Using the 3-category method, it had the 20th (out of 87) most extreme temperature distribution. However, using the 8-category method, it had the most extreme temperature distribution of all. Why is that? If you don't look at signs, you would expect 13.3% of days to be more than 1.5 standard deviations from the mean (+/-). In 1987, it was 10.1% of days. That is certainly below normal but not especially noteworthy. If that were a grouped (unsigned) category, it would have a chi-square value of 0.77. If we take signs into account, we see that 1.6% of days were at least 1.5 standard deviations **below** the mean and 8.5% were at least 1.5 standard deviations **above** the mean. This gives a chi-square value of 4.4 just for those two categories (out of 8). 1987 was a full 4.5°F above normal, so every pair of categories was strongly skewed toward the warm side. However, when you add up companion warm/cold categories, the skew is masked.
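The 1987 arithmetic can be sketched in Python. This version assumes the chi-square terms are computed on percentages of days (which is what the quoted figures imply) and uses unrounded normal-distribution expectations, so the results land close to, but not exactly on, the 0.77 and 4.4 quoted above.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Cumulative probability of the standard normal distribution at z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def chi_square_term(observed, expected):
    """Contribution of a single category: (O - E)^2 / E."""
    return (observed - expected) ** 2 / expected

# Expected percent of days beyond 1.5 SD on one side, and on both sides.
one_tail = 100.0 * (1.0 - normal_cdf(1.5))
both_tails = 2.0 * one_tail

# 1987 percentages quoted in the post.
cold_tail, warm_tail = 1.6, 8.5
combined = cold_tail + warm_tail  # 10.1% when signs are ignored

unsigned = chi_square_term(combined, both_tails)
signed = (chi_square_term(cold_tail, one_tail)
          + chi_square_term(warm_tail, one_tail))

print(f"Expected per tail: {one_tail:.2f}%, both tails: {both_tails:.2f}%")
print(f"Unsigned (one grouped category): {unsigned:.2f}")
print(f"Signed (two separate categories): {signed:.2f}")
```

Splitting the tails raises the contribution several times over, almost all of it from the shortage of very cold days.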

A scatterplot shows that this type of occurrence is not unexpected. That is, years that are strongly above or below normal (the chart uses the absolute value of the annual temperature anomaly) often have large chi-square values. 2013 is unusual in having a large chi-square value along with a small annual temperature anomaly.

**Where does 2013 rank?**

So is 2013 the **most extreme** temperature year on record (since 1930), or is it the **6th most extreme** year on record? The answer is both. As with many endeavors in statistics, the parameters that you choose make a huge difference.

**Chart for response to comment No. 1**

This chart simulates the temperature frequency distribution in a warming climate. I generated 1000 daily temperatures with a constant mean and standard deviation. I then added an incremental background warming to the data and charted the standard deviations from the mean. Notice how the chart is similar to the first one in this post.
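For anyone who wants to try something similar, here is a minimal Python sketch of this kind of simulation. It is not the script behind the chart. The size of the warming (one standard deviation over the whole series), the choice to standardize against the fixed starting mean and standard deviation, and the random seed are all assumptions of mine. It prints a text histogram instead of drawing a chart.

```python
import random
from bisect import bisect_right

random.seed(42)  # fixed seed so the sketch is repeatable

N_DAYS = 1000
BASE_MEAN = 0.0       # constant underlying mean
BASE_SD = 1.0         # constant underlying standard deviation
TOTAL_WARMING = 1.0   # assumed drift over the series, in SD units

# Stationary noise plus a small linear warming increment each day.
temps = [random.gauss(BASE_MEAN, BASE_SD) + TOTAL_WARMING * i / (N_DAYS - 1)
         for i in range(N_DAYS)]

# Standardize against the fixed baseline, not the warmed sample.
z_scores = [(t - BASE_MEAN) / BASE_SD for t in temps]

EDGES = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
LABELS = ["< -1.5", "-1.5 to -1.0", "-1.0 to -0.5", "-0.5 to 0.0",
          "0.0 to 0.5", "0.5 to 1.0", "1.0 to 1.5", "> 1.5"]

# Count the days that fall into each signed half-SD category.
counts = [0] * len(LABELS)
for z in z_scores:
    counts[bisect_right(EDGES, z)] += 1

percent = [100.0 * c / N_DAYS for c in counts]
for label, pct in zip(LABELS, percent):
    print(f"{label:>13}: {pct:5.1f}%  " + "#" * round(pct))

warm_share = sum(percent[4:])  # categories at or above zero
print(f"Share of days above the baseline mean: {warm_share:.1f}%")
```

With the anomalies measured against a fixed baseline, the warm-side categories fill up at the expense of the cold-side ones, which is the same lopsided shape seen in the long-term Fairbanks chart.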