This package contains a collection of datasets designed to accompany the textbook “Forecasting: Methods and Applications” by Makridakis, Wheelwright & Hyndman (Wiley, 3rd ed., 1998).
When the fma package is loaded, the forecast package is also loaded, providing the functions to fit models and compute forecasts.
This vignette gives a brief introduction to the ways in which the datasets in the fma package can be analysed using the forecast package. Most of the figures from Chapter 2 are reproduced here. The figure numbers used here correspond to figures in the textbook.
The fma package includes a number of data sets that are referenced in examples throughout the book. The data sets used in this vignette are described below:
beer: Monthly Australian beer production (in megalitres) from January 1991 to August 1995.
elec: Monthly Australian electricity production (in million kWh) from January 1956 to August 1995.
milk: Monthly milk production per cow (in pounds) over a 14-year period.
The single most important thing to do when first exploring the data is to visualize it through graphs. The basic features of the data, including patterns and unusual observations, are most easily seen through graphs. Sometimes graphs also suggest possible explanations for some of the variation in the data.
Chapter 2 describes three main types of graph that are useful for forecasting. R code that can be used to reproduce these examples will be shown below.
For time series data the most obvious graphical form is a time plot (in which the data are plotted over time). A time plot immediately reveals any trends over time, any regular seasonal behavior, and other systematic features of the data.
Figure 2-1 below shows a time plot of the beer data. This reveals the range of the data and the time at which peaks occur, the relative size of the peaks and the randomness in the time series.
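The code for this time plot does not appear above. A minimal sketch of how such a plot might be produced is given here, assuming the fma, forecast and ggplot2 packages are installed. The title and axis labels are illustrative choices, not necessarily those used in the textbook.

```r
library(fma)       # provides the beer data set
library(forecast)  # provides the autoplot() method for ts objects
library(ggplot2)   # provides ggtitle(), xlab() and ylab()

# Time plot of the monthly beer series (labels are illustrative assumptions)
beer_time_plot <- autoplot(beer) +
  ggtitle("Monthly Australian Beer Production") +
  xlab("Year") + ylab("Megalitres")
beer_time_plot
```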
For time series data that have seasonal patterns, it is often useful to also produce a seasonal plot. This consists of the data plotted against the individual “seasons” in which the data were observed. It is something like a time plot except that the data from each season are overlapped. A seasonal plot makes the underlying seasonal pattern more obvious and makes deviations from this pattern easier to identify. Figure 2-2 shows a seasonal plot of the beer data.
This can be done using the ggseasonplot() function from the forecast package:
library(fma)      # data sets; the forecast package is loaded with it
library(ggplot2)  # ggtitle(), xlab(), ylab(), labs()
library(dplyr)    # %>%, filter(), mutate(), summarise()

ggseasonplot(beer, col = rainbow(5), year.labels = TRUE) +
  ggtitle('Monthly Australian Beer Production') +
  xlab('Months') + ylab('Megalitres') +
  labs(caption = 'Figure 2-2: A seasonal plot of the Australian beer production data. Note that production peaks in
November and December in preparation for the southern hemisphere summer and is least in winter.')
Time plots and seasonal plots are not appropriate for cross-sectional data. Patterns in these data can be readily visualised with scatterplots. Scatterplots help to visualise the relationships between variables of interest in a data set. Figure 2-3 shows a plot of two variables from the auto data set.
auto %>%
  ggplot(aes(x = Mileage, y = Price)) +
  geom_point() +
  xlab("Mileage (mpg)") + ylab("Price ($US)") +
  ggtitle("Price/Mileage Relationship for 45 Automobiles") +
  labs(caption = 'Figure 2-3: A scatterplot of price versus mileage for the automobile data.')
The auto data also has an additional explanatory variable, Country. It is a categorical variable, so it cannot be plotted against Price in the same way as the continuous Mileage variable. However, we can include the Country information in our scatterplot with a shape aesthetic. This is displayed in Figure 2-4 below.
auto %>%
  ggplot(aes(x = Mileage, y = Price, shape = Country)) +
  geom_point(size = 2) +
  xlab("Mileage (mpg)") + ylab("Price ($US)") +
  ggtitle("Price/Mileage Relationship for 45 Automobiles") +
  labs(caption = 'Figure 2-4: A scatterplot showing price, mileage, and the country of origin for the automobile data.')
In addition to graphics, it is also helpful to provide numerical summaries. A summary number for a data set is called a statistic.
Table 2-4 shows the computation of some useful univariate statistics for the Mileage variable in the auto dataset, after filtering the data set to only include cars from Japan. These statistics are defined and explained in the textbook. The following code produces these summary statistics:
auto_japan <- auto %>%
  filter(Country == 'Japan')
auto_japan
#>                     Model Country Mileage Price
#> 1            Mazda MPV V6   Japan      19 14944
#> 2            Nissan Van 4   Japan      19 14799
#> 3         Acura Legend V6   Japan      20 24760
#> 4      Mitsubishi Wagon 4   Japan      20 14929
#> 5         Nissan Axxess 4   Japan      20 13949
#> 6     Mitsubishi Sigma V6   Japan      21 17879
#> 7         Nissan Stanza 4   Japan      21 11650
#> 8            Mazda 929 V6   Japan      21 23300
#> 9        Nissan Maxima V6   Japan      22 17899
#> 10      Toyota Cressida 6   Japan      23 21498
#> 11         Nissan 240SX 4   Japan      24 13249
#> 12        Subaru Loyale 4   Japan      25  9599
#> 13    Mitsubishi Galant 4   Japan      25 10989
#> 14 Honda Prelude Si 4WS 4   Japan      27 13945
#> 15            Subaru XT 4   Japan      28 13071
#> 16        Mazda Protege 4   Japan      32  6599
#> 17   Honda Civic CRX Si 4   Japan      33  9410
#> 18         Subaru Justy 3   Japan      34  5866
#> 19        Toyota Tercel 4   Japan      35  6488
Table 2-4
auto_japan %>%
  summarise(mean     = mean(Mileage),
            median   = median(Mileage),
            MAD      = sum(abs(Mileage - mean(Mileage))) / n(),  # mean absolute deviation
            MSD      = sum((Mileage - mean(Mileage))^2) / n(),   # mean squared deviation
            Variance = var(Mileage),
            Std_Dev  = sd(Mileage))
#>       mean median      MAD      MSD Variance  Std_Dev
#> 1 24.68421     23 4.371191 27.05817  28.5614 5.344287
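For reference, the statistics computed by the code above can be written out as follows, where \(Y_1, \dots, Y_n\) are the \(n = 19\) mileage values. This is a sketch that follows the R code shown here (in particular, var() and sd() divide by \(n - 1\)); the notation in the textbook may differ.

```latex
\begin{align*}
\bar{Y} &= \frac{1}{n} \sum_{i=1}^{n} Y_i, \\
\text{MAD} &= \frac{1}{n} \sum_{i=1}^{n} \lvert Y_i - \bar{Y} \rvert, \\
\text{MSD} &= \frac{1}{n} \sum_{i=1}^{n} (Y_i - \bar{Y})^2, \\
S^2 &= \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2, \qquad S = \sqrt{S^2}.
\end{align*}
```

As a check on the output, MSD and the variance differ only in the divisor, so \(\text{MSD} = \frac{n-1}{n} S^2 = \frac{18}{19} \times 28.5614 \approx 27.058\), which agrees with the value reported above.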
This section introduces the concepts of covariance, correlation and autocorrelation. Table 2-6 below shows the calculation of these bivariate statistics for the auto data with Country == "Japan" (the same data used in Table 2-5). The variables being considered are Price and Mileage. For this section the units of Price are converted to thousands of dollars. The statistics are calculated below:
Table 2-6
auto_japan %>%
  mutate(Price = Price / 1000) %>%  # convert Price to thousands of dollars
  summarise(mean_mileage = mean(Mileage),
            mean_price   = mean(Price),
            sd_mileage   = sd(Mileage),
            sd_price     = sd(Price),
            covariance   = cov(Price, Mileage),
            correlation  = cor(Price, Mileage))
#>   mean_mileage mean_price sd_mileage sd_price covariance correlation
#> 1     24.68421   13.93805   5.344287 5.400006  -21.00154  -0.7277246
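The link between the covariance and the correlation in this output can be made explicit. The sketch below writes \(X\) for Mileage and \(Y\) for Price (in thousands of dollars) and uses the \(n - 1\) divisor that R's cov() applies; the textbook's notation may differ.

```latex
\begin{align*}
\text{Cov}_{XY} &= \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}), \\
r_{XY} &= \frac{\text{Cov}_{XY}}{S_X S_Y}
        = \frac{-21.00154}{5.344287 \times 5.400006}
        \approx -0.7277.
\end{align*}
```

Dividing the covariance by the two standard deviations removes the units, so the correlation always lies between -1 and 1.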
The covariance and correlation coefficient are statistics that measure the extent of the linear relationship between two variables and can be used to identify explanatory relationships. Autocovariance and autocorrelation are comparable measures that serve the same purpose for a single time series.
For example, if we compare \(Y_t\) (the observation at time \(t\)) with \(Y_{t-1}\) (the observation at time \(t-1\)), then we see how consecutive observations are related. The observation \(Y_{t-1}\) is described as lagged by one period.
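The idea of comparing a series with a lagged copy of itself can be sketched in base R. The helper autocorr_at_lag() and the toy series below are hypothetical and only for illustration. The formula is the usual sample autocorrelation, \(r_k = \sum_{t=k+1}^{n} (Y_t - \bar{Y})(Y_{t-k} - \bar{Y}) / \sum_{t=1}^{n} (Y_t - \bar{Y})^2\), which is assumed here to match the definition in the textbook and is the one used by stats::acf().

```r
# Lag-k sample autocorrelation (hypothetical helper for illustration)
autocorr_at_lag <- function(y, k) {
  n <- length(y)
  stopifnot(k >= 1, k < n)
  dev <- y - mean(y)  # deviations from the overall mean
  # Products of each deviation with the deviation k periods earlier
  sum(dev[(k + 1):n] * dev[1:(n - k)]) / sum(dev^2)
}

y <- c(12, 15, 11, 18, 14, 20, 16, 22)  # hypothetical toy series

r1 <- autocorr_at_lag(y, 1)

# Compare with the built-in estimate; element 1 of $acf is lag 0
r1_builtin <- acf(y, lag.max = 1, plot = FALSE)$acf[2]
all.equal(r1, r1_builtin)
```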
Together the autocorrelations at lags 1, 2, … make up the autocorrelation function or ACF. It is much easier to understand the autocorrelations by plotting them against the lag. This plot is known as a correlogram. Figure 2-6 shows the ACF for the beer data.
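The code for the correlogram does not appear in this text. One way it might be drawn is with ggAcf() from the forecast package, as in the sketch below. It assumes the fma, forecast and ggplot2 packages are installed, and the title is an illustrative choice.

```r
library(fma)       # provides the beer data set
library(forecast)  # provides ggAcf()
library(ggplot2)   # provides ggtitle()

# Correlogram of the monthly beer series (title is an illustrative assumption)
beer_acf_plot <- ggAcf(beer) +
  ggtitle("ACF of Monthly Australian Beer Production")
beer_acf_plot
```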
In this section of the book, a variety of measures of forecasting (or modelling) accuracy are defined. Functions from the forecast package can be used to calculate these accuracy measures.
The textbook refers to a simple forecasting method called Naive Forecast 1 (NF1) which is used as a benchmark against which to compare more sophisticated methods. This method uses the most recent observation available as a forecast. Table 2-11 shows NF1 used to forecast the monthly beer production for the year of 1995.
window(beer, start = c(1994, 12)) %>%
  naive() %>%
  accuracy()
#>                  ME     RMSE    MAE       MPE     MAPE MASE       ACF1
#> Training set -3.625 25.28587 22.125 -3.724331 15.88498  NaN -0.3339436
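To show what these measures summarise, the sketch below applies NF1 by hand to a short series in base R. The toy values are hypothetical. The error definitions (actual minus forecast, with percentage errors taken relative to the actual value) are assumed to match those used by accuracy().

```r
y <- c(100, 110, 105, 120)  # hypothetical toy series

# NF1: the forecast for period t is the observation at period t - 1
nf1_forecast <- head(y, -1)
actual       <- tail(y, -1)
errors       <- actual - nf1_forecast

nf1_accuracy <- c(
  ME   = mean(errors),                       # mean error
  RMSE = sqrt(mean(errors^2)),               # root mean squared error
  MAE  = mean(abs(errors)),                  # mean absolute error
  MPE  = mean(100 * errors / actual),        # mean percentage error
  MAPE = mean(abs(100 * errors / actual))    # mean absolute percentage error
)
nf1_accuracy
```

Because the first observation has no earlier value to use as a forecast, a series of length n yields n - 1 errors.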
It can be helpful to look at the autocorrelations in forecast errors. Figure 2-7 shows the forecast errors obtained by applying the NF1 method to the Australian monthly beer data.
Sometimes adjusting the data will lead to a simpler and more interpretable forecasting model. This section of the book deals with three kinds of adjustments. Examples of each are given below:
Figure 2-10 below shows a plot of the elec data set.
elec %>%
  autoplot() +
  ggtitle("Australian Monthly Electricity Production") +
  xlab("Year") + ylab("million kWh") +
  labs(caption = "Figure 2-10: Monthly Australian electricity production from January 1956 to August 1995.
Note the increasing variation as the level of the series increases.")
This time plot shows that the size of the annual seasonal variation increases as the level of the series increases. The variation at the start (towards the left) is about 300 million kWh, while in more recent years (to the right) it exceeds 2500 million kWh. A mathematical transformation is a convenient way to account for this increasing variation when forecasting these data.
One possible transformation is the square root function. This transform is plotted below:
elec %>%
  sqrt() %>%
  autoplot() +
  ggtitle("Square Root of Electricity Production") +
  xlab("Year") + ylab("sqrt(million kWh)")
This transformation has helped to reduce the growth in the variation of the seasonal cycles, which makes the series easier to forecast than the untransformed data shown in Figure 2-10.
There are a number of other useful transformations. One of the most common is the logarithm, as it is relatively easy to interpret. Section 2/7 of the textbook gives the formulation for the set of power transformations. The plots below show a selection of these transformations.
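A family of power transformations can be sketched as a small base R function. The form used here (\(Y^p\) for \(p > 0\), \(\log Y\) for \(p = 0\), and \(-Y^p\) for \(p < 0\), so that the ordering of the data is preserved) is an assumption about the textbook's formulation and should be checked against Section 2/7. The helper power_transform() and the toy values are hypothetical.

```r
# Power transformation of a positive series (hypothetical helper;
# the exact form is an assumption, see the note above)
power_transform <- function(y, p) {
  stopifnot(all(y > 0))
  if (p > 0) {
    y^p
  } else if (p == 0) {
    log(y)
  } else {
    -(y^p)  # the minus sign keeps larger values of y larger after transforming
  }
}

y <- c(1, 4, 9, 16)  # hypothetical positive values

power_transform(y, 0.5)  # square root
power_transform(y, 0)    # logarithm
power_transform(y, -1)   # negative reciprocal
```

With real data, the same function could be applied to a series such as elec before plotting, in the same way that sqrt() is used above.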
Some of the variation in a time series may be due to the variation in the number of days (or trading days) each month. It is a good idea to adjust for this known source of variation to allow study of other interesting features.
Month length can have quite an impact, since the number of days in a month can differ by (31-28)/30 = 10%. If this variation is not removed, seasonal patterns become hard to interpret. Section 2/7/2 in the textbook shows how to make this adjustment.
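The effect of the adjustment can be seen in a small base R sketch. The monthly totals below are hypothetical and are generated at a constant daily rate, so any month-to-month differences come from month length alone. Rescaling the per-day rate to an average month of 365.25/12 days is an assumption about the form the textbook uses; dividing by the number of days alone removes the same variation.

```r
# Days in January to April of a non-leap year
days_in_month <- c(Jan = 31, Feb = 28, Mar = 31, Apr = 30)

# Hypothetical monthly totals generated at a constant 20 units per day
monthly_total <- 20 * days_in_month

# Dividing by the month length gives a per-day rate
per_day <- monthly_total / days_in_month

# Rescale to an average-length month (assumed form: 365.25 / 12 days)
adjusted <- per_day * 365.25 / 12

rbind(monthly_total, per_day, adjusted)
```

The raw totals vary from month to month, while both adjusted series are constant, which is the variation the adjustment is meant to remove.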
Figure 2-12 shows the milk data set with and without adjustment for month length. The simpler pattern of the adjusted series will lead to better forecasts and easier identification of unusual observations.
cbind(
  Milk = milk,
  `Milk per day` = milk / monthdays(milk)  # divide by the number of days in each month
) %>%
  autoplot(facets = TRUE) +
  ggtitle("Monthly Milk Production per Cow") +
  xlab("Months") + ylab("Pounds") +
  labs(caption = "Figure 2-12: Monthly milk production per cow over 14 years.
The second graph shows the data adjusted for the length of the month.")