Skip to contents

Compute surprisals or surprisal probabilities from a model or a data set. A surprisal is given by \(s = -\log f(y)\) where \(f\) is the density or probability mass function of the estimated or assumed distribution, and \(y\) is an observation. A surprisal probability is the probability of a surprisal at least as extreme as \(s\).

The surprisal probabilities may be computed in three different ways.

  1. Given the same distribution that was used to compute the surprisal values. Under this option, surprisal probabilities are equal to 1 minus the coverage probability of the largest HDR that contains each value. Surprisal probabilities smaller than 1e-6 are returned as 1e-6.

  2. Using a Generalized Pareto Distribution fitted to the most extreme surprisal values (those with probability less than threshold_probability). This option is used if approximation = "gpd". For surprisal probabilities greater than threshold_probability, the value of threshold_probability is returned. Under this option, the distribution is used for computing the surprisal values but not for determining their probabilities. Due to extreme value theory, the resulting probabilities should be relatively insensitive to the distribution used in computing the surprisal values.

  3. Empirically as the proportion of observations with greater surprisal values. This option is used when approxiation = "empirical". This is also insensitive to the distribution used in computing the surprisal values.

Usage

surprisals(
  object,
  probability = TRUE,
  approximation = c("none", "gpd", "empirical"),
  threshold_probability = 0.1,
  ...
)

# Default S3 method
surprisals(
  object,
  probability = TRUE,
  approximation = c("none", "gpd", "empirical"),
  threshold_probability = 0.1,
  distribution = dist_kde(object, multiplier = 2, ...),
  loo = FALSE,
  ...
)

Arguments

object

A model or numerical data set

probability

Should surprisal probabilities be computed, or the surprisal values?

approximation

Character string specifying the approximation to use in computing the surprisal probabilities. Ignored if probability = FALSE. : none specifies that no approximation is to be used; gpd specifies that the Generalized Pareto distribution should be used; while empirical specifies that the probabilities should be estimated empirically.

threshold_probability

Probability threshold when computing the GPD approximation. This is the probability below which the GPD is fitted. Only used if approximation = "gpd").

...

Other arguments are passed to the appropriate method.

distribution

A distribution object. If not provided, a kernel density estimate is computed from the data object.

loo

Should leave-one-out surprisals be computed?

Value

A numerical vector containing the surprisals or surprisal probabilities.

Details

If no distribution is provided, a kernel density estimate is computed. The leave-one-out surprisals (or LOO surprisals) are obtained by estimating the kernel density estimate using all other observations.

See also

Author

Rob J Hyndman

Examples

# surprisals computed from bivariate data set
oldfaithful |>
  filter(duration < 7000, waiting < 7000) |>
  mutate(
    loo_fscores = surprisals(cbind(duration, waiting), loo = TRUE)
  )
#> # A tibble: 2,189 × 4
#>    time                duration waiting loo_fscores
#>    <dttm>                 <dbl>   <dbl>       <dbl>
#>  1 2015-01-02 14:53:00      271    5040      0.0594
#>  2 2015-01-09 23:55:00      247    6060      0.670 
#>  3 2015-02-07 00:49:00      203    5460      0.210 
#>  4 2015-02-14 01:09:00      195    5221      0.108 
#>  5 2015-02-21 01:12:00      210    5401      0.284 
#>  6 2015-02-28 01:11:00      185    5520      0.0690
#>  7 2015-03-07 00:50:00      160    5281      0.0101
#>  8 2015-03-13 21:57:00      226    6000      0.405 
#>  9 2015-03-13 23:37:00      190    5341      0.0886
#> 10 2015-03-20 22:26:00      102    3961      0.0905
#> # ℹ 2,179 more rows
# Univariate data
tibble(
  y = c(5, rnorm(49)),
  p_kde = surprisals(y, loo = TRUE),
  p_normal = surprisals(y, distribution = dist_normal()),
  p_zscore = 2*(1-pnorm(abs(y)))
)
#> # A tibble: 50 × 4
#>          y   p_kde    p_normal    p_zscore
#>      <dbl>   <dbl>       <dbl>       <dbl>
#>  1  5      0.00102 0.000000573 0.000000573
#>  2  0.213  0.890   0.831       0.831      
#>  3  0.519  0.848   0.604       0.604      
#>  4  2.00   0.246   0.0450      0.0450     
#>  5  1.07   0.603   0.285       0.285      
#>  6 -0.524  0.604   0.600       0.600      
#>  7  0.730  0.759   0.466       0.466      
#>  8  1.07   0.601   0.284       0.284      
#>  9  0.884  0.688   0.377       0.377      
#> 10 -0.0990 0.781   0.921       0.921      
#> # ℹ 40 more rows
tibble(
  y = n01$v1,
  prob1 = surprisals(y, loo = TRUE),
  prob2 = surprisals(y, approximation = "gpd"),
  prob3 = surprisals(y, distribution = dist_normal()),
  prob4 = surprisals(y, distribution = dist_normal(), approximation = "gpd")
) |>
  arrange(prob1)
#> # A tibble: 1,000 × 5
#>        y    prob1    prob2    prob3    prob4
#>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#>  1  3.81 0.000675 0.000113 0.000139 0.000135
#>  2  3.06 0.00694  0.00194  0.00225  0.00261 
#>  3 -3.01 0.0109   0.00362  0.00263  0.00308 
#>  4 -3.00 0.0112   0.00376  0.00273  0.00320 
#>  5 -2.94 0.0129   0.00456  0.00328  0.00389 
#>  6 -2.89 0.0146   0.00538  0.00387  0.00461 
#>  7  2.68 0.0199   0.00810  0.00746  0.00913 
#>  8  2.65 0.0214   0.00887  0.00807  0.00991 
#>  9 -2.60 0.0283   0.0128   0.00943  0.0116  
#> 10 -2.59 0.0285   0.0130   0.00953  0.0118  
#> # ℹ 990 more rows
# Bivariate data
tibble(
  x = rnorm(50),
  y = c(5, rnorm(49)),
  prob = surprisals(cbind(x, y)),
  lookout = lookout_prob(cbind(x, y))
)
#> # A tibble: 50 × 4
#>         x       y  prob lookout
#>     <dbl>   <dbl> <dbl>   <dbl>
#>  1 -1.82   5       0.02 0.00171
#>  2  0.172  0.988   0.68 1      
#>  3  0.138  0.171   1    1      
#>  4  0.815 -0.735   0.72 1      
#>  5 -1.47   0.183   0.38 1      
#>  6  0.439  0.822   0.78 1      
#>  7  0.674  0.101   0.96 1      
#>  8 -0.601  0.763   0.7  1      
#>  9  1.98  -0.0791  0.34 1      
#> 10  0.175  0.658   0.86 1      
#> # ℹ 40 more rows