Surprisal probabilities — surprisal

Compute the probability of a surprisal at least as extreme as those observed. A surprisal is given by \(-\log(f)\) where \(f\) is the density or probability mass function of the distribution. The surprisal values are computed from the distribution provided. If no distribution is provided, a kernel density estimate is used.

Usage

surprisal_prob(
  object,
  distribution = NULL,
  loo = FALSE,
  GPD = FALSE,
  threshold_probability = 0.1,
  ...
)

Arguments

object: A model or numerical data set.
distribution: A probability distribution stored as a distributional object. Ignored if object is a model.
loo: Logical value specifying if leave-one-out surprisals should be computed.
GPD: Logical value specifying if a Generalized Pareto distribution should be used to estimate the probabilities.
threshold_probability: Probability threshold when computing the GPD distribution for the surprisals.
...: Other arguments are passed to surprisals.

Details

The surprisal probabilities may be computed in three different ways.

Given the same distribution that was used to compute the surprisal values. Under this option, surprisal probabilities are equal to 1 minus the coverage probability of the largest HDR that contains each value. Surprisal probabilities smaller than 1e-6 are returned as 1e-6.
Using a Generalized Pareto Distribution fitted to the most extreme surprisal values (those with probability less than threshold_probability). This option is used if GPD = TRUE. For surprisal values with probability less than threshold_probability, the value of threshold_probability is returned. Under this option, the distribution is used for computing the surprisal values but not for determining their probabilities. Due to extreme value theory, the resulting probabilities should be relatively insensitive to the distribution used in computing the surprisal values.
Empirically as the proportion of observations with greater surprisal values. This option is used when GPD = FALSE and no distribution is explicitly provided. This is also insensitive to the distribution used in computing the surprisal values.

Examples

# Univariate data
tibble(
  y = c(5, rnorm(49)),
  p = surprisal_prob(y)
)
#> # A tibble: 50 × 2
#>          y      p
#>      <dbl>  <dbl>
#>  1  5      0.0200
#>  2 -0.189  0.98  
#>  3  0.0828 0.8   
#>  4 -0.352  0.82  
#>  5 -0.929  0.52  
#>  6 -0.583  0.66  
#>  7  0.227  0.72  
#>  8 -1.01   0.46  
#>  9  1.22   0.28  
#> 10 -0.313  0.86  
#> # ℹ 40 more rows
tibble(
  y = n01$v1,
  prob1 = surprisal_prob(y),
  prob2 = surprisal_prob(y, GPD = TRUE),
  prob3 = surprisal_prob(y, dist_normal()),
  prob4 = surprisal_prob(y, dist_normal(), GPD = TRUE)
) |>
  arrange(prob1)
#> # A tibble: 1,000 × 5
#>        y   prob1    prob2    prob3    prob4
#>    <dbl>   <dbl>    <dbl>    <dbl>    <dbl>
#>  1  3.81 0.00100 0.000113 0.000139 0.000135
#>  2  3.06 0.00200 0.00194  0.00225  0.00261 
#>  3 -3.01 0.00300 0.00362  0.00263  0.00308 
#>  4 -3.00 0.00400 0.00376  0.00273  0.00320 
#>  5 -2.94 0.00500 0.00456  0.00328  0.00389 
#>  6 -2.89 0.00600 0.00538  0.00387  0.00461 
#>  7  2.68 0.00700 0.00810  0.00746  0.00913 
#>  8  2.65 0.00800 0.00887  0.00807  0.00991 
#>  9 -2.60 0.00900 0.0128   0.00943  0.0116  
#> 10 -2.59 0.0100  0.0130   0.00953  0.0118  
#> # ℹ 990 more rows
# Bivariate data
tibble(
  x = rnorm(50),
  y = c(5, rnorm(49)),
  lookout = lookout_prob(cbind(x, y))
)
#> # A tibble: 50 × 3
#>          x       y lookout
#>      <dbl>   <dbl>   <dbl>
#>  1  0.853   5       0.0390
#>  2 -0.393   0.471   1     
#>  3  0.817   0.746   1     
#>  4 -1.24    1.94    0.478 
#>  5 -0.464  -0.0194  1     
#>  6 -1.00   -0.390   1     
#>  7 -0.926  -1.82    1     
#>  8 -0.0253  0.247   1     
#>  9  1.41    0.779   1     
#> 10  0.453   0.468   1     
#> # ℹ 40 more rows
# Using a regression model
of <- oldfaithful |> filter(duration < 7200, waiting < 7200)
fit_of <- lm(waiting ~ duration, data = of)
of |>
  mutate(p = surprisal_prob(fit_of)) |>
  arrange(p)
#> # A tibble: 2,197 × 4
#>    time                duration waiting        p
#>    <dttm>                 <dbl>   <dbl>    <dbl>
#>  1 2018-04-25 19:08:00        1    5700 0.000455
#>  2 2020-06-01 21:04:00      120    6060 0.000910
#>  3 2021-08-13 22:19:23      210    6971 0.00137 
#>  4 2020-10-15 17:11:00      220    7080 0.00182 
#>  5 2016-11-11 14:23:00      180    6480 0.00228 
#>  6 2021-07-26 18:35:39      192    6618 0.00273 
#>  7 2017-02-25 00:53:00      201    6720 0.00319 
#>  8 2015-06-17 23:06:00      210    6780 0.00364 
#>  9 2021-05-21 23:21:09      222    6891 0.00410 
#> 10 2020-09-16 14:44:00      160    6120 0.00455 
#> # ℹ 2,187 more rows