Compute the probability of a surprisal at least as extreme as those observed.
A surprisal is given by \(-\log(f)\) where \(f\) is the density or
probability mass function of the distribution.
The surprisal values are computed from the distribution provided. If no
distribution is provided, a kernel density estimate is used.
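The definition above can be illustrated with a small base R sketch. It assumes a standard normal density in place of the kernel density estimate, and the helper empirical_prob is hypothetical (it is not a function from any package); it simply counts the proportion of observations whose surprisal is at least as large as each observed one.

```r
# Surprisals of a sample under an assumed N(0, 1) density: s = -log f(y)
set.seed(1)
y <- c(5, rnorm(49))
s <- -dnorm(y, log = TRUE)

# Hypothetical helper: proportion of observations with a surprisal
# at least as large as each observed surprisal
empirical_prob <- function(s) {
  vapply(s, function(si) mean(s >= si), numeric(1))
}

p <- empirical_prob(s)
p[1] # 0.02, i.e. 1/50, because the value 5 has the largest surprisal
```

Because the empirical probability depends only on the ranks of the surprisals, any density that orders the observations the same way would give the same result.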
Usage
surprisal_prob(
  object,
  distribution = NULL,
  loo = FALSE,
  GPD = FALSE,
  threshold_probability = 0.1,
  ...
)
Arguments
- object
A model or numerical data set.
- distribution
A probability distribution stored as a distributional object. Ignored if object is a model.
- loo
Logical value specifying if leave-one-out surprisals should be computed.
- GPD
Logical value specifying if a Generalized Pareto distribution should be used to estimate the probabilities.
- threshold_probability
Probability threshold used when fitting the GPD to the surprisals.
- ...
Other arguments are passed to surprisals.
Details
The surprisal probabilities may be computed in three different ways.
- Given the same distribution that was used to compute the surprisal values. Under this option, surprisal probabilities are equal to 1 minus the coverage probability of the smallest highest density region (HDR) that contains each value. Surprisal probabilities smaller than 1e-6 are returned as 1e-6.
- Using a Generalized Pareto Distribution (GPD) fitted to the most extreme surprisal values (those with probability less than threshold_probability). This option is used if GPD = TRUE. For surprisal values with probability greater than threshold_probability, the value of threshold_probability is returned. Under this option, the distribution is used for computing the surprisal values but not for determining their probabilities. By extreme value theory, the resulting probabilities should be relatively insensitive to the distribution used in computing the surprisal values.
- Empirically, as the proportion of observations with surprisal values at least as large. This option is used when GPD = FALSE and no distribution is explicitly provided. This is also insensitive to the distribution used in computing the surprisal values.
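The three approaches can be sketched in base R as follows. This is an illustration only, not the package's implementation: it assumes N(0, 1) data, a hypothetical helper gpd_tail_prob that fits the GPD by maximum likelihood with optim, and an arbitrarily chosen quantile type for the threshold.

```r
set.seed(2024)
y <- rnorm(1000)
s <- -dnorm(y, log = TRUE) # surprisals under an assumed N(0, 1) density

# 1. From the distribution: for N(0, 1) the surprisal increases with |y|,
#    so P(S >= s(y)) = P(|Y| >= |y|), floored at 1e-6
p_dist <- pmax(2 * pnorm(abs(y), lower.tail = FALSE), 1e-6)

# 2. GPD fitted to surprisals above the (1 - threshold_probability) quantile
#    (hypothetical helper; the quantile type is an arbitrary choice)
gpd_tail_prob <- function(s, threshold_probability = 0.1) {
  u <- quantile(s, 1 - threshold_probability, type = 8, names = FALSE)
  exc <- s[s > u] - u
  # Negative log-likelihood with scale sigma = exp(par[1]) and shape xi = par[2]
  nll <- function(par) {
    sigma <- exp(par[1])
    xi <- par[2]
    z <- 1 + xi * exc / sigma
    if (any(z <= 0)) return(1e10)
    if (abs(xi) < 1e-8) return(length(exc) * log(sigma) + sum(exc) / sigma)
    length(exc) * log(sigma) + (1 / xi + 1) * sum(log(z))
  }
  fit <- optim(c(log(mean(exc)), 0.1), nll)
  sigma <- exp(fit$par[1])
  xi <- fit$par[2]
  # Fitted GPD survival function at each excess; equals 1 below the threshold
  e <- pmax(s - u, 0)
  surv <- if (abs(xi) < 1e-8) {
    exp(-e / sigma)
  } else {
    pmax(1 + xi * e / sigma, 0)^(-1 / xi)
  }
  threshold_probability * surv
}
p_gpd <- gpd_tail_prob(s)

# 3. Empirical: proportion of observations with surprisal at least as large
p_emp <- rank(-s, ties.method = "max") / length(s)

# Compare the three estimates for the five most surprising observations
head(cbind(y, p_dist, p_gpd, p_emp)[order(p_emp), ], 5)
```

In this sketch, observations below the threshold all receive threshold_probability under the GPD approach, so that approach is only informative in the tail.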
Examples
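# Packages assumed by these examples (weird for the functions and data,
# dplyr for tibble and the data verbs, distributional for dist_normal)
library(weird)
library(dplyr)
library(distributional)
# Univariate data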
tibble(
  y = c(5, rnorm(49)),
  p = surprisal_prob(y)
)
#> # A tibble: 50 × 2
#>          y      p
#>      <dbl>  <dbl>
#>  1  5      0.0200
#>  2 -0.189  0.98
#>  3  0.0828 0.8
#>  4 -0.352  0.82
#>  5 -0.929  0.52
#>  6 -0.583  0.66
#>  7  0.227  0.72
#>  8 -1.01   0.46
#>  9  1.22   0.28
#> 10 -0.313  0.86
#> # ℹ 40 more rows
tibble(
  y = n01$v1,
  prob1 = surprisal_prob(y),
  prob2 = surprisal_prob(y, GPD = TRUE),
  prob3 = surprisal_prob(y, dist_normal()),
  prob4 = surprisal_prob(y, dist_normal(), GPD = TRUE)
) |>
  arrange(prob1)
#> # A tibble: 1,000 × 5
#>        y   prob1    prob2    prob3    prob4
#>    <dbl>   <dbl>    <dbl>    <dbl>    <dbl>
#>  1  3.81 0.00100 0.000113 0.000139 0.000135
#>  2  3.06 0.00200 0.00194  0.00225  0.00261
#>  3 -3.01 0.00300 0.00362  0.00263  0.00308
#>  4 -3.00 0.00400 0.00376  0.00273  0.00320
#>  5 -2.94 0.00500 0.00456  0.00328  0.00389
#>  6 -2.89 0.00600 0.00538  0.00387  0.00461
#>  7  2.68 0.00700 0.00810  0.00746  0.00913
#>  8  2.65 0.00800 0.00887  0.00807  0.00991
#>  9 -2.60 0.00900 0.0128   0.00943  0.0116
#> 10 -2.59 0.0100  0.0130   0.00953  0.0118
#> # ℹ 990 more rows
# Bivariate data
tibble(
  x = rnorm(50),
  y = c(5, rnorm(49)),
  lookout = lookout_prob(cbind(x, y))
)
#> # A tibble: 50 × 3
#>          x       y lookout
#>      <dbl>   <dbl>   <dbl>
#>  1  0.853   5      0.0390
#>  2 -0.393   0.471  1
#>  3  0.817   0.746  1
#>  4 -1.24    1.94   0.478
#>  5 -0.464  -0.0194 1
#>  6 -1.00   -0.390  1
#>  7 -0.926  -1.82   1
#>  8 -0.0253  0.247  1
#>  9  1.41    0.779  1
#> 10  0.453   0.468  1
#> # ℹ 40 more rows
# Using a regression model
of <- oldfaithful |> filter(duration < 7200, waiting < 7200)
fit_of <- lm(waiting ~ duration, data = of)
of |>
  mutate(p = surprisal_prob(fit_of)) |>
  arrange(p)
#> # A tibble: 2,197 × 4
#>    time                duration waiting        p
#>    <dttm>                 <dbl>   <dbl>    <dbl>
#>  1 2018-04-25 19:08:00        1    5700 0.000455
#>  2 2020-06-01 21:04:00      120    6060 0.000910
#>  3 2021-08-13 22:19:23      210    6971 0.00137
#>  4 2020-10-15 17:11:00      220    7080 0.00182
#>  5 2016-11-11 14:23:00      180    6480 0.00228
#>  6 2021-07-26 18:35:39      192    6618 0.00273
#>  7 2017-02-25 00:53:00      201    6720 0.00319
#>  8 2015-06-17 23:06:00      210    6780 0.00364
#>  9 2021-05-21 23:21:09      222    6891 0.00410
#> 10 2020-09-16 14:44:00      160    6120 0.00455
#> # ℹ 2,187 more rows
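For readers without the package installed, the regression example can be approximated in base R. This sketch rests on two assumptions that are not taken from the documentation above: the built-in faithful data stands in for oldfaithful (its variables and units differ), and the surprisals are computed from a Gaussian error model with the fitted means and the residual standard deviation.

```r
# Base R's faithful data is used as a stand-in for oldfaithful (an assumption)
fit <- lm(waiting ~ eruptions, data = faithful)

# Surprisals of the responses under an assumed N(fitted value, sigma^2) model
s <- -dnorm(faithful$waiting, mean = fitted(fit), sd = sigma(fit), log = TRUE)

# Empirical probability: proportion of observations with surprisal at least as large
p <- rank(-s, ties.method = "max") / length(s)

# The three observations that are most surprising given the regression
head(cbind(faithful, p)[order(p), ], 3)
```

As in the example above, the smallest possible probability is 1 divided by the number of observations, so this empirical approach cannot flag an observation more strongly than that.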