Compute surprisals or surprisal probabilities from a model or a data set. A surprisal is given by \(s = -\log f(y)\) where \(f\) is the density or probability mass function of the estimated or assumed distribution, and \(y\) is an observation. A surprisal probability is the probability of a surprisal at least as extreme as \(s\).
The surprisal probabilities may be computed in three different ways.
Given the same distribution that was used to compute the surprisal values. Under this option, surprisal probabilities are equal to 1 minus the coverage probability of the largest HDR that contains each value. Surprisal probabilities smaller than 1e-6 are returned as 1e-6.
Using a Generalized Pareto Distribution fitted to the most extreme surprisal values (those with probability less than
threshold_probability
). This option is used ifapproximation = "gpd"
. For surprisal probabilities greater thanthreshold_probability
, the value ofthreshold_probability
is returned. Under this option, the distribution is used for computing the surprisal values but not for determining their probabilities. Due to extreme value theory, the resulting probabilities should be relatively insensitive to the distribution used in computing the surprisal values.Empirically as the proportion of observations with greater surprisal values. This option is used when
approxiation = "empirical"
. This is also insensitive to the distribution used in computing the surprisal values.
Usage
surprisals(
object,
probability = TRUE,
approximation = c("none", "gpd", "empirical"),
threshold_probability = 0.1,
...
)
# Default S3 method
surprisals(
object,
probability = TRUE,
approximation = c("none", "gpd", "empirical"),
threshold_probability = 0.1,
distribution = dist_kde(object, multiplier = 2, ...),
loo = FALSE,
...
)
Arguments
- object
A model or numerical data set
- probability
Should surprisal probabilities be computed, or the surprisal values?
- approximation
Character string specifying the approximation to use in computing the surprisal probabilities. Ignored if
probability = FALSE
. :none
specifies that no approximation is to be used;gpd
specifies that the Generalized Pareto distribution should be used; whileempirical
specifies that the probabilities should be estimated empirically.- threshold_probability
Probability threshold when computing the GPD approximation. This is the probability below which the GPD is fitted. Only used if
approximation = "gpd"
).- ...
Other arguments are passed to the appropriate method.
- distribution
A distribution object. If not provided, a kernel density estimate is computed from the data
object
.- loo
Should leave-one-out surprisals be computed?
Details
If no distribution is provided, a kernel density estimate is computed. The leave-one-out surprisals (or LOO surprisals) are obtained by estimating the kernel density estimate using all other observations.
Examples
# surprisals computed from bivariate data set
oldfaithful |>
filter(duration < 7000, waiting < 7000) |>
mutate(
loo_fscores = surprisals(cbind(duration, waiting), loo = TRUE)
)
#> # A tibble: 2,189 × 4
#> time duration waiting loo_fscores
#> <dttm> <dbl> <dbl> <dbl>
#> 1 2015-01-02 14:53:00 271 5040 0.0594
#> 2 2015-01-09 23:55:00 247 6060 0.670
#> 3 2015-02-07 00:49:00 203 5460 0.210
#> 4 2015-02-14 01:09:00 195 5221 0.108
#> 5 2015-02-21 01:12:00 210 5401 0.284
#> 6 2015-02-28 01:11:00 185 5520 0.0690
#> 7 2015-03-07 00:50:00 160 5281 0.0101
#> 8 2015-03-13 21:57:00 226 6000 0.405
#> 9 2015-03-13 23:37:00 190 5341 0.0886
#> 10 2015-03-20 22:26:00 102 3961 0.0905
#> # ℹ 2,179 more rows
# Univariate data
tibble(
y = c(5, rnorm(49)),
p_kde = surprisals(y, loo = TRUE),
p_normal = surprisals(y, distribution = dist_normal()),
p_zscore = 2 * (1 - pnorm(abs(y)))
)
#> # A tibble: 50 × 4
#> y p_kde p_normal p_zscore
#> <dbl> <dbl> <dbl> <dbl>
#> 1 5 0.00102 0.000000573 0.000000573
#> 2 0.213 0.890 0.831 0.831
#> 3 0.519 0.848 0.604 0.604
#> 4 2.00 0.246 0.0450 0.0450
#> 5 1.07 0.603 0.285 0.285
#> 6 -0.524 0.604 0.600 0.600
#> 7 0.730 0.759 0.466 0.466
#> 8 1.07 0.601 0.284 0.284
#> 9 0.884 0.688 0.377 0.377
#> 10 -0.0990 0.781 0.921 0.921
#> # ℹ 40 more rows
tibble(
y = n01$v1,
prob1 = surprisals(y, loo = TRUE),
prob2 = surprisals(y, approximation = "gpd"),
prob3 = surprisals(y, distribution = dist_normal()),
prob4 = surprisals(y, distribution = dist_normal(), approximation = "gpd")
) |>
arrange(prob1)
#> # A tibble: 1,000 × 5
#> y prob1 prob2 prob3 prob4
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 3.81 0.000675 0.000113 0.000139 0.000135
#> 2 3.06 0.00694 0.00194 0.00225 0.00261
#> 3 -3.01 0.0109 0.00362 0.00263 0.00308
#> 4 -3.00 0.0112 0.00376 0.00273 0.00320
#> 5 -2.94 0.0129 0.00456 0.00328 0.00389
#> 6 -2.89 0.0146 0.00538 0.00387 0.00461
#> 7 2.68 0.0199 0.00810 0.00746 0.00913
#> 8 2.65 0.0214 0.00887 0.00807 0.00991
#> 9 -2.60 0.0283 0.0128 0.00943 0.0116
#> 10 -2.59 0.0285 0.0130 0.00953 0.0118
#> # ℹ 990 more rows
# Bivariate data
tibble(
x = rnorm(50),
y = c(5, rnorm(49)),
prob = surprisals(cbind(x, y)),
lookout = lookout_prob(cbind(x, y))
)
#> # A tibble: 50 × 4
#> x y prob lookout
#> <dbl> <dbl> <dbl> <dbl>
#> 1 -1.82 5 0.02 0.00171
#> 2 0.172 0.988 0.68 1
#> 3 0.138 0.171 1 1
#> 4 0.815 -0.735 0.72 1
#> 5 -1.47 0.183 0.38 1
#> 6 0.439 0.822 0.78 1
#> 7 0.674 0.101 0.96 1
#> 8 -0.601 0.763 0.7 1
#> 9 1.98 -0.0791 0.34 1
#> 10 0.175 0.658 0.86 1
#> # ℹ 40 more rows