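Compute leave-one-out log score probabilities using a Generalized Pareto distribution. A low probability indicates that an observation is likely to be an anomaly: in the examples below, the outlying observation receives a probability of 0, while typical observations receive probabilities at or near 1.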
Usage
lookout(
  object = NULL,
  density_scores = NULL,
  loo_scores = density_scores,
  threshold_probability = 0.95
)
Arguments
- object
A model object or a numerical data set.
- density_scores
Numerical vector of log scores. Used only when object is NULL.
- loo_scores
Optional numerical vector of leave-one-out log scores. Defaults to density_scores.
- threshold_probability
Probability threshold used when fitting the peaks-over-threshold (POT) model to the log scores.
Details
This function can work with several object types. If object is not NULL, it is passed to density_scores() to compute density scores (and possibly LOO density scores). Otherwise, the density scores are taken from the density_scores argument, and the LOO density scores are taken from the loo_scores argument. A Generalized Pareto distribution is then fitted to the scores to obtain a probability for each observation.
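The steps described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the Gaussian kernel, the bw.nrd0() bandwidth, the hand-written likelihood gpd_nll(), and the choice to evaluate the fitted upper-tail probability at the leave-one-out scores are all assumptions made for the example.

```r
# Sketch only: a base-R approximation of the idea, not the package's code.
set.seed(2024)
y <- c(6, rnorm(999))          # one planted outlier followed by 999 typical points
n <- length(y)
h <- bw.nrd0(y)                # rule-of-thumb bandwidth (assumed choice)

# Gaussian kernel contributions K_h(y_i - y_j) for every pair of points
K <- dnorm(outer(y, y, "-"), sd = h)
K_loo <- K
diag(K_loo) <- 0               # drop each point's own contribution

scores <- -log(rowSums(K) / n)                # density scores
loo_scores <- -log(rowSums(K_loo) / (n - 1))  # leave-one-out scores

# Peaks over threshold: excesses above the 95% quantile of the scores
u <- unname(quantile(scores, 0.95))
z <- scores[scores > u] - u

# Negative log-likelihood of a GPD with scale exp(par[1]) and shape par[2]
gpd_nll <- function(par, z) {
  sigma <- exp(par[1])
  xi <- par[2]
  if (abs(xi) < 1e-8) return(length(z) * log(sigma) + sum(z) / sigma)
  t <- 1 + xi * z / sigma
  if (any(t <= 0)) return(1e10)  # outside the support: penalise
  length(z) * log(sigma) + (1 / xi + 1) * sum(log(t))
}
fit <- optim(c(log(mean(z)), 0.1), gpd_nll, z = z)
sigma <- exp(fit$par[1])
xi <- fit$par[2]

# Upper-tail GPD probability of each LOO score; scores at or below u get 1
excess <- pmax(loo_scores - u, 0)
p <- if (abs(xi) < 1e-8) {
  exp(-excess / sigma)
} else {
  pmax(1 + xi * excess / sigma, 0)^(-1 / xi)
}
```

In this sketch most observations fall below the threshold and receive a probability of exactly 1, which mirrors the pattern in the example output further down, while the planted outlier in position 1 receives the smallest probability.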
References
Sevvandi Kandanaarachchi & Rob J Hyndman (2022) "Leave-one-out kernel density estimates for outlier detection", J Computational & Graphical Statistics, 31(2), 586-599. https://robjhyndman.com/publications/lookout/
Examples
library(dplyr) # provides tibble(), filter(), mutate() and arrange()

# Univariate data
tibble(
  y = c(5, rnorm(49)),
  lookout = lookout(y)
)
#> # A tibble: 50 × 2
#>          y lookout
#>      <dbl>   <dbl>
#>  1  5            0
#>  2 -1.03         1
#>  3  0.580        1
#>  4 -1.03         1
#>  5  0.731        1
#>  6  0.0221       1
#>  7  1.13         1
#>  8  0.0501       1
#>  9  1.08         1
#> 10  1.22         1
#> # ℹ 40 more rows
# Bivariate data with score calculation done outside the function
# (as written, the scores are computed from y only)
tibble(
  x = rnorm(50),
  y = c(5, rnorm(49)),
  fscores = density_scores(y),
  loo_fscores = density_scores(y, loo = TRUE),
  lookout = lookout(density_scores = fscores, loo_scores = loo_fscores)
)
#> # A tibble: 50 × 5
#>         x      y fscores loo_fscores lookout
#>     <dbl>  <dbl>   <dbl>       <dbl>   <dbl>
#>  1  0.385  5        4.79        6.62       0
#>  2  1.53   0.486    1.42        1.43       1
#>  3  1.10  -0.817    1.49        1.50       1
#>  4  1.56   2.31     2.48        2.55       1
#>  5 -0.140 -0.400    1.39        1.39       1
#>  6 -0.134 -1.00     1.55        1.57       1
#>  7  0.451  0.183    1.37        1.38       1
#>  8  1.64  -0.751    1.46        1.47       1
#>  9  0.792 -0.550    1.41        1.42       1
#> 10  0.221  0.941    1.56        1.58       1
#> # ℹ 40 more rows
# Using a regression model
of <- oldfaithful |> filter(duration < 7200, waiting < 7200)
fit_of <- lm(waiting ~ duration, data = of)
of |>
  mutate(lookout_prob = lookout(fit_of)) |>
  arrange(lookout_prob)
#> # A tibble: 2,197 × 4
#>    time                duration waiting lookout_prob
#>    <dttm>                 <dbl>   <dbl>        <dbl>
#>  1 2018-04-25 19:08:00        1    5700     0.000990
#>  2 2020-06-01 21:04:00      120    6060     0.0192
#>  3 2021-08-13 22:19:23      210    6971     0.0356
#>  4 2020-10-15 17:11:00      220    7080     0.0371
#>  5 2016-11-11 14:23:00      180    6480     0.0572
#>  6 2021-07-26 18:35:39      192    6618     0.0587
#>  7 2017-02-25 00:53:00      201    6720     0.0603
#>  8 2015-06-17 23:06:00      210    6780     0.0728
#>  9 2021-05-21 23:21:09      222    6891     0.0833
#> 10 2020-09-16 14:44:00      160    6120     0.0908
#> # ℹ 2,187 more rows
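Once the probabilities are available, one possible way to turn them into a list of flagged observations is to apply a cutoff. The sketch below re-enters by hand the ten rows printed in the regression example; the cutoff of 0.05 and the names res, cutoff and flagged are illustrative assumptions, not values or objects taken from this page.

```r
# Ten rows copied by hand from the regression example output above
res <- data.frame(
  duration = c(1, 120, 210, 220, 180, 192, 201, 210, 222, 160),
  waiting = c(5700, 6060, 6971, 7080, 6480, 6618, 6720, 6780, 6891, 6120),
  lookout_prob = c(0.000990, 0.0192, 0.0356, 0.0371, 0.0572,
                   0.0587, 0.0603, 0.0728, 0.0833, 0.0908)
)

cutoff <- 0.05  # illustrative cutoff (an assumption)

# Keep the observations whose probability falls below the cutoff
flagged <- subset(res, lookout_prob < cutoff)
print(flagged)
```

With these ten rows and this cutoff, the first four observations are flagged. A smaller cutoff flags fewer observations, so the choice depends on how many false alarms are acceptable.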