Compute leave-one-out log score probabilities using a Generalized Pareto distribution. A low probability indicates that the observation is a likely anomaly.

Usage

lookout(
  object = NULL,
  density_scores = NULL,
  loo_scores = density_scores,
  threshold_probability = 0.95
)

Arguments

object

A model object or a numerical data set.

density_scores

Numerical vector of log scores.

loo_scores

Optional numerical vector of leave-one-out log scores. Defaults to density_scores.

threshold_probability

Probability threshold used when fitting the peaks-over-threshold (POT) model to the log scores.

Value

A numerical vector of lookout probabilities, one for each observation.

Details

This function can work with several object types. If object is not NULL, it is passed to the density_scores() function to compute density scores (and possibly leave-one-out (LOO) density scores). Otherwise, the density scores are taken from the density_scores argument, and the LOO density scores are taken from the loo_scores argument. A Generalized Pareto distribution is then fitted to the scores to obtain the lookout probability of each observation.
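
The fitting step described above can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation. The helper names gpd_nll(), gpd_survival() and lookout_sketch() are hypothetical. The scores are a stand-in computed from a robust normal fit rather than from density_scores(). The Generalized Pareto distribution is fitted to the exceedances over the threshold by a simple maximum likelihood search with optim().

```r
# Negative log-likelihood of a Generalized Pareto distribution (GPD) for
# exceedances z > 0. The scale is parameterised on the log scale so it stays
# positive. (Hypothetical helper, not part of any package.)
gpd_nll <- function(par, z) {
  scale <- exp(par[1])
  shape <- par[2]
  if (abs(shape) < 1e-8) {
    # Exponential limit as shape tends to 0
    return(length(z) * log(scale) + sum(z) / scale)
  }
  w <- 1 + shape * z / scale
  if (any(w <= 0)) {
    return(1e10)  # large penalty: parameters put some data outside the support
  }
  length(z) * log(scale) + (1 / shape + 1) * sum(log(w))
}

# GPD survival function P(S > x) for scores x, given a threshold.
# Scores at or below the threshold get probability 1.
gpd_survival <- function(x, threshold, scale, shape) {
  z <- pmax(x - threshold, 0)
  if (abs(shape) < 1e-8) {
    exp(-z / scale)
  } else {
    pmax(1 + shape * z / scale, 0)^(-1 / shape)
  }
}

# Sketch of the peaks-over-threshold step: fit the GPD to the scores above the
# threshold, then evaluate the survival function at the leave-one-out scores.
# Assumption: the threshold is the default (type 7) sample quantile.
lookout_sketch <- function(scores, loo_scores = scores,
                           threshold_probability = 0.95) {
  threshold <- quantile(scores, probs = threshold_probability, names = FALSE)
  z <- scores[scores > threshold] - threshold
  fit <- optim(c(log(mean(z)), 0.1), gpd_nll, z = z)
  gpd_survival(loo_scores, threshold,
               scale = exp(fit$par[1]), shape = fit$par[2])
}

set.seed(1)
y <- c(5, rnorm(199))
# Stand-in scores (assumption): negative log density under a robust normal fit
scores <- -dnorm(y, mean = median(y), sd = mad(y), log = TRUE)
p <- lookout_sketch(scores)
round(head(p), 3)
```

In this sketch every score at or below the threshold receives probability 1, and the observation with the largest score receives the smallest probability.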

References

Sevvandi Kandanaarachchi & Rob J Hyndman (2022) "Leave-one-out kernel density estimates for outlier detection", J Computational & Graphical Statistics, 31(2), 586-599. https://robjhyndman.com/publications/lookout/

Author

Rob J Hyndman

Examples

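library(weird) # assumed source of lookout(), density_scores() and oldfaithful
library(dplyr) # tibble(), filter(), mutate(), arrange()

# Univariate data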
tibble(
  y = c(5, rnorm(49)),
  lookout = lookout(y)
)
#> # A tibble: 50 × 2
#>          y lookout
#>      <dbl>   <dbl>
#>  1  5            0
#>  2 -1.03         1
#>  3  0.580        1
#>  4 -1.03         1
#>  5  0.731        1
#>  6  0.0221       1
#>  7  1.13         1
#>  8  0.0501       1
#>  9  1.08         1
#> 10  1.22         1
#> # ℹ 40 more rows
# Bivariate data with score calculation done outside the function
tibble(
  x = rnorm(50),
  y = c(5, rnorm(49)),
  fscores = density_scores(y),
  loo_fscores = density_scores(y, loo = TRUE),
  lookout = lookout(density_scores = fscores, loo_scores = loo_fscores)
)
#> # A tibble: 50 × 5
#>         x      y fscores loo_fscores lookout
#>     <dbl>  <dbl>   <dbl>       <dbl>   <dbl>
#>  1  0.385  5        4.79        6.62       0
#>  2  1.53   0.486    1.42        1.43       1
#>  3  1.10  -0.817    1.49        1.50       1
#>  4  1.56   2.31     2.48        2.55       1
#>  5 -0.140 -0.400    1.39        1.39       1
#>  6 -0.134 -1.00     1.55        1.57       1
#>  7  0.451  0.183    1.37        1.38       1
#>  8  1.64  -0.751    1.46        1.47       1
#>  9  0.792 -0.550    1.41        1.42       1
#> 10  0.221  0.941    1.56        1.58       1
#> # ℹ 40 more rows
# Using a regression model
of <- oldfaithful |> filter(duration < 7200, waiting < 7200)
fit_of <- lm(waiting ~ duration, data = of)
of |>
  mutate(lookout_prob = lookout(fit_of)) |>
  arrange(lookout_prob)
#> # A tibble: 2,197 × 4
#>    time                duration waiting lookout_prob
#>    <dttm>                 <dbl>   <dbl>        <dbl>
#>  1 2018-04-25 19:08:00        1    5700     0.000990
#>  2 2020-06-01 21:04:00      120    6060     0.0192  
#>  3 2021-08-13 22:19:23      210    6971     0.0356  
#>  4 2020-10-15 17:11:00      220    7080     0.0371  
#>  5 2016-11-11 14:23:00      180    6480     0.0572  
#>  6 2021-07-26 18:35:39      192    6618     0.0587  
#>  7 2017-02-25 00:53:00      201    6720     0.0603  
#>  8 2015-06-17 23:06:00      210    6780     0.0728  
#>  9 2021-05-21 23:21:09      222    6891     0.0833  
#> 10 2020-09-16 14:44:00      160    6120     0.0908  
#> # ℹ 2,187 more rows
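
Because low lookout probabilities indicate likely anomalies, one way to use the result is to flag observations below a cutoff. The sketch below uses the ten probabilities printed in the regression example above. The 0.05 cutoff is an assumption chosen for illustration, not a value prescribed by the function.

```r
# Lookout probabilities copied from the first ten rows of the regression example
lookout_prob <- c(0.000990, 0.0192, 0.0356, 0.0371, 0.0572,
                  0.0587, 0.0603, 0.0728, 0.0833, 0.0908)

cutoff <- 0.05  # assumed cutoff for illustration
flagged <- which(lookout_prob < cutoff)
flagged
#> [1] 1 2 3 4
```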