
Commit

pcaout
mlesnoff committed Dec 25, 2024
1 parent 8cbbbd6 commit fa98d1d
Showing 7 changed files with 31 additions and 33 deletions.
2 changes: 1 addition & 1 deletion docs/src/domains.md
Original file line number Diff line number Diff line change
@@ -44,7 +44,7 @@

### Factorial discriminant analysis (FDA)

- **fda** Eigen decomposition of the compromise "inter/intra"
- **fda** Eigen decomposition of the consensus "inter/intra"
- **fdasvd** Weighted SVD of the class centers

### Multiblock
2 changes: 1 addition & 1 deletion docs/src/news.md
@@ -703,7 +703,7 @@ the Makie backend (e.g. CairoMakie).

- New
- **colmad**: Median absolute deviation (MAD) of each column of a matrix.
- **occsdod**: One-class classification using a compromise between PCA/PLS score (SD) and orthogonal (OD) distances.
- **occsdod**: One-class classification using a consensus between PCA/PLS score (SD) and orthogonal (OD) distances.
- **replacedict**: Replace the elements of a vector by levels defined in a dictionary.
- **stah**: Stahel-Donoho outlierness measure.

2 changes: 1 addition & 1 deletion src/fda.jl
@@ -24,7 +24,7 @@ FDA by eigen factorization of Inverse(W) * B, where W is
the "Within"-covariance matrix (pooled over the classes),
and B the "Between"-covariance matrix.
The function maximizes the compromise:
The function maximizes the consensus:
* p'Bp / p'Wp
i.e. max p'Bp with the constraint p'Wp = 1. Vectors p (columns of `P`)
are the linear discriminant coefficients, often referred to as "LD".
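The criterion above can be illustrated with a toy two-class sketch (not the package implementation; all names below are illustrative): maximizing p'Bp / p'Wp amounts to taking the leading eigenvector of Inverse(W) * B.

```julia
# Toy FDA sketch: maximize p'Bp / p'Wp via eigen decomposition of inv(W) * B.
using LinearAlgebra, Statistics

X1 = randn(30, 3) .+ 1.0                  # class 1
X2 = randn(30, 3) .- 1.0                  # class 2
m1 = vec(mean(X1, dims = 1))
m2 = vec(mean(X2, dims = 1))
m = (m1 + m2) / 2                         # overall center (balanced classes)
B = ((m1 - m) * (m1 - m)' + (m2 - m) * (m2 - m)') / 2   # "Between" covariance
W = (cov(X1) + cov(X2)) / 2               # "Within" covariance (pooled)
vals, vecs = eigen(inv(W) * B)
p = real.(vecs[:, argmax(real.(vals))])   # leading discriminant direction ("LD1")
ratio = (p' * B * p) / (p' * W * p)       # the maximized quotient p'Bp / p'Wp
```

By construction, `ratio` is at least as large as the quotient obtained with any other direction vector.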
16 changes: 7 additions & 9 deletions src/occod.jl
@@ -2,18 +2,18 @@
occod(; kwargs...)
occod(fitm, X; kwargs...)
One-class classification using PCA/PLS orthogonal distance (OD).
* `fitm` : The preliminary model (e.g. PCA; object `fitm`) that was fitted
on the training data assumed to represent the training class.
* `X` : Training X-data (n, p), on which was fitted
the model `fitm`.
* `fitm` : The preliminary model (e.g. object `fitm` returned by function
`pcasvd`) that was fitted on the training data assumed to represent
the training class.
* `X` : Training X-data (n, p), on which was fitted the model `fitm`.
Keyword arguments:
* `mcut` : Type of cutoff. Possible values are: `:mad`, `:q`.
See thereafter.
* `cri` : When `mcut` = `:mad`, a constant. See thereafter.
* `risk` : When `mcut` = `:q`, a risk-I level. See thereafter.
In this method, the outlierness `d` of an observation
is the orthogonal distance (OD = "X-residuals") of this
is the orthogonal distance (= 'X-residuals') of this
observation, i.e. the Euclidean distance between the observation
and its projection on the score plane defined by the fitted
(e.g. PCA) model (e.g. Hubert et al. 2005, Van Branden & Hubert
@@ -117,8 +117,7 @@ function occod(fitm, X; kwargs...)
par = recovkw(ParOcc, kwargs).par
@assert 0 <= par.risk <= 1 "Argument 'risk' must ∈ [0, 1]."
E = xresid(fitm, X)
d2 = vec(sum(E .* E, dims = 2))
d = sqrt.(d2)
d = rownorm(E)
par.mcut == :mad ? cutoff = median(d) + par.cri * madv(d) : nothing
par.mcut == :q ? cutoff = quantile(d, 1 - par.risk) : nothing
e_cdf = StatsBase.ecdf(d)
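The OD logic in this hunk can be sketched with stdlib tools only (a minimal sketch, assuming a plain SVD-based PCA; `Xtrain`, `nlv`, `cri`, `risk` are illustrative names, not the package API):

```julia
# Orthogonal distances (OD) and the two cutoff types, stdlib only.
using LinearAlgebra, Statistics

Xtrain = randn(100, 10)
nlv = 3                                   # nb. principal components
xmeans = vec(mean(Xtrain, dims = 1))
Xc = Xtrain .- xmeans'                    # column-centered data
V = svd(Xc).V[:, 1:nlv]                   # loadings
E = Xc - Xc * V * V'                      # X-residuals (orthogonal part)
d = [norm(r) for r in eachrow(E)]         # OD = row norms of E (cf. rownorm)

# :mad-type cutoff: median(d) + cri * MAD(d), with the scaled MAD
cri = 3
madd = 1.4826 * median(abs.(d .- median(d)))
cutoff_mad = median(d) + cri * madd
# :q-type cutoff: empirical quantile at a risk-I level
risk = 0.05
cutoff_q = quantile(d, 1 - risk)
dstand = d / cutoff_mad                   # <= 1 => "in", > 1 => "out"
```

The `rownorm(E)` call in the diff computes the same row-wise Euclidean norms as the comprehension above, replacing the earlier `sqrt.(sum(E .* E, dims = 2))` form.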
@@ -136,8 +135,7 @@ Compute predictions from a fitted model.
function predict(object::Occod, X)
E = xresid(object.fitm, X)
m = nro(E)
d2 = vec(sum(E .* E, dims = 2))
d = sqrt.(d2)
d = rownorm(E)
p_val = pval(object.e_cdf, d)
d = DataFrame(d = d, dstand = d / object.cutoff, pval = p_val)
pred = [if d.dstand[i] <= 1 "in" else "out" end for i = 1:m]
5 changes: 3 additions & 2 deletions src/occsd.jl
@@ -2,8 +2,9 @@
occsd(; kwargs...)
occsd(fitm; kwargs...)
One-class classification using PCA/PLS score distance (SD).
* `fitm` : The preliminary model (e.g. PCA; object `fitm`) that was fitted
on the training data assumed to represent the training class.
* `fitm` : The preliminary model (e.g. object `fitm` returned by function
`pcasvd`) that was fitted on the training data assumed to represent
the training class.
Keyword arguments:
* `mcut` : Type of cutoff. Possible values are: `:mad`, `:q`.
See thereafter.
20 changes: 10 additions & 10 deletions src/occsdod.jl
Original file line number Diff line number Diff line change
@@ -1,21 +1,21 @@
"""
occsdod(; kwargs...)
occsdod(object, X; kwargs...)
One-class classification using a compromise between
PCA/PLS score (SD) and orthogonal (OD) distances.
* `fitm` : The preliminary model (e.g. PCA; object `fitm`) that was fitted
on the training data assumed to represent the training class.
* `X` : Training X-data (n, p), on which was fitted
the model `fitm`.
One-class classification using a consensus between
PCA/PLS score and orthogonal (SD and OD) distances.
* `fitm` : The preliminary model (e.g. object `fitm` returned by function
`pcasvd`) that was fitted on the training data assumed to represent
the training class.
* `X` : Training X-data (n, p), on which was fitted the model `fitm`.
Keyword arguments:
* `mcut` : Type of cutoff. Possible values are: `:mad`, `:q`.
See thereafter.
* `cri` : When `mcut` = `:mad`, a constant. See thereafter.
* `risk` : When `mcut` = `:q`, a risk-I level. See thereafter.
In this method, the outlierness `d` of a given observation
is a compromise between the score distance (SD) and the
orthogonal distance (OD). The compromise is computed from the
is a consensus between the score distance (SD) and the
orthogonal distance (OD). The consensus is computed from the
standardized distances by:
* `dstand` = sqrt(`dstand_sd` * `dstand_od`).
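This geometric-mean consensus is easy to sketch with toy values (illustrative only; in the package the standardized distances come from the `occsd` and `occod` fits):

```julia
# Consensus of standardized SD and OD distances (geometric mean).
dstand_sd = [0.5, 1.2, 0.9]               # toy standardized score distances
dstand_od = [0.8, 1.5, 0.7]               # toy standardized orthogonal distances
dstand = sqrt.(dstand_sd .* dstand_od)    # consensus distance
pred = [dstand[i] <= 1 ? "in" : "out" for i in eachindex(dstand)]
# pred == ["in", "out", "in"]
```

An observation is declared "out" as soon as the consensus distance exceeds 1, i.e. when the product of the two standardized distances exceeds 1.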
@@ -31,7 +31,7 @@ function occsdod(fitm, X; kwargs...)
fitmod = occod(fitm, X; kwargs...)
sd = fitmsd.d
od = fitmod.d
z = sqrt.(sd.dstand .* od.dstand)
z = [sqrt(sd.dstand[i] * od.dstand[i]) for i in eachindex(sd.dstand)]
nam = string.(names(sd), "_sd")
rename!(sd, nam)
nam = string.(names(od), "_od")
@@ -56,7 +56,7 @@ function predict(object::Occsdod, X)
nam = string.(names(od), "_od")
rename!(od, nam)
d = hcat(sd, od)
d.dstand = sqrt.(sd.dstand_sd .* od.dstand_od)
d.dstand = [sqrt(sd.dstand_sd[i] * od.dstand_od[i]) for i in eachindex(sd.dstand_sd)]
pred = [if d.dstand[i] <= 1 "in" else "out" end for i = 1:m]
pred = reshape(pred, m, 1)
(pred = pred, d)
17 changes: 8 additions & 9 deletions src/pcaout.jl
@@ -17,18 +17,18 @@ Keyword arguments:
Robust PCA combining outlyingness measures and weighted PCA (WPCA).
Observations (`X`-rows) receive weights depending on outlyingness
with the objective to remove the effect of multivariate `X`-outliers
that have potentially bad leverages.
The objective is to remove the effect of multivariate `X`-outliers
that have potentially bad leverages. Observations (`X`-rows) receive
weights depending on two outlyingness indicators:
1) The Stahel-Donoho outlyingness (Maronna and Yohai, 1995) is computed
(function `outstah`). The proportion `prm` of the observations with the
(function `outstah`) on `X`. The proportion `prm` of the observations with the
highest outlyingness values receive a weight w1 = 0 (the others receive a weight w1 = 1).
2) An outlyingness computed from the Euclidean distance to center
2) An outlyingness based on the Euclidean distance to center
(function `outeucl`) is computed. The proportion `prm` of the observations
with the highest outlyingness values receive a weight w2 = 0
(the others receive a weight w2 = 1).
3) The final weights of the observations are computed by weights.w * w1 * w2
that is used in a weighted PCA.
The final weights of the observations are computed as weights.w * w1 * w2
and are used in a weighted PCA.
By default, the function uses `prm = .3` (such as in the ROBPCA algorithm
of Hubert et al. 2005, 2009).
@@ -92,8 +92,7 @@ function pcaout!(X::Matrix, weights::Weight; kwargs...)
n, p = size(X)
nlvout = 30
P = rand(0:1, p, nlvout)
d = similar(X, n)
d .= outstah(X, P; scal = par.scal).d
d = outstah(X, P; scal = par.scal).d
w = wtal(d; a = quantile(d, 1 - par.prm))
d .= outeucl(X; scal = par.scal).d
w .*= wtal(d; a = quantile(d, 1 - par.prm))
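The double 0/1 weighting in this hunk can be sketched with the standard library only (a simplified sketch: the package's `outstah`, `outeucl`, and `wtal` are replaced here by hand-rolled equivalents, and all names are illustrative):

```julia
# Double outlyingness weighting: Stahel-Donoho-type step, then Euclidean step.
using LinearAlgebra, Statistics

X = randn(50, 5)
prm = 0.3                                 # proportion trimmed at each step
# (1) Stahel-Donoho-type outlyingness over random projection directions
P = randn(5, 30)                          # 30 random directions
d1 = zeros(50)
for j in axes(P, 2)
    t = X * P[:, j]                       # projected data
    madt = 1.4826 * median(abs.(t .- median(t)))
    d1 .= max.(d1, abs.(t .- median(t)) ./ madt)   # worst-case outlyingness
end
w1 = d1 .<= quantile(d1, 1 - prm)         # Talworth-type weights: 0 or 1
# (2) Euclidean distance to the column-mean center
center = vec(mean(X, dims = 1))
d2 = [norm(r - center) for r in eachrow(X)]
w2 = d2 .<= quantile(d2, 1 - prm)
w = w1 .* w2                              # final 0/1 weights for a weighted PCA
```

Observations flagged by either indicator get weight 0, so they are discarded from the subsequent weighted PCA fit.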
