
Commit

pcaout
mlesnoff committed Dec 25, 2024
1 parent 8cbbbd6 commit fa98d1d
Showing 7 changed files with 31 additions and 33 deletions.
2 changes: 1 addition & 1 deletion docs/src/domains.md
Original file line number Diff line number Diff line change
@@ -44,7 +44,7 @@

### Factorial discriminant analysis (FDA)

- **fda** Eigen decomposition of the compromise "inter/intra"
- **fda** Eigen decomposition of the consensus "inter/intra"
- **fdasvd** Weighted SVD of the class centers

### Multiblock
2 changes: 1 addition & 1 deletion docs/src/news.md
@@ -703,7 +703,7 @@ the Makie backend (e.g. CairoMakie).

- New
- **colmad**: Median absolute deviation (MAD) of each column of a matrix.
- **occsdod**: One-class classification using a compromise between PCA/PLS score (SD) and orthogonal (OD) distances.
- **occsdod**: One-class classification using a consensus between PCA/PLS score (SD) and orthogonal (OD) distances.
- **replacedict**: Replace the elements of a vector by levels defined in a dictionary.
- **stah**: Stahel-Donoho outlierness measure.

2 changes: 1 addition & 1 deletion src/fda.jl
@@ -24,7 +24,7 @@ FDA by eigen factorization of Inverse(W) * B, where W is
the "Within"-covariance matrix (pooled over the classes),
and B the "Between"-covariance matrix.
The function maximizes the compromise:
The function maximizes the consensus:
* p'Bp / p'Wp
i.e. max p'Bp with the constraint p'Wp = 1. Vectors p (columns of `P`)
are the linear discriminant coefficients, often referred to as "LD".
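The criterion above can be illustrated with a toy two-class sketch (not the package implementation; all names below are illustrative): maximizing p'Bp / p'Wp amounts to taking the leading eigenvector of Inverse(W) * B.

```julia
# Toy FDA sketch: maximize p'Bp / p'Wp via eigen decomposition of inv(W) * B.
using LinearAlgebra, Statistics

X1 = randn(30, 3) .+ 1.0                  # class 1
X2 = randn(30, 3) .- 1.0                  # class 2
m1 = vec(mean(X1, dims = 1))
m2 = vec(mean(X2, dims = 1))
m = (m1 + m2) / 2                         # overall center (balanced classes)
B = ((m1 - m) * (m1 - m)' + (m2 - m) * (m2 - m)') / 2   # "Between" covariance
W = (cov(X1) + cov(X2)) / 2               # "Within" covariance (pooled)
vals, vecs = eigen(inv(W) * B)
p = real.(vecs[:, argmax(real.(vals))])   # leading discriminant direction ("LD1")
ratio = (p' * B * p) / (p' * W * p)       # the maximized quotient p'Bp / p'Wp
```

By construction, `ratio` is at least as large as the quotient obtained with any other direction vector.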
16 changes: 7 additions & 9 deletions src/occod.jl
@@ -2,18 +2,18 @@
occod(; kwargs...)
occod(fitm, X; kwargs...)
One-class classification using PCA/PLS orthogonal distance (OD).
* `fitm` : The preliminary model (e.g. PCA; object `fitm`) that was fitted
on the training data assumed to represent the training class.
* `X` : Training X-data (n, p), on which was fitted
the model `fitm`.
* `fitm` : The preliminary model (e.g. object `fitm` returned by function
`pcasvd`) that was fitted on the training data assumed to represent
the training class.
* `X` : Training X-data (n, p), on which was fitted the model `fitm`.
Keyword arguments:
* `mcut` : Type of cutoff. Possible values are: `:mad`, `:q`.
See thereafter.
* `cri` : When `mcut` = `:mad`, a constant. See thereafter.
* `risk` : When `mcut` = `:q`, a risk-I level. See thereafter.
In this method, the outlierness `d` of an observation
is the orthogonal distance (OD = "X-residuals") of this
is the orthogonal distance (= 'X-residuals') of this
observation, i.e. the Euclidean distance between the observation
and its projection on the score plane defined by the fitted
(e.g. PCA) model (e.g. Hubert et al. 2005, Van Branden & Hubert
@@ -117,8 +117,7 @@ function occod(fitm, X; kwargs...)
par = recovkw(ParOcc, kwargs).par
@assert 0 <= par.risk <= 1 "Argument 'risk' must ∈ [0, 1]."
E = xresid(fitm, X)
d2 = vec(sum(E .* E, dims = 2))
d = sqrt.(d2)
d = rownorm(E)
par.mcut == :mad ? cutoff = median(d) + par.cri * madv(d) : nothing
par.mcut == :q ? cutoff = quantile(d, 1 - par.risk) : nothing
e_cdf = StatsBase.ecdf(d)
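The OD logic in this hunk can be sketched with stdlib tools only (a minimal sketch, assuming a plain SVD-based PCA; `Xtrain`, `nlv`, `cri`, `risk` are illustrative names, not the package API):

```julia
# Orthogonal distances (OD) and the two cutoff types, stdlib only.
using LinearAlgebra, Statistics

Xtrain = randn(100, 10)
nlv = 3                                   # nb. principal components
xmeans = vec(mean(Xtrain, dims = 1))
Xc = Xtrain .- xmeans'                    # column-centered data
V = svd(Xc).V[:, 1:nlv]                   # loadings
E = Xc - Xc * V * V'                      # X-residuals (orthogonal part)
d = [norm(r) for r in eachrow(E)]         # OD = row norms of E (cf. rownorm)

# :mad-type cutoff: median(d) + cri * MAD(d), with the scaled MAD
cri = 3
madd = 1.4826 * median(abs.(d .- median(d)))
cutoff_mad = median(d) + cri * madd
# :q-type cutoff: empirical quantile at a risk-I level
risk = 0.05
cutoff_q = quantile(d, 1 - risk)
dstand = d / cutoff_mad                   # <= 1 => "in", > 1 => "out"
```

The `rownorm(E)` call in the diff computes the same row-wise Euclidean norms as the comprehension above, replacing the earlier `sqrt.(sum(E .* E, dims = 2))` form.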
@@ -136,8 +135,7 @@ Compute predictions from a fitted model.
function predict(object::Occod, X)
E = xresid(object.fitm, X)
m = nro(E)
d2 = vec(sum(E .* E, dims = 2))
d = sqrt.(d2)
d = rownorm(E)
p_val = pval(object.e_cdf, d)
d = DataFrame(d = d, dstand = d / object.cutoff, pval = p_val)
pred = [if d.dstand[i] <= 1 "in" else "out" end for i = 1:m]
5 changes: 3 additions & 2 deletions src/occsd.jl
@@ -2,8 +2,9 @@
occsd(; kwargs...)
occsd(fitm; kwargs...)
One-class classification using PCA/PLS score distance (SD).
* `fitm` : The preliminary model (e.g. PCA; object `fitm`) that was fitted
on the training data assumed to represent the training class.
* `fitm` : The preliminary model (e.g. object `fitm` returned by function
`pcasvd`) that was fitted on the training data assumed to represent
the training class.
Keyword arguments:
* `mcut` : Type of cutoff. Possible values are: `:mad`, `:q`.
See thereafter.
20 changes: 10 additions & 10 deletions src/occsdod.jl
Original file line number Diff line number Diff line change
@@ -1,21 +1,21 @@
"""
occsdod(; kwargs...)
occsdod(object, X; kwargs...)
One-class classification using a compromise between
PCA/PLS score (SD) and orthogonal (OD) distances.
* `fitm` : The preliminary model (e.g. PCA; object `fitm`) that was fitted
on the training data assumed to represent the training class.
* `X` : Training X-data (n, p), on which was fitted
the model `fitm`.
One-class classification using a consensus between
PCA/PLS score and orthogonal (SD and OD) distances.
* `fitm` : The preliminary model (e.g. object `fitm` returned by function
`pcasvd`) that was fitted on the training data assumed to represent
the training class.
* `X` : Training X-data (n, p), on which was fitted the model `fitm`.
Keyword arguments:
* `mcut` : Type of cutoff. Possible values are: `:mad`, `:q`.
See thereafter.
* `cri` : When `mcut` = `:mad`, a constant. See thereafter.
* `risk` : When `mcut` = `:q`, a risk-I level. See thereafter.
In this method, the outlierness `d` of a given observation
is a compromise between the score distance (SD) and the
orthogonal distance (OD). The compromise is computed from the
is a consensus between the score distance (SD) and the
orthogonal distance (OD). The consensus is computed from the
standardized distances by:
* `dstand` = sqrt(`dstand_sd` * `dstand_od`).
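This geometric-mean consensus is easy to sketch with toy values (illustrative only; in the package the standardized distances come from the `occsd` and `occod` fits):

```julia
# Consensus of standardized SD and OD distances (geometric mean).
dstand_sd = [0.5, 1.2, 0.9]               # toy standardized score distances
dstand_od = [0.8, 1.5, 0.7]               # toy standardized orthogonal distances
dstand = sqrt.(dstand_sd .* dstand_od)    # consensus distance
pred = [dstand[i] <= 1 ? "in" : "out" for i in eachindex(dstand)]
# pred == ["in", "out", "in"]
```

An observation is declared "out" as soon as the consensus distance exceeds 1, i.e. when the product of the two standardized distances exceeds 1.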
@@ -31,7 +31,7 @@ function occsdod(fitm, X; kwargs...)
fitmod = occod(fitm, X; kwargs...)
sd = fitmsd.d
od = fitmod.d
z = sqrt.(sd.dstand .* od.dstand)
z = [sqrt(sd.dstand[i] * od.dstand[i]) for i in eachindex(sd.dstand)]
nam = string.(names(sd), "_sd")
rename!(sd, nam)
nam = string.(names(od), "_od")
@@ -56,7 +56,7 @@ function predict(object::Occsdod, X)
nam = string.(names(od), "_od")
rename!(od, nam)
d = hcat(sd, od)
d.dstand = sqrt.(sd.dstand_sd .* od.dstand_od)
d.dstand = [sqrt(sd.dstand_sd[i] * od.dstand_od[i]) for i in eachindex(sd.dstand_sd)]
pred = [if d.dstand[i] <= 1 "in" else "out" end for i = 1:m]
pred = reshape(pred, m, 1)
(pred = pred, d)
17 changes: 8 additions & 9 deletions src/pcaout.jl
@@ -17,18 +17,18 @@ Keyword arguments:
Robust PCA combining outlyingness measures and weighted PCA (WPCA).
Observations (`X`-rows) receive weights depending on outlyingness
with the objective to remove the effect of multivariate `X`-outliers
that have potentially bad leverages.
The objective is to remove the effect of multivariate `X`-outliers
that have potentially bad leverages. Observations (`X`-rows) receive
weights depending on two outlyingness indicators:
1) The Stahel-Donoho outlyingness (Maronna and Yohai, 1995) is computed
(function `outstah`). The proportion `prm` of the observations with the
(function `outstah`) on `X`. The proportion `prm` of the observations with the
highest outlyingness values receive a weight w1 = 0 (the others receive a weight w1 = 1).
2) An outlyingness computed from the Euclidean distance to center
2) An outlyingness based on the Euclidean distance to center
(function `outeucl`) is computed. The proportion `prm` of the observations
with the highest outlyingness values receive a weight w2 = 0
(the others receive a weight w2 = 1).
3) The final weights of the observations are computed by weights.w * w1 * w2
that is used in a weighted PCA.
The final weights of the observations are computed as weights.w * w1 * w2
and are used in a weighted PCA.
By default, the function uses `prm = .3` (such as in the ROBPCA algorithm
of Hubert et al. 2005, 2009).
@@ -92,8 +92,7 @@ function pcaout!(X::Matrix, weights::Weight; kwargs...)
n, p = size(X)
nlvout = 30
P = rand(0:1, p, nlvout)
d = similar(X, n)
d .= outstah(X, P; scal = par.scal).d
d = outstah(X, P; scal = par.scal).d
w = wtal(d; a = quantile(d, 1 - par.prm))
d .= outeucl(X; scal = par.scal).d
w .*= wtal(d; a = quantile(d, 1 - par.prm))
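The double 0/1 weighting in this hunk can be sketched with the standard library only (a simplified sketch: the package's `outstah`, `outeucl`, and `wtal` are replaced here by hand-rolled equivalents, and all names are illustrative):

```julia
# Double outlyingness weighting: Stahel-Donoho-type step, then Euclidean step.
using LinearAlgebra, Statistics

X = randn(50, 5)
prm = 0.3                                 # proportion trimmed at each step
# (1) Stahel-Donoho-type outlyingness over random projection directions
P = randn(5, 30)                          # 30 random directions
d1 = zeros(50)
for j in axes(P, 2)
    t = X * P[:, j]                       # projected data
    madt = 1.4826 * median(abs.(t .- median(t)))
    d1 .= max.(d1, abs.(t .- median(t)) ./ madt)   # worst-case outlyingness
end
w1 = d1 .<= quantile(d1, 1 - prm)         # Talworth-type weights: 0 or 1
# (2) Euclidean distance to the column-mean center
center = vec(mean(X, dims = 1))
d2 = [norm(r - center) for r in eachrow(X)]
w2 = d2 .<= quantile(d2, 1 - prm)
w = w1 .* w2                              # final 0/1 weights for a weighted PCA
```

Observations flagged by either indicator get weight 0, so they are discarded from the subsequent weighted PCA fit.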
