Aim. This vignette shows how to fit a penalized
factor analysis model with the scad and mcp penalties using the routines
in the penfa
package. The penalty encourages
sparsity in the factor loading matrix. Since the scad and
mcp penalties cannot be used with the automatic tuning procedure (see
Geminiani et al., 2021 for details), the optimal tuning parameter will
be found through grid searches.
Data. For illustration purposes, we use the
cross-cultural data set ccdata
containing the standardized
ratings on 12 items concerning organizational citizenship behavior.
Employees from different countries were asked to rate their attitudes
towards helping other employees and giving suggestions for improved work
conditions. The items are thought to measure two latent factors:
help, defined by the first seven items (h1
to h7
), and voice, represented by the last
five items (v1
to v5
). See
?ccdata
for details.
This data set is a standardized version of the one in the ccpsyc
package, and only considers employees from Lebanon and Taiwan (i.e.,
"LEB"
, "TAIW"
). This vignette is meant as a
demo of the capabilities of penfa
; please refer to Fischer
et al. (2019) and Fischer and Karl (2019) for a description and analysis
of these data.
Let us load and inspect ccdata
.
library(penfa)
data(ccdata)
summary(ccdata)
## country h1 h2 h3 h4
## Length:767 Min. :-2.62004 Min. :-2.9034 Min. :-2.63082 Min. :-3.0441
## Class :character 1st Qu.:-0.69516 1st Qu.:-0.2163 1st Qu.:-0.70356 1st Qu.:-0.2720
## Mode :character Median :-0.05354 Median : 0.4554 Median :-0.06114 Median : 0.4211
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.58809 3rd Qu.: 0.4554 3rd Qu.: 0.58128 3rd Qu.: 0.4211
## Max. : 1.22971 Max. : 1.1272 Max. : 1.22370 Max. : 1.1141
## h5 h6 h7 v1 v2
## Min. :-2.9105 Min. :-2.9541 Min. :-2.8364 Min. :-2.627694 Min. :-2.674430
## 1st Qu.:-0.8662 1st Qu.:-0.9092 1st Qu.:-0.7860 1st Qu.:-0.660770 1st Qu.:-0.671219
## Median : 0.4966 Median : 0.4541 Median :-0.1025 Median :-0.005129 Median :-0.003482
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.000000 Mean : 0.000000
## 3rd Qu.: 0.4966 3rd Qu.: 0.4541 3rd Qu.: 0.5810 3rd Qu.: 0.650512 3rd Qu.: 0.664255
## Max. : 1.1781 Max. : 1.1358 Max. : 1.2645 Max. : 1.306154 Max. : 1.331992
## v3 v4 v5
## Min. :-2.65214 Min. :-2.65722 Min. :-2.51971
## 1st Qu.:-0.68800 1st Qu.:-0.68041 1st Qu.:-0.61127
## Median :-0.03329 Median :-0.02148 Median : 0.02488
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.62142 3rd Qu.: 0.63746 3rd Qu.: 0.66103
## Max. : 1.27613 Max. : 1.29639 Max. : 1.29718
Before fitting the model, we need to write a model syntax describing
the relationships between the items and the latent factors. To
facilitate its formulation, the rules for the syntax specification
broadly follow the ones required by lavaan. The syntax
must be enclosed in single quotes ' '
.
syntax = 'help =~ h1 + h2 + h3 + h4 + h5 + h6 + h7 + 0*v1 + v2 + v3 + v4 + v5
voice =~ 0*h1 + h2 + h3 + h4 + h5 + h6 + h7 + v1 + v2 + v3 + v4 + v5'
The factors help
and voice
appear on the
left-hand side, whereas the observed variables appear on the right-hand side.
Following the rationale in Geminiani et al. (2021), we only specify the
minimum number of identification constraints. We are setting the scales
of the factors by fixing their factor variances to 1. This can be done
in one of two ways: 1) by adding 'help ~~ 1*help'
and
'voice ~~ 1*voice'
to the syntax above; or 2) by setting
the argument std.lv = TRUE
in the fitting function (see
below). To avoid rotational indeterminacy, we fix one loading per factor to
zero. Parameters can be easily fixed to user-defined values through the
pre-multiplication mechanism. By default, unique variances are
automatically added to the model, and the factors are allowed to
correlate. These specifications can be modified by altering the syntax
(see ?penfa
for details on how to write the model
syntax).
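For instance, option 1) corresponds to the following equivalent specification (the object name syntax.id is just illustrative):
syntax.id <- 'help  =~ h1 + h2 + h3 + h4 + h5 + h6 + h7 + 0*v1 + v2 + v3 + v4 + v5
              voice =~ 0*h1 + h2 + h3 + h4 + h5 + h6 + h7 + v1 + v2 + v3 + v4 + v5
              help  ~~ 1*help
              voice ~~ 1*voice'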
The core of the package is given by the penfa
function,
a short form for PENalized Factor Analysis, which implements the
framework discussed in Geminiani et al. (2021). If users opt for
either the scad or the mcp penalty, they need to
use strategy = "fixed"
since the automatic procedure is not
feasible.
We start off with the scad. In the function call, we now specify
pen.shrink = "scad"
, and we provide through the
eta
argument the fixed value of the tuning parameter to be
employed during optimization (here, for instance, 0.05). The name given
to the tuning value (here, "lambda", i.e., the factor loading matrix)
indicates the parameter matrix to be penalized.
All of its elements are penalized, which means here that the
penalization is applied to all factor loadings (except the ones fixed
for identification). The scad penalty relies on an additional shape
parameter, which is set by default to 3.7 (Fan and Li 2001). This value
can be conveniently modified through the a.scad
argument.
See ?penfaOptions
for additional details on the possible
options.
scad.fit <- penfa(## factor model
model = syntax,
data = ccdata,
std.lv = TRUE,
## penalization
pen.shrink = "scad",
eta = list(shrink = c("lambda" = 0.05), diff = c("none" = 0)),
## fixed tuning
strategy = "fixed")
##
## Largest absolute gradient value: 27.97150845
## Fisher information matrix is positive definite
## Eigenvalue range: [164.74, 382028.4]
## Trust region iterations: 24
## Factor solution: admissible
## Computing VCOV ... done.
## Effective degrees of freedom: 27.94395
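As a sketch of how the shape parameter can be altered, the call below passes a.scad directly to penfa (the value 2.5 is purely illustrative, not a recommendation; see ?penfaOptions for the admissible options):
# Same scad fit with a non-default shape parameter (illustrative value)
scad.fit.shape <- penfa(model = syntax, data = ccdata, std.lv = TRUE,
                        pen.shrink = "scad",
                        eta = list(shrink = c("lambda" = 0.05), diff = c("none" = 0)),
                        strategy = "fixed", a.scad = 2.5, verbose = FALSE)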
In order to find the optimal value of the tuning parameter, a grid search needs to be conducted, and the optimal model is the one with the lowest GBIC (Generalized Bayesian Information Criterion). For demo purposes, we use a grid of 51 values evenly spaced between 0 and 0.15. However, for accurate analyses, please consider finer grids (of at least 200 elements) with an upper bound reasonable for the data at hand.
# Grid of values for tuning parameter
eta.grid <- seq(0, 0.15, length.out = 51)
# Return GBIC from a converged and admissible penfa model with fixed tuning
penfa.fixedTun <- function(eta, penalty, ...){
  fitted <- penfa(model = syntax, data = ccdata,
                  std.lv = TRUE, pen.shrink = penalty,
                  eta = list(shrink = c("lambda" = eta), diff = c("none" = 0)),
                  strategy = "fixed", verbose = FALSE, ...)
  # Return NA (rather than NULL) for non-converged or inadmissible fits,
  # so that sapply() always yields a numeric vector
  if(all(fitted@Vcov$solution) && fitted@Vcov$admissibility)
    return(BIC(fitted))
  else
    return(NA)
}
# additional penfaOptions can be passed
GBIC.scad <- sapply(eta.grid, penfa.fixedTun, penalty = "scad")
The optimal tuning parameter is the one generating the penalized
model with the lowest GBIC
.
optimtun.scad <- eta.grid[[which.min(GBIC.scad)]]
# To plot GBIC across tuning values
# p <- plotly::plot_ly(x = eta.grid, y = GBIC.scad, type = 'scatter', mode = 'lines')
# plotly::layout(p, xaxis = list(showline = TRUE),
# title = list(text = "GBIC values across tuning parameters"))
We can ultimately fit the model with the optimal tuning parameter
(here, 0.114). The summary
method provides details on the model characteristics, the optimization
and penalization procedures, as well as the parameter estimates with
associated standard errors and
confidence intervals. The Type column distinguishes between the
fixed parameters set to specific values for identification, the
free parameters that have been estimated through ordinary
maximum likelihood, and the penalized (pen) parameters. The
standard errors here have been computed as the square root of the
inverse of the penalized Fisher information matrix (Geminiani et al.,
2021). The last columns report 95% confidence intervals (CI) for the
model parameters. Standard errors and CI of the penalized parameters
shrunken to zero are not displayed. A different significance level can
be specified through the level
argument in the
summary
call.
scad.fit <- penfa(## factor model
model = syntax,
data = ccdata,
std.lv = TRUE,
## penalization
pen.shrink = "scad",
# optimal tuning
eta = list(shrink = c("lambda" = optimtun.scad), diff = c("none" = 0)),
strategy = "fixed",
verbose = FALSE)
summary(scad.fit)
## penfa 0.1.1 reached convergence
##
## Number of observations 767
## Number of groups 1
## Number of observed variables 12
## Number of latent factors 2
##
## Estimator PMLE
## Optimization method trust-region
## Information fisher
## Strategy fixed
## Number of iterations 57
## Number of parameters:
## Free 13
## Penalized 22
## Effective degrees of freedom 26.424
## GIC 17224.457
## GBIC 17347.131
##
## Penalty function:
## Sparsity scad
##
## Additional tuning parameter
## scad 3.7
##
## Optimal tuning parameter:
## Sparsity
## - Factor loadings 0.114
##
##
## Parameter Estimates:
##
## Latent Variables:
## Type Estimate Std.Err 2.5% 97.5%
## help =~
## h1 pen 0.779 0.030 0.720 0.839
## h2 pen 0.875 0.029 0.818 0.931
## h3 pen 0.788 0.030 0.729 0.847
## h4 pen 0.915 0.031 0.854 0.976
## h5 pen 0.831 0.034 0.765 0.897
## h6 pen 0.812 0.036 0.740 0.883
## h7 pen 0.507 0.050 0.409 0.604
## v1 fixed 0.000 0.000 0.000
## v2 pen 0.000
## v3 pen 0.000
## v4 pen 0.000
## v5 pen -0.000
## voice =~
## h1 fixed 0.000 0.000 0.000
## h2 pen -0.002
## h3 pen 0.000
## h4 pen -0.020
## h5 pen 0.039
## h6 pen 0.082 0.026 0.031 0.132
## h7 pen 0.370 0.049 0.274 0.465
## v1 pen 0.862 0.029 0.806 0.918
## v2 pen 0.881 0.028 0.826 0.937
## v3 pen 0.852 0.029 0.795 0.909
## v4 pen 0.853 0.029 0.796 0.910
## v5 pen 0.814 0.030 0.756 0.872
##
## Covariances:
## Type Estimate Std.Err 2.5% 97.5%
## help ~~
## voice free 0.879 0.011 0.858 0.901
##
## Variances:
## Type Estimate Std.Err 2.5% 97.5%
## .h1 free 0.384 0.021 0.343 0.426
## .h2 free 0.231 0.014 0.203 0.258
## .h3 free 0.371 0.021 0.330 0.411
## .h4 free 0.186 0.012 0.162 0.210
## .h5 free 0.235 0.014 0.208 0.263
## .h6 free 0.201 0.012 0.177 0.225
## .h7 free 0.265 0.015 0.236 0.294
## .v1 free 0.244 0.015 0.214 0.273
## .v2 free 0.208 0.013 0.181 0.234
## .v3 free 0.261 0.016 0.230 0.292
## .v4 free 0.260 0.016 0.229 0.291
## .v5 free 0.326 0.019 0.289 0.363
## help fixed 1.000 1.000 1.000
## voice fixed 1.000 1.000 1.000
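For instance, 90% confidence intervals can be requested via the level argument:
# 90% (rather than the default 95%) confidence intervals
summary(scad.fit, level = 0.90)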
The penalty matrix can be inspected and plotted as shown in the dedicated vignette (vignette("plotting-penalty-matrix")).
The implied moments (here, the covariance matrix) can be found via
the fitted
method.
implied <- fitted(scad.fit)
implied
## $cov
## h1 h2 h3 h4 h5 h6 h7 v1 v2 v3 v4 v5
## h1 0.992
## h2 0.680 0.993
## h3 0.614 0.688 0.991
## h4 0.699 0.783 0.707 0.992
## h5 0.674 0.755 0.682 0.776 0.984
## h6 0.688 0.771 0.696 0.792 0.765 0.983
## h7 0.648 0.726 0.655 0.745 0.723 0.742 0.988
## v1 0.591 0.661 0.597 0.676 0.663 0.686 0.703 0.987
## v2 0.604 0.676 0.611 0.692 0.678 0.701 0.719 0.760 0.985
## v3 0.584 0.654 0.590 0.669 0.656 0.678 0.695 0.735 0.751 0.987
## v4 0.584 0.654 0.591 0.669 0.656 0.678 0.695 0.735 0.752 0.727 0.987
## v5 0.558 0.624 0.564 0.639 0.626 0.647 0.663 0.702 0.717 0.693 0.694 0.989
Lastly, the factor scores can be calculated via the
penfaPredict
function.
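A minimal call is sketched below; see ?penfaPredict for the available options.
# Factor scores for the help and voice factors
fscores <- penfaPredict(scad.fit)
head(fscores)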
We can fit a penalized factor model with the mcp penalty in a way
similar to the scad. By default the shape parameter of the mcp is set to
3. This value can be conveniently modified through the
a.mcp
argument. See ?penfaOptions
for
additional details on the possible options.
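Because penfa.fixedTun forwards additional penfaOptions through its ... argument, a non-default mcp shape could also be used during the search (2.5 is again an illustrative value):
# Grid search with a non-default mcp shape parameter (illustrative)
GBIC.mcp.shape <- sapply(eta.grid, penfa.fixedTun, penalty = "mcp", a.mcp = 2.5)
In what follows, however, we keep the default shape and search only over the tuning parameter.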
GBIC.mcp <- sapply(eta.grid, penfa.fixedTun, penalty = "mcp")
optimtun.mcp <- eta.grid[[which.min(GBIC.mcp)]]
optimtun.mcp
## [1] 0.141
A tuning value of 0.141 yields the model with the lowest GBIC.
mcp.fit <- penfa(## factor model
model = syntax,
data = ccdata,
std.lv = TRUE,
## penalization
pen.shrink = "mcp",
# optimal tuning
eta = list(shrink = c("lambda" = optimtun.mcp), diff = c("none" = 0)),
strategy = "fixed",
verbose = FALSE)
summary(mcp.fit)
## penfa 0.1.1 reached convergence
##
## Number of observations 767
## Number of groups 1
## Number of observed variables 12
## Number of latent factors 2
##
## Estimator PMLE
## Optimization method trust-region
## Information fisher
## Strategy fixed
## Number of iterations 39
## Number of parameters:
## Free 13
## Penalized 22
## Effective degrees of freedom 26.428
## GIC 17224.179
## GBIC 17346.872
##
## Penalty function:
## Sparsity mcp
##
## Additional tuning parameter
## mcp 3
##
## Optimal tuning parameter:
## Sparsity
## - Factor loadings 0.141
##
##
## Parameter Estimates:
##
## Latent Variables:
## Type Estimate Std.Err 2.5% 97.5%
## help =~
## h1 pen 0.779 0.030 0.720 0.839
## h2 pen 0.874 0.029 0.818 0.931
## h3 pen 0.788 0.030 0.729 0.847
## h4 pen 0.914 0.030 0.854 0.974
## h5 pen 0.828 0.033 0.762 0.893
## h6 pen 0.807 0.037 0.735 0.879
## h7 pen 0.506 0.050 0.408 0.604
## v1 fixed 0.000 0.000 0.000
## v2 pen 0.000
## v3 pen 0.000
## v4 pen 0.000
## v5 pen -0.000
## voice =~
## h1 fixed 0.000 0.000 0.000
## h2 pen -0.001
## h3 pen 0.000
## h4 pen -0.018
## h5 pen 0.042
## h6 pen 0.087 0.027 0.035 0.139
## h7 pen 0.371 0.049 0.275 0.467
## v1 pen 0.862 0.029 0.806 0.919
## v2 pen 0.882 0.028 0.826 0.937
## v3 pen 0.852 0.029 0.795 0.909
## v4 pen 0.853 0.029 0.796 0.910
## v5 pen 0.814 0.030 0.756 0.872
##
## Covariances:
## Type Estimate Std.Err 2.5% 97.5%
## help ~~
## voice free 0.879 0.011 0.857 0.900
##
## Variances:
## Type Estimate Std.Err 2.5% 97.5%
## .h1 free 0.384 0.021 0.343 0.426
## .h2 free 0.231 0.014 0.203 0.258
## .h3 free 0.370 0.021 0.330 0.411
## .h4 free 0.186 0.012 0.162 0.210
## .h5 free 0.236 0.014 0.208 0.263
## .h6 free 0.201 0.012 0.177 0.226
## .h7 free 0.265 0.015 0.236 0.294
## .v1 free 0.244 0.015 0.214 0.273
## .v2 free 0.208 0.013 0.181 0.234
## .v3 free 0.261 0.016 0.230 0.292
## .v4 free 0.260 0.016 0.229 0.291
## .v5 free 0.326 0.019 0.289 0.363
## help fixed 1.000 1.000 1.000
## voice fixed 1.000 1.000 1.000
Implementing the above approach in the multiple-group case would
imply carrying out grid searches in three dimensions to find the optimal
tuning parameter vector. This is clearly not advisable, as it would
introduce further complications and possibly new computational problems
and instabilities. For this reason, we suggest that users rely on the
automatic tuning parameter selection procedure, whose applicability is
demonstrated in the vignettes for single-group
(vignette("automatic-tuning-selection")) and multiple-group
(vignette("multiple-group-analysis")) penalized models.
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] penfa_0.1.1 rmarkdown_2.29
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 pspline_1.0-21 ismev_1.42
## [4] xfun_0.50 bslib_0.9.0 ggplot2_3.5.1
## [7] psych_2.4.12 GJRM_0.2-6.7 lattice_0.22-6
## [10] numDeriv_2016.8-1.1 vctrs_0.6.5 tools_4.4.2
## [13] generics_0.1.3 stats4_4.4.2 parallel_4.4.2
## [16] tibble_3.2.1 pkgconfig_2.0.3 Matrix_1.7-2
## [19] lifecycle_1.0.4 scam_1.2-18 compiler_4.4.2
## [22] munsell_0.5.1 mnormt_2.1.1 mitools_2.4
## [25] survey_4.4-2 htmltools_0.5.8.1 sys_3.4.3
## [28] buildtools_1.0.0 sass_0.4.9 evd_2.3-7.1
## [31] yaml_2.3.10 gmp_0.7-5 pillar_1.10.1
## [34] jquerylib_0.1.4 MASS_7.3-64 cachem_1.1.0
## [37] trust_0.1-8 abind_1.4-8 nlme_3.1-167
## [40] tidyselect_1.2.1 digest_0.6.37 mvtnorm_1.3-3
## [43] dplyr_1.1.4 maketools_1.3.2 splines_4.4.2
## [46] magic_1.6-1 pcaPP_2.0-5 gsl_2.1-8
## [49] gamlss.dist_6.1-1 fastmap_1.2.0 grid_4.4.2
## [52] copula_1.1-5 colorspace_2.1-1 cli_3.6.4
## [55] magrittr_2.0.3 survival_3.8-3 Rmpfr_1.0-0
## [58] scales_1.3.0 distrEx_2.9.6 VineCopula_2.6.0
## [61] startupmsg_1.0.0 matrixStats_1.5.0 VGAM_1.1-13
## [64] evaluate_1.0.3 knitr_1.49 distr_2.9.7
## [67] ADGofTest_0.3 stabledist_0.7-2 mgcv_1.9-1
## [70] rlang_1.1.5 Rcpp_1.0.14 glue_1.8.0
## [73] DBI_1.2.3 jsonlite_1.8.9 R6_2.6.1
## [76] sfsmisc_1.1-20
Fan, J., & Li, R. (2001). “Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties.” Journal of the American Statistical Association, 96(456), 1348-1360. https://doi.org/10.1198/016214501753382273
Fischer, R., Ferreira, M. C., Van Meurs, N. et al. (2019). “Does Organizational Formalization Facilitate Voice and Helping Organizational Citizenship Behaviors? It Depends on (National) Uncertainty Norms.” Journal of International Business Studies, 50(1), 125-134. https://doi.org/10.1057/s41267-017-0132-6
Fischer, R., & Karl, J. A. (2019). “A Primer to (Cross-Cultural) Multi-Group Invariance Testing Possibilities in R.” Frontiers in psychology, 10, 1507. https://doi.org/10.3389/fpsyg.2019.01507
Geminiani, E. (2020). “A Penalized Likelihood-Based Framework for Single and Multiple-Group Factor Analysis Models.” PhD thesis, University of Bologna. http://amsdottorato.unibo.it/9355/
Geminiani, E., Marra, G., & Moustaki, I. (2021). “Single- and Multiple-Group Penalized Factor Analysis: A Trust-Region Algorithm Approach with Integrated Automatic Multiple Tuning Parameter Selection.” Psychometrika, 86(1), 65-95. https://doi.org/10.1007/s11336-021-09751-8