refs:
Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. […] Factor analysis searches for such joint variations in response to unobserved latent(*) variables. The observed variables are modelled as linear combinations of the potential factors, plus “error” terms. The information gained about the interdependencies between observed variables can be used later to reduce the set of variables in a dataset. wikipedia.
(*) Latent variables (as opposed to observable variables) are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured). wikipedia. Since factors are latent, we cannot estimate them directly with methods like regression.
So, we want to investigate whether observable variables \( X_1, X_2 \ldots X_N \) are linearly related to a small number of unobservable (latent) factors \( F_1, F_2 \ldots F_K \), with \( K \ll N \).
\[ \begin{array}{l} X_1 = w_{10} + w_{11} F_1 + \ldots + w_{1K} F_K + e_1 \\ X_2 = w_{20} + w_{21} F_1 + \ldots + w_{2K} F_K + e_2 \\ \ldots \\ X_N = w_{N0} + w_{N1} F_1 + \ldots + w_{NK} F_K + e_N \\ \end{array} \]
where \( e_i \) are error terms, so the hypothesized relations are not exact.
The parameters \( w_{ij} \) are called loadings.
There are the following assumptions:
(A1) The error terms \( e_i \) are independent of one another and of the factors, with \( E[e_i] = 0 \) and \( Var(e_i) = \sigma_i^2 \).
(A2) The unobservable factors \( F_k \) are independent of one another, with \( E[F_k] = 0 \) and \( Var(F_k) = 1 \).
(A2) states that the latent variables do not influence one another, which might be too strong a condition. There are more advanced models where this is relaxed (there is an example below, using an oblique rotation).
With the loadings it is possible to compute the covariance of any two observed variables:
\[ Cov(X_i, X_j) = \sum_{k=1}^K w_{ik}w_{jk} \]
and the variance of each \( X_i \):
\[ Var(X_i) = \sum_{k=1}^K w_{ik}^2 + \sigma_i^2 \]
where the sum is the communality of the variable: the variance explained by the common factors \( F_k \). The remaining variance \( \sigma_i^2 \) is called the specific variance.
With all these values (if we had the loadings, which we do not), we could build a theoretical variance-covariance matrix \( \Omega \). What we do have is the data, which we can use to compute the observed variance-covariance matrix \( S \). If the model assumptions hold, it is possible to estimate the loadings \( w_{ij} \) of \( \Omega \) given \( S \). The standard way is the principal components method, which uses eigenvalues and eigenvectors to bring the estimate of the total communality as close as possible to the total of the observed variances (check example 2).
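To make this concrete, here is a minimal sketch with made-up loadings and specific variances (purely hypothetical, not related to the datasets below) that builds the implied matrix \( \Omega = W W^T + \Psi \), where \( \Psi \) is the diagonal matrix of the specific variances \( \sigma_i^2 \):
# hypothetical loadings for N=3 observed variables and K=2 factors
W <- matrix(c(0.8, 0.1,
              0.7, 0.2,
              0.1, 0.9), nrow=3, byrow=TRUE)
psi <- c(0.3, 0.4, 0.2)            # hypothetical specific variances sigma_i^2
Omega <- W %*% t(W) + diag(psi)    # theoretical variance-covariance matrix
rowSums(W^2)                       # communalities of the three variables
diag(Omega)                        # communality + specific variance for each variable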
Just one extra note: the loadings are not unique (in fact there are infinitely many valid solutions). This means that if the algorithm finds one solution that does not reveal the hypothesized structure of the problem, it is possible to apply a 'rotation' to find another set of loadings that might provide a better interpretation, or one more consistent with prior expectations about the dataset.
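Reusing the hypothetical W above, we can check that an orthogonal rotation of the loadings leaves the implied covariances untouched, which is why we are free to pick whichever rotation is easiest to interpret:
theta <- pi/6
rot <- matrix(c(cos(theta), -sin(theta),
                sin(theta),  cos(theta)), nrow=2)  # an orthogonal (rotation) matrix
W.rot <- W %*% rot                                 # rotated loadings
all.equal(W %*% t(W), W.rot %*% t(W.rot))          # TRUE: same implied covariances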
There are a number of rotations in the literature. Rotation methods can be described as orthogonal (e.g., varimax, quartimax), which do not allow the resulting factors to be correlated, and oblique (e.g., oblimin, promax), which do allow the resulting factors to be correlated.
How does Factor Analysis differ from PCA? Both methods have the aim of reducing the dimensionality of a vector of random variables. But while Factor Analysis assumes a model (that may or may not fit the data), PCA is just a data transformation and for this reason it always exists. Furthermore, while Factor Analysis aims at explaining covariances (or correlations), PCA concentrates on variances. http://www2.stat.unibo.it/montanari/Didattica/Multivariate/CA.pdf
Our sample dataset contains a hypothetical sample of 300 responses on 6 items from a survey of college students' favorite subject matter. The items range in value from 1 to 5, which represent a scale from Strongly Dislike to Strongly Like. Our 6 items asked students to rate their liking of different college subject matter areas, including biology (BIO), geology (GEO), chemistry (CHEM), algebra (ALG), calculus (CALC), and statistics (STAT).
my.data <- read.csv("dataset_EFA.csv")
# if the data has NAs, it is better to omit them:
# my.data <- na.omit(my.data)
head(my.data)
BIO GEO CHEM ALG CALC STAT
1 1 1 1 1 1 1
2 4 4 3 4 4 4
3 2 1 3 4 1 1
4 2 3 2 4 4 3
5 3 1 2 2 3 4
6 1 1 1 4 4 4
Package stats has a function factanal() which can be used to perform factor analysis:
n.factors <- 2
fit <- factanal(my.data,
n.factors, # number of factors to extract
scores=c("regression"),
rotation="none")
print(fit, digits=2, cutoff=.3, sort=TRUE)
Call:
factanal(x = my.data, factors = n.factors, scores = c("regression"), rotation = "none")
Uniquenesses:
BIO GEO CHEM ALG CALC STAT
0.25 0.37 0.25 0.37 0.05 0.71
Loadings:
Factor1 Factor2
ALG 0.78
CALC 0.97
STAT 0.53
BIO 0.30 0.81
GEO 0.74
CHEM 0.84
Factor1 Factor2
SS loadings 2.06 1.93
Proportion Var 0.34 0.32
Cumulative Var 0.34 0.66
Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 2.94 on 4 degrees of freedom.
The p-value is 0.568
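The uniquenesses reported above are just one minus the communalities, so (up to rounding) they can be recovered from the loadings. Note that loadings below the cutoff of 0.3 are hidden in the printout but are still stored in fit$loadings:
round(1 - apply(fit$loadings^2, 1, sum), 2)  # should approximately match the Uniquenesses table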
head(fit$scores)
Factor1 Factor2
[1,] -1.9090 -0.52367
[2,] 0.9564 0.89250
[3,] -1.5798 0.33785
[4,] 0.7909 -0.28206
[5,] -0.1128 -0.03603
[6,] 0.6902 -1.31362
# plot factor 1 by factor 2
load <- fit$loadings[,1:2]
plot(load,type="n") # set up plot
text(load,labels=names(my.data),cex=.7) # add variable names
The unrotated output maximizes the variance explained by the first factor, then by each subsequent factor, with all factors kept orthogonal to each other.
Rotation serves to make the output more understandable, by seeking so-called “Simple Structure”: a pattern of loadings where items load most strongly on one factor, and much more weakly on the other factors. E.g., the varimax rotation is an orthogonal rotation of the factor axes to maximize the variance of the squared loadings of a factor (column) on all the variables (rows) in a factor matrix, which has the effect of differentiating the original variables by extracted factor. Each factor will tend to have either large or small loadings of any particular variable. A varimax solution yields results which make it as easy as possible to identify each variable with a single factor. This is the most common rotation option. wikipedia.
fit <- factanal(my.data,
n.factors, # number of factors to extract
rotation="varimax") # 'varimax' is an ortho rotation
load <- fit$loadings[,1:2]
load
Factor1 Factor2
BIO 0.85456 0.13257
GEO 0.77933 0.13455
CHEM 0.86461 0.05545
ALG 0.03133 0.79071
CALC 0.09672 0.97108
STAT 0.16998 0.50612
plot(load,type="n") # set up plot
text(load,labels=names(my.data),cex=.7) # add variable names
Looking at both plots we see that Geology, Biology and Chemistry cluster together, as do Calculus, Algebra and Statistics. In the varimax solution the three science courses all have high loadings around 0.8 on the first factor (Factor1), while Calculus, Algebra and Statistics load highly on the second factor (Factor2). We could rename Factor1 as Science and Factor2 as Math.
Note that STAT has a much lower loading on Factor2 than ALG or CALC, and that it has a slight loading on Factor1. This suggests that Statistics is less related to the concept of Math than Algebra and Calculus are. Just below the loadings table, we can see that each factor accounted for around 30% of the variance in responses, leading to a factor solution that accounted for 66% of the total variance in students' subject matter preference.
We could also try an oblique rotation (Stats might share some variance with the Science factor). Since the previous command does not implement an oblique rotation, we'll use package psych:
# install.packages("psych")
# install.packages("GPArotation")
library(psych)
solution <- fa(r = cor(my.data), nfactors = 2, rotate = "oblimin", fm = "pa")
plot(solution,labels=names(my.data),cex=.7, ylim=c(-.1,1))
solution
Factor Analysis using method = pa
Call: fa(r = cor(my.data), nfactors = 2, rotate = "oblimin", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
PA1 PA2 h2 u2 com
BIO 0.86 0.02 0.75 0.255 1.0
GEO 0.78 0.05 0.63 0.369 1.0
CHEM 0.87 -0.05 0.75 0.253 1.0
ALG -0.04 0.81 0.65 0.354 1.0
CALC 0.01 0.96 0.92 0.081 1.0
STAT 0.13 0.50 0.29 0.709 1.1
PA1 PA2
SS loadings 2.14 1.84
Proportion Var 0.36 0.31
Cumulative Var 0.36 0.66
Proportion Explained 0.54 0.46
Cumulative Proportion 0.54 1.00
With factor correlations of
PA1 PA2
PA1 1.00 0.21
PA2 0.21 1.00
Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.
The degrees of freedom for the null model are 15 and the objective function was 2.87
The degrees of freedom for the model are 4 and the objective function was 0.01
The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.03
Fit based upon off diagonal values = 1
Measures of factor score adequacy
PA1 PA2
Correlation of scores with factors 0.94 0.96
Multiple R square of scores with factors 0.88 0.93
Minimum correlation of possible factor scores 0.77 0.86
Notice that our factors are correlated at 0.21. The choice of an oblique rotation allowed for the recognition of this relationship.
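As a side note, the factor correlation matrix can be extracted directly from the fitted object (psych stores it in the Phi element for oblique rotations):
solution$Phi   # factor correlation matrix (only present for oblique rotations)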
A crucial decision in exploratory factor analysis is how many factors to extract.
This next plot is the Cattell scree plot. It plots the components/factors on the X axis and the corresponding eigenvalues on the Y axis. As one moves to the right, toward later components, the eigenvalues drop. When the drop ceases and the curve makes an elbow toward a less steep decline, Cattell's scree test says to drop all further components after the one starting the elbow. wikipedia
# install.packages("psy")
library(psy)
scree.plot(fit$correlation)
Another package, nFactors, also offers a suite of functions to aid in this decision. The Kaiser-Guttman rule says that we should choose all factors with eigenvalue greater than \( 1 \). More methods can be found here.
# Determine Number of Factors to Extract
# install.packages("nFactors")
library(nFactors)
ev <- eigen(cor(my.data)) # get eigenvalues
ap <- parallel(subject=nrow(my.data),var=ncol(my.data), rep=100, cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)
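The Kaiser-Guttman count itself can be read directly from the eigenvalues computed above:
sum(ev$values > 1)   # number of factors with eigenvalue greater than 1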
From http://www2.stat.unibo.it/montanari/Didattica/Multivariate/FA_lab.pdf
The next file shows correlation coefficients between subject scores for a sample of 220 boys.
sub_marks<-read.csv("sub_marks.csv",header=TRUE,sep=";")
sub_marks
X Gaelic English History Arithmetic Algebra Geometry
1 Gaelic 1.00 0.44 0.41 0.29 0.33 0.25
2 English 0.44 1.00 0.35 0.35 0.32 0.33
3 History 0.41 0.35 1.00 0.16 0.19 0.18
4 Arithmetic 0.29 0.35 0.16 1.00 0.59 0.47
5 Algebra 0.33 0.32 0.19 0.59 1.00 0.46
6 Geometry 0.25 0.33 0.18 0.47 0.46 1.00
Each subject score is positively correlated with each of the scores in the other subjects, indicating that there is a general tendency for those who do well in one subject to do well in others. The highest correlations are between the three mathematical subjects and to a slightly lesser extent, between the three humanities subjects, suggesting that there is more in common within each of these two groups than between them.
In order to reduce the dimension of the problem and to explain the observed correlations through some related latent factors we fit a factor model using the principal factor method.
First of all we need to compute an initial estimate of the communalities by calculating the multiple correlation coefficient of each variable with the remaining ones. We obtain it as a function of the diagonal elements of the inverse correlation matrix.
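Concretely, if \( r^{ii} \) is the \( i \)-th diagonal element of \( R^{-1} \), the initial communality estimate is the squared multiple correlation \[ \hat{h}_i^2 = 1 - \frac{1}{r^{ii}} \] which is what the next lines compute.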
R <- as.matrix(sub_marks[,-1])
icm <- solve(R)
and then estimate the communalities
h2.zero <- round(1 -1/(diag(icm)), 2)
h2.zero
[1] 0.30 0.30 0.21 0.42 0.41 0.29
Now we can compute the reduced correlation matrix by substituting the estimated communalities into the diagonal elements (the 1's) of the original correlation matrix.
R.psi <- R
for (i in 1:nrow(R.psi)){
R.psi[i,i] <- h2.zero[i]
}
R.psi
Gaelic English History Arithmetic Algebra Geometry
[1,] 0.30 0.44 0.41 0.29 0.33 0.25
[2,] 0.44 0.30 0.35 0.35 0.32 0.33
[3,] 0.41 0.35 0.21 0.16 0.19 0.18
[4,] 0.29 0.35 0.16 0.42 0.59 0.47
[5,] 0.33 0.32 0.19 0.59 0.41 0.46
[6,] 0.25 0.33 0.18 0.47 0.46 0.29
R.psi is still square and symmetric, but it is not positive definite. Its decomposition through the spectral theorem shows that only two eigenvalues are positive. This means that two factors might be enough in order to explain the observed correlations.
eigen(R.psi)
$values
[1] 2.0669 0.4319 -0.2019 -0.1731 -0.1199 -0.0739
$vectors
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -0.3895 -0.4631 0.5199 -0.49657 -0.08042 0.3323
[2,] -0.4068 -0.3305 -0.3985 -0.02029 -0.57630 -0.4837
[3,] -0.2898 -0.5460 -0.1642 0.56060 0.51252 0.1180
[4,] -0.4659 0.4148 0.4151 0.55776 -0.32995 0.1362
[5,] -0.4674 0.3628 -0.5745 -0.27396 0.12767 0.4780
[6,] -0.4040 0.2729 0.2039 -0.22930 0.52304 -0.6282
The estimate of the factor loading matrix will then be obtained as \( L = \Gamma L_1^{1/2} \), where \( \Gamma \) has the first two eigenvectors as its columns and \( L_1 \) is a diagonal matrix with the first two eigenvalues on its diagonal.
L1 <- diag(eigen(R.psi)$values[1:2])
L1
[,1] [,2]
[1,] 2.067 0.0000
[2,] 0.000 0.4319
Gamma <- eigen(R.psi)$vectors[,1:2]
Gamma
[,1] [,2]
[1,] -0.3895 -0.4631
[2,] -0.4068 -0.3305
[3,] -0.2898 -0.5460
[4,] -0.4659 0.4148
[5,] -0.4674 0.3628
[6,] -0.4040 0.2729
We can now compute the estimated factor loadings
L <- Gamma %*% sqrt(L1)
rownames(L) <- names(sub_marks)[-1] # name the rows for easier reading
L
[,1] [,2]
Gaelic -0.5600 -0.3044
English -0.5849 -0.2172
History -0.4167 -0.3588
Arithmetic -0.6699 0.2726
Algebra -0.6720 0.2384
Geometry -0.5808 0.1793
The first factor seems to measure overall ability in the six subjects, while the second contrasts humanities and mathematics subjects.
Communalities are, for each variable, the part of its variance that is explained by the common factors. To estimate the communalities we sum the squares of the factor loadings for each subject:
communality <- diag(L%*%t(L))
communality
Gaelic English History Arithmetic Algebra Geometry
0.4062 0.3892 0.3024 0.5231 0.5085 0.3694
This shows, for example, that 40% of the variance in Gaelic scores is explained by the two common factors. Of course, the larger the communality, the better the variable serves as an indicator of the associated factors.
To evaluate the goodness of fit of this model we can compute the residual correlation matrix \( R - L L^T \)
R - L%*%t(L)
Gaelic English History Arithmetic Algebra Geometry
[1,] 0.593810 0.04640 0.067473 -0.002139 0.026236 -0.020640
[2,] 0.046401 0.61077 0.028381 0.017423 -0.021272 0.029273
[3,] 0.067473 0.02838 0.697631 -0.021288 -0.004469 0.002358
[4,] -0.002139 0.01742 -0.021288 0.476943 0.074830 0.032069
[5,] 0.026236 -0.02127 -0.004469 0.074830 0.491540 0.026954
[6,] -0.020640 0.02927 0.002358 0.032069 0.026954 0.630550
Since the elements outside the diagonal are fairly small and close to zero, we can conclude that the model fits the data adequately.
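As a quick numeric check of that claim, we can look at the largest off-diagonal residual:
resid.m <- R - L %*% t(L)
max(abs(resid.m[upper.tri(resid.m)]))   # about 0.075, the Arithmetic-Algebra residual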
The following correlation matrix comes from the scores of 75 students on 10 different intelligence tests.
cor.m <- as.matrix(read.csv("qi_test.csv",header=FALSE,sep=";"))
cor.m
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
[1,] 1.000 0.755 0.592 0.532 0.627 0.460 0.407 0.387 0.461 0.459
[2,] 0.755 1.000 0.644 0.520 0.617 0.497 0.511 0.417 0.406 0.583
[3,] 0.592 0.644 1.000 0.388 0.529 0.449 0.436 0.428 0.412 0.602
[4,] 0.532 0.528 0.388 1.000 0.475 0.442 0.280 0.214 0.361 0.424
[5,] 0.627 0.617 0.529 0.475 1.000 0.398 0.373 0.372 0.350 0.433
[6,] 0.460 0.497 0.449 0.442 0.398 1.000 0.545 0.446 0.366 0.575
[7,] 0.407 0.511 0.436 0.280 0.373 0.545 1.000 0.542 0.308 0.590
[8,] 0.387 0.417 0.428 0.214 0.372 0.446 0.542 1.000 0.375 0.654
[9,] 0.461 0.406 0.412 0.361 0.355 0.366 0.308 0.375 1.000 0.502
[10,] 0.459 0.583 0.602 0.424 0.433 0.575 0.590 0.654 0.502 1.000
By looking at the correlation matrix one can see a strong correlation between the 10 tests: all the correlation values are positive and mostly vary between 0.4 and 0.6.
Let's perform factor analysis using a maximum likelihood approach:
res <- factanal(covmat=cor.m, factors=2, n.obs=75, rotation="none")
res
Call:
factanal(factors = 2, covmat = cor.m, n.obs = 75, rotation = "none")
Uniquenesses:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
0.215 0.249 0.452 0.622 0.482 0.553 0.534 0.481 0.679 0.177
Loadings:
Factor1 Factor2
[1,] 0.789 -0.403
[2,] 0.834 -0.234
[3,] 0.740
[4,] 0.587 -0.185
[5,] 0.676 -0.247
[6,] 0.654 0.140
[7,] 0.641 0.235
[8,] 0.630 0.351
[9,] 0.564
[10,] 0.807 0.414
Factor1 Factor2
SS loadings 4.872 0.685
Proportion Var 0.487 0.069
Cumulative Var 0.487 0.556
Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 16.51 on 26 degrees of freedom.
The p-value is 0.923
Let's compute the percentage of variability in each variable that is explained by the model (the communalities):
round( apply(res$loadings^2, 1, sum), 3) # sum each row squared
[1] 0.785 0.751 0.548 0.378 0.518 0.447 0.466 0.519 0.321 0.823
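As a sanity check, communalities and uniquenesses should add up to (approximately) one:
round(apply(res$loadings^2, 1, sum) + res$uniquenesses, 3)   # all values close to 1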
Now let's rotate the factors with varimax. Such a rotation works on the factor loadings, increasing the contrast between the smaller weights (pushing them towards zero) and the larger weights (pushing them towards one).
res.rot<-factanal(covmat=cor.m, factors=2, n.obs=75, rotation="varimax")
res.rot
Call:
factanal(factors = 2, covmat = cor.m, n.obs = 75, rotation = "varimax")
Uniquenesses:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
0.215 0.249 0.452 0.622 0.482 0.553 0.534 0.481 0.679 0.177
Loadings:
Factor1 Factor2
[1,] 0.852 0.245
[2,] 0.769 0.399
[3,] 0.563 0.481
[4,] 0.555 0.266
[5,] 0.662 0.281
[6,] 0.382 0.549
[7,] 0.308 0.609
[8,] 0.220 0.686
[9,] 0.375 0.424
[10,] 0.307 0.854
Factor1 Factor2
SS loadings 2.904 2.653
Proportion Var 0.290 0.265
Cumulative Var 0.290 0.556
Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 16.51 on 26 degrees of freedom.
The p-value is 0.923
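Note that the rotation only redistributes the explained variance between the two factors (0.290 + 0.265 instead of 0.487 + 0.069); the uniquenesses, the communalities and the chi-square test are unchanged. A quick check:
round(apply(res.rot$loadings^2, 1, sum), 3)   # same communalities as the unrotated solution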