
Classification with Parallel Factor Analysis
cpfa.RdFits Richard A. Harshman's Parallel Factor Analysis-1 (Parafac) model or Parallel Factor Analysis-2 (Parafac2) model to a three-way or four-way data array. Also fits a principal component analysis model (PCA) to a two-way matrix. Allows for different constraint options on multiple tensor modes. For PCA, allows for either an unrotated or varimax rotated solution. Uses component weights from a single mode of the selected component model as predictors to tune parameters for one or more classification methods via a k-fold cross-validation procedure. Allows for additional features to be included. Predicts class labels and calculates multiple performance measures for binary or multiclass classification across multiple replications with different train-test splits. Provides descriptive statistics to pool output across replications.
Usage
cpfa(x, y, z = NULL, model = c("parafac", "parafac2", "pca"), nfac = 1,
nrep = 5, ratio = 0.8, nfolds = 10, align = FALSE,
method = c("PLR", "SVM", "RF", "NN", "RDA", "GBM"),
family = c("binomial", "multinomial"), parameters = list(),
type.out = c("measures", "descriptives"), foldid = NULL, prior = NULL,
cmode = NULL, seeds = NULL, plot.out = FALSE, plot.measures = NULL,
parallel = FALSE, cl = NULL, verbose = TRUE, compscale = TRUE,
pcarot = c("unrotated", "varimax"), ...)Arguments
- x
A three-way or four-way data array. For Parafac2, can be a list where each element is a matrix or three-way array. Array or list must contain only real numbers. See note below. For PCA, can be a two-way matrix, a three-way array, or a four-way array. If a three-way or four-way array is provided for PCA, array is unfolded across all modes except the classification mode. Note that if a three-way or four-way array for PCA, classification mode must be the last mode.
- y
A vector containing at least two unique class labels. Should be a factor that contains two or more levels. For binary case, ensure the order of factor levels (left to right) is such that negative class is first and positive class is second. Levels should be sequential integers starting from 0, 1, ..., etc.
- z
A matrix containing additional features for classification. Must have number of rows equal to the number of class labels (i.e., equal to the length of
y).- model
Character designating the component model to use, including:
model = "parafac"to fit the Parafac model,model = "parafac2"to fit the Parafac2 model, ormodel = "pca"to fit the PCA model.- nfac
Vector containing integers that specify the number of components for each component model to fit. Default is
nfac = 1.- nrep
Number of replications to repeat the procedure. Default is
nrep = 5.- ratio
Split ratio for dividing data into train and test sets. Default is
ratio = 0.8.- nfolds
Numeric value specifying number of folds for k-fold cross-validation. Must be 2 or greater. Default is
nfolds = 10.- align
Logical value indicating whether to solve the permutation (order) indeterminacy for Parafac or Parafac2 models fit across train-test splits. Uses Tucker congruence coefficients to compare model weights across splits, and uses the Hungarian algorithm to identify the order that matches best. Implements the algorithm through the function
solve_LSAPfrom the clue package. Finds the best target model among all splits, and aligns other models to that target. Repeats for each value innfac. Only performs permutation alignment fornfacvalues of 2 or more.- method
Character vector indicating classification methods to use. Possible methods include penalized logistic regression (PLR); support vector machine (SVM); random forest (RF); feed-forward neural network (NN); regularized discriminant analysis (RDA); and gradient boosting machine (GBM). If none are selected, default is to use all methods with
method = c("PLR", "SVM", "RF", "NN", "RDA", "GBM").- family
Character value specifying binary classification (
family = "binomial") or multiclass classification (family = "multinomial"). If not provided, number of levels of inputyis used, where two levels is binary, and where three or more levels is multiclass.- parameters
List containing arguments related to classification methods. When specified, must contain one or more of the following:
- alpha
Values for penalized logistic regression alpha parameter; default is
alpha = seq(0, 1, length = 6). Must be numeric and contain only real numbers between 0 and 1, inclusive.- lambda
Optional user-supplied lambda sequence for
cv.glmnetfor penalized logistic regression. Default is NULL.- cost
Values for support vector machine cost parameter; default is
cost = c(1, 2, 4, 8, 16, 32, 64). Must be numeric and contain only real numbers greater than 0.- gamma
Values for support vector machine gamma parameter; default is
gamma = c(0, 0.01, 0.1, 1, 10, 100, 1000). Must be numeric and greater than or equal to 0.- ntree
Values for random forest number of trees parameter; default is
ntree = c(100, 200, 400, 600, 800, 1600, 3200). Must be numeric and contain only integers greater than or equal to 1.- nodesize
Values for random forest node size parameter; default is
nodesize = c(1, 2, 4, 8, 16, 32, 64). Must be numeric and contain only integers greater than or equal to 1.- size
Values for neural network size parameter; default is
size = c(1, 2, 4, 8, 16, 32, 64). Must be numeric and contain only integers greater than or equal to 0.- decay
Values for neural network decay parameter; default is
decay = c(0.001, 0.01, 0.1, 1, 2, 4, 8, 16). Must be numeric and contain only real numbers.- rda.alpha
Values for regularized discriminant analysis alpha parameter; default is
rda.alpha = seq(0, 0.999, length = 6). Must be numeric and contain only real numbers between 0 (inclusive) and 1 (exclusive).- delta
Values for regularized discriminant analysis delta parameter; default is
delta = c(0, 0.1, 1, 2, 3, 4). Must be numeric and contain only real numbers greater than or equal to 0.- eta
Values for gradient boosting machine eta parameter; default is
eta = c(0.1, 0.3, 0.5, 0.7, 0.9). Must be numeric and contain only real numbers greater than 0 and less than 1.- max.depth
Values for gradient boosting machine max.depth parameter; default is
max.depth = c(1, 2, 3, 4). Must be numeric and contain only integers greater than or equal to 1.- subsample
Values for gradient boosting machine subsample parameter; default is
subsample = c(0.6, 0.7, 0.8, 0.9). Must be numeric and contain only real numbers greater than 0 and less than or equal to 1.- nrounds
Values for gradient boosting machine nrounds parameter; default is
nrounds = c(100, 200, 300, 500). Must be numeric and contain only integers greater than or equal to 1.
- type.out
Type of output desired:
type.out = "measures"returns an array containing classification performance measures for all replications whiletype.out = "descriptives"returns the list of descriptive statistics calculated across all replications for each performance measure. Both options also provide the estimated training weights and classification weights. Defaults totype.out = "descriptives".- foldid
List containing, for each replication, an integer vector. Should have length equal to 'nrep'. Each list element contains fold IDs for k-fold cross-validation. If not provided, fold IDs are generated randomly for the number of folds
nfolds.- prior
Prior probabilities of class membership. If specified, the probabilities should be in the order of the factor levels of input
y. If unspecified, the observed class proportions for inputyare used. Based onprior, inverse probability weights are calculated to account for class imbalance. Note that RF and RDA ignorepriorand use uniform priors to handle imbalance.- cmode
Integer value of 1, 2, or 3 (or 4 if
xis a four-way array) specifying the mode whose component weights will be predictors for classification. Defaults to the last mode of the input array (i.e., defaults to 3 for a three-way array, and to 4 for a four-way array). Ifmodel = "parafac2", the last mode will be used. Ifmodel = "pca", and if inputxis a matrix, cmode can be 1 or 2; if not provided, defaults to 1. However, ifmodel = "pca", and if inputxis a 3-way or 4-way array,cmodemust be the last mode when supplied.- seeds
Random seeds to be associated with each replication. Default is
seeds = 1:nrep.- plot.out
Logical indicating whether to output one or more box plots of classification performance measures that are plotted across classification methods and number of components.
- plot.measures
Character vector containing values that specify for plotting one or more of 11 possible classification performance measures. Only relevant when
plot.out = TRUE. Should contain one or more of the following labels:c("err", "acc", "tpr", "fpr", "tnr", "fnr", "ppv", "npv", "fdr", "fom", "fs"). A box plot will be created for each measure that is specified, summarizing output across replications. Note that additional information about each label is available in the Details section of the help file for functioncpm. If NULL, defaults to classification accuracy.- parallel
Logical indicating if parallel computing should be implemented. Defaults to FALSE, which implements sequential computing.
- cl
Cluster for parallel computing, which is used when
parallel = TRUE. Note that ifparallel = TRUEandcl = NULL, then the cluster is defined asmakeCluster(max(1L, detectCores() - 1L)).- verbose
If TRUE, progress is printed.
- compscale
Logical indicating whether to scale each column of the estimated classification component weights matrix (i.e., the features/predictors). If
TRUE, each column is scaled to have mean zero and unit variance. IfFALSE, no scaling is performed.- pcarot
Character indicating whether to apply a varimax rotation or leave PCA loadings unrotated when
modelis set topca. Ignored whenmodelis notpca. Defaults to"unrotated"if not specified.- ...
Additional arguments to be passed to function
parafacfor fitting a Parafac model or functionparafac2for fitting a Parafac2 model. Example: can impose different constraints on different modes of the input array using the argumentconst. See help file for functionparafacor for functionparafac2for additional details.
Details
Data are split into a training set and a testing set. After fitting a Parafac or
Parafac2 model with the training set using package multiway (see
parafac or parafac2 in multiway for details), or after
fitting a PCA model using the singular value decomposition, the
estimated classification mode weight matrix is passed to one or more
classification methods. The methods include: penalized logistic regression
(PLR); support vector machine (SVM); random forest (RF); feed-forward neural
network (NN); regularized discriminant analysis (RDA); and gradient boosting
machine (GBM).
Package glmnet fits models for PLR. PLR tunes penalty parameter lambda
while the elastic net parameter alpha is set by the user (see the help file
for function cv.glmnet in package glmnet). For SVM, package
e1071 is used with a radial basis kernel. Penalty parameter cost and
radial basis parameter gamma are used (see svm in package e1071).
For RF, package randomForest is used and implements Breiman's random
forest algorithm. The number of predictors sampled at each node split is set
at the default of sqrt(R), where R is the number of model components plus the
number of additional features from argument z. Two tuning parameters
allowed are ntree, the number of trees to be grown, and nodesize, the minimum
size of terminal nodes (see randomForest in package randomForest).
For NN, package nnet fits a single-hidden-layer, feed-forward neural
network model. Penalty parameters size (i.e., number of hidden layer units) and
decay (i.e., weight decay) are used (see nnet). For RDA, package
rda fits a shrunken centroids regularized discriminant analysis model.
Tuning parameters include rda.alpha, the shrinkage penalty for the within-class
covariance matrix, and delta, the shrinkage penalty of class centroids towards
the overall dataset centroid. For GBM, package xgboost fits a gradient
boosting machine model. Four tuning parameters are allowed: (1) eta, the
learning rate; (2) max.depth, the maximum tree depth; (3) subsample, the
fraction of samples per tree; and (4) nrounds, the number of boosting trees to
build.
For all six methods, k-fold cross-validation is implemented to tune
classification parameters where the number of folds is set by argument
nfolds. Separately, the trained Parafac, Parafac2, or PCA model is used
to predict the classification mode's component weights using the testing set
data. The predicted component weights and the optimized classification method
are then used to predict class labels. Finally, classification performance
measures are calculated. The process is repeated over a number of replications
with different random splits of the input array and of the class labels at each
replication.
Value
Returns an object of class wrapcpfa that either (1) contains a
three-way array with classification performance measures for each model and
for each replication, or (2) contains a list with matrices with
descriptive statistics for performance measures calculated across all
replications. Specify type.out = "measures" to output the array of
performance measures. Specify type.out = "descriptives" to output
descriptive statistics across replications. In addition, for either option,
the returned list object includes the following:
- predweights
List of predicted classification weights for each component model and for each replication.
- train.weights
List of lists of training weights for each component model and for each replication.
- opt.tune
List of optimal tuning parameters for classification methods for each component model and for each replication.
- opt.model
List of optimal models for training classification methods for each component model and for each replication.
- mean.opt.tune
Mean across all replications of optimal tuning parameters for classification methods for each model.
- X
Two-way matrix, or three-way or four-way data array or list used in argument
x. Ifxwas a three-way or four-way array, and ifmodelwaspca, returns the flattened two-way matrix.- y
Vector of class labels used in input argument
y.- z
Matrix of additional features used in input argument
z.- nfac
Number of components used to fit each component model.
- model
Character designating the component model that was used, including:
model = "parafac"for the Parafac model,model = "parafac2"for the Parafac2 model, ormodel = "pca"for the PCA model.- method
Classification methods used.
- const
Constraints used in fitting the component model. When
modelwaspca, contains the value used in inputpcarotto indicate whether a rotation was applied to the solution.- cmode
Integer value used to specify the mode whose component weights were predictors for classification.
- family
Character value used to specify binary classification (
family = "binomial") or multiclass classification (family = "multinomial").- lxdim
Integer specifying the number of modes of the output
x. Identical to the number of modes of inputx, except when inputxwas a 3-way or 4-way array and whenmodelwaspca. In this case,lxdimis 2.- trainIDs
List where each element contains the vector of integer IDs that specify the rows/observations of the classification mode assigned to the train set for a given replication.
- testIDs
List where each element contains the vector of integer IDs that specify the rows/observations of the classification mode assigned to the test set for a given replication.
- flattened
Logical indicating whether input
xwas flattened into a two-way matrix. Only TRUE whenmodelwaspcaand inputxwas a 3-way or 4-way array.- targetmod
If argument
align = TRUE, a vector containing the identified target Parafac or Parafac2 model (split), for eachnfacvalue greater than one, to which all other models were aligned. Ifnfacis 1, NA is returned. Ifalign = FALSE, returns NULL instead of a list.- changeorders
List of length equal to the length of input
nfac. Each element contains a matrix whose first column contains integers indicating the split/model that was aligned (excludes the integer indicating the target split/model). Thus, fornrepsplits, each matrix hasnrep - 1rows. Subsequent columns contain the order identified by the Hungarian algorithm to match a given model to the target. Thus, for any givennfacvalue, each matrix hasnfac + 1columns. Ifnfacis 1, returns NULL instead of a matrix. Ifalign = FALSE, returns NULL instead of a list.- tccb
List of length equal to the length of input
nfac. Each element contains a matrix whose first column contains integers indicating the split/model that was aligned (excludes the integer indicating the target split/model). Thus, fornrepsplits, each matrix hasnrep - 1rows. Subsequent columns contain Tucker congruence coefficients between a given split/model and the target model for each component for the B mode (second mode) weights. Thus, for any givennfacvalue, each matrix hasnfac + 1columns. Ifnfacis 1, returns NULL instead of a matrix. Ifalign = FALSE, returns NULL instead of a list.- aligned
Logical indicating whether input
alignwas set to TRUE or FALSE.
Note
For Parafac and Parafac2, if argument cmode is not null, input array
x is reshaped with function aperm such that the cmode
dimension of x is ordered last. Estimated mode A and B (and mode C for a
four-way array) weights that are outputted as Aweights and
Bweights (and Cweights) reflect this permutation. For example, if
x is a four-way array and cmode = 2, the original input modes 1,
2, 3, and 4 will correspond to output modes 1, 3, 4, 2. Here, output A = input
1; B = 3, and C = 4 (i.e., the second mode specified by cmode has been
moved to the D mode/last mode). For model = "parafac2", classification
mode is assumed to be the last mode (i.e., mode C for three-way array and mode D
for four-way array). For PCA, if argument cmode is not NULL, and if
x is a 3-way or 4-way array, the array is reshaped with aperm such
that the cmode dimension of x is ordered first. Then, the array is
unfolded into a two-way matrix. If x is already input as a two-way
matrix, the matrix is reshaped if cmode is 2, such that the matrix is
transposed. Note that for PCA, Aweights contains the PCA model loadings.
In addition, note that the following combination of arguments will give an
error: nfac = 1, family = "multinomial", method = "PLR". The issue
arises from providing glmnet::cv.glmnet input x an input matrix
that has a single column. The issue is resolved for family = "binomial"
because a column of 0s is appended to the single column, but this solution
does not appear to work for the multiclass case. As such, this combination of
arguments is not currently allowed. Future updates are planned to resolve this
issue.
Applications of this function to real datasets can be explored at the following repository: https://github.com/matthewasisgress/multiway-classification/.
References
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
Chen, T., He, T., Benesty, M., Khotilovich, V., Tang, Y., Cho, H., Chen, K., Mitchell, R., Cano, I., Zhou, T., Li, M., Xie, J., Lin, M., Geng, Y., Li, Y., Yuan, J. (2025). xgboost: Extreme gradient boosting. R Package Version 1.7.9.1.
Corporation, M. and Weston S (2022). doParallel: foreach parallel adaptor for the 'parallel' package. R Package Version 1.0.17.
Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.
Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5), 1189-1232.
Friedman, J. H. (1989). Regularized discriminant analysis. Journal of the American Statistical Association, 84(405), 165-175.
Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1-22.
Gaujoux, R. (2025). doRNG: Generic reproducible parallel backend for 'foreach' loops. R Package Version 1.8.6.2.
Guo, Y., Hastie, T., and Tibshirani, R. (2007). Regularized linear discriminant analysis and its application in microarrays. Biostatistics, 8(1), 86-100.
Guo, Y., Hastie, T., and Tibshirani, R. (2023). rda: Shrunken centroids regularized discriminant analysis. R Package Version 1.2-1.
Harshman, R. (1970). Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multimodal factor analysis. UCLA Working Papers in Phonetics, 16, 1-84.
Harshman, R. (1972). PARAFAC2: Mathematical and technical notes. UCLA Working Papers in Phonetics, 22, 30-44.
Harshman, R. and Lundy, M. (1994). PARAFAC: Parallel factor analysis. Computational Statistics and Data Analysis, 18, 39-72.
Helwig, N. (2017). Estimating latent trends in multivariate longitudinal data via Parafac2 with functional and structural constraints. Biometrical Journal, 59(4), 783-803.
Helwig, N. (2025). multiway: Component models for multi-way data. R Package Version 1.0-7.
Hornik, K. (2005). A CLUE for CLUster ensembles. Journal of Statistical Software, 14, 1-25.
Hornik, K. (2026). clue: Cluster ensembles. R Package Version 0.3-68.
Liaw, A. and Wiener, M. (2002). Classification and regression by randomForest. R News 2(3), 18–22.
Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., and Leisch, F. (2024). e1071: Misc functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R Package Version 1.7-16.
Ripley, B. (1994). Neural networks and related methods for classification. Journal of the Royal Statistical Society: Series B (Methodological), 56(3), 409-437.
Venables, W. and Ripley, B. (2002). Modern applied statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301-320.
Examples
########## Parafac2 example with 4-way array and multiclass response ##########
if (FALSE) { # \dontrun{
# set seed
set.seed(5)
# define list of arguments specifying distributions for A and G weights
techlist <- list(distA = list(dname = "poisson",
lambda = 3), # for A weights
distG = list(dname = "gamma", shape = 2,
scale = 4)) # for G weights
# define target correlation matrix for columns of D mode weights matrix
cormat <- matrix(c(1, .6, .6, .6, 1, .6, .6, .6, 1), nrow = 3, ncol = 3)
# simulate a four-way ragged array connected to a response
data <- simcpfa(arraydim = c(10, 11, 12, 100), model = "parafac2", nfac = 3,
nclass = 3, nreps = 1e2, onreps = 10, corresp = rep(.6, 3),
meanpred = rep(2, 3), modes = 4, corrpred = cormat,
technical = techlist, smethod = "eigende")
# initialize
alpha <- seq(0, 1, length = 20)
gamma <- c(0, 1)
cost <- c(0.1, 5)
ntree <- c(200, 300)
nodesize <- c(1, 2)
size <- c(1, 2)
decay <- c(0, 1)
rda.alpha <- seq(0.1, 0.9, length = 2)
delta <- c(0.1, 2)
eta <- c(0.3, 0.7)
max.depth <- c(1, 2)
subsample <- c(0.75)
nrounds <- c(100)
method <- c("PLR", "SVM", "RF", "NN", "RDA", "GBM")
family <- "multinomial"
parameters <- list(alpha = alpha, gamma = gamma, cost = cost, ntree = ntree,
nodesize = nodesize, size = size, decay = decay,
rda.alpha = rda.alpha, delta = delta, eta = eta,
max.depth = max.depth, subsample = subsample,
nrounds = nrounds)
model <- "parafac2"
nfolds <- 10
nstart <- 10
# constrain first mode weights to be orthogonal, fourth mode to be nonnegative
const <- c("orthog", "uncons", "uncons", "nonneg")
# fit Parafac2 model and use fourth mode weights to tune classification
# methods, to predict class labels, and to return classification
# performance measures pooled across multiple train-test splits
output <- cpfa(x = data$X, y = data$y, model = model, nfac = 3,
nrep = 5, ratio = 0.9, nfolds = nfolds, method = method,
family = family, parameters = parameters,
type.out = "descriptives", seeds = NULL, plot.out = TRUE,
parallel = FALSE, const = const, nstart = nstart)
# print performance measure means across train-test splits
output$descriptive$mean
} # }