The cvmatrix package implements the fast cross-validation algorithms by Engstrøm and Jensen [1] for computation of training set $\mathbf{X}^{\mathbf{T}}\mathbf{X}$ and $\mathbf{X}^{\mathbf{T}}\mathbf{Y}$ based on training set statistical moments.
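The core idea behind such fast cross-validation algorithms can be illustrated in plain NumPy: the global moment matrix is computed once, and each fold's training set moments are obtained by subtracting the validation fold's contribution rather than recomputing from the training rows. This is a minimal sketch of that identity, not the package's implementation (which additionally handles centering, scaling, and weights):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 5))
folds = np.arange(100) % 5  # 5-fold cross-validation

XTX_total = X.T @ X  # Computed once for the full data set.

for fold in np.unique(folds):
    val = folds == fold
    # Fast route: subtract the validation fold's contribution
    # from the precomputed global moment matrix.
    XTX_fast = XTX_total - X[val].T @ X[val]
    # Naive route: recompute from the training rows.
    XTX_naive = X[~val].T @ X[~val]
    assert np.allclose(XTX_fast, XTX_naive)
```

The naive route costs a full matrix product per fold, whereas the fast route only touches the (smaller) validation fold.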
For an implementation of the fast cross-validation algorithms combined with Improved Kernel Partial Least Squares [2], see the Python package ikpls
by Engstrøm et al. [3].
The cvmatrix software package now also features weighted matrix products.
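With sample weights $\mathbf{w}$, the weighted products take the form $\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{X}$ and $\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{Y}$ with $\mathbf{W} = \mathrm{diag}(\mathbf{w})$. A hedged NumPy sketch of these products (illustrative only, not the package's internals):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(size=(100, 5))
Y = rng.uniform(size=(100, 2))
w = rng.uniform(size=100) + 0.1  # Non-negative sample weights.

# X^T W X and X^T W Y with W = diag(w); broadcasting the weights
# row-wise avoids forming the 100 x 100 diagonal matrix W explicitly.
XTWX = X.T @ (w[:, None] * X)
XTWY = X.T @ (w[:, None] * Y)

assert np.allclose(XTWX, X.T @ np.diag(w) @ X)
```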
Install the package for Python3 using the following command:

```shell
pip3 install cvmatrix
```
Now you can import the class implementing all the algorithms with:

```python
from cvmatrix.cvmatrix import CVMatrix
```
```python
import numpy as np

from cvmatrix.cvmatrix import CVMatrix

N = 100  # Number of samples.
K = 50  # Number of features.
M = 10  # Number of targets.

X = np.random.uniform(size=(N, K))  # Random X data
Y = np.random.uniform(size=(N, M))  # Random Y data
folds = np.arange(100) % 5  # 5-fold cross-validation

# Weights must be non-negative and the sum of weights for any training partition must
# be greater than zero.
weights = np.random.uniform(size=(N,)) + 0.1

# Instantiate CVMatrix
cvm = CVMatrix(
    folds=folds,
    center_X=True,  # Center around the weighted mean of X.
    center_Y=True,  # Center around the weighted mean of Y.
    scale_X=True,  # Scale by the weighted standard deviation of X.
    scale_Y=True,  # Scale by the weighted standard deviation of Y.
)

# Fit on X and Y
cvm.fit(X=X, Y=Y, weights=weights)

# Compute training set XTWX and/or XTWY for each fold
for fold in cvm.folds_dict:
    # Get both XTWX, XTWY, and weighted statistics
    result = cvm.training_XTX_XTY(fold)
    (training_XTWX, training_XTWY) = result[0]
    (training_X_mean, training_X_std, training_Y_mean, training_Y_std) = result[1]

    # Get only XTWX and weighted statistics for X.
    # Weighted statistics for Y are returned as None as they are not computed when
    # only XTWX is requested.
    result = cvm.training_XTX(fold)
    training_XTWX = result[0]
    (training_X_mean, training_X_std, training_Y_mean, training_Y_std) = result[1]

    # Get only XTWY and weighted statistics
    result = cvm.training_XTY(fold)
    training_XTWY = result[0]
    (training_X_mean, training_X_std, training_Y_mean, training_Y_std) = result[1]
```
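For reference, the weighted means and standard deviations returned above correspond to column-wise weighted statistics like the following NumPy sketch (the exact convention used by cvmatrix, e.g. biased versus unbiased standard deviation, may differ, so treat this as illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(size=(100, 5))
w = rng.uniform(size=100) + 0.1  # Non-negative sample weights.

# Weighted column means and (biased) weighted standard deviations.
X_mean = np.average(X, axis=0, weights=w)
X_std = np.sqrt(np.average((X - X_mean) ** 2, axis=0, weights=w))

# Column-wise weighted centering and scaling of the data.
X_proc = (X - X_mean) / X_std

# The processed columns have weighted mean 0 and weighted variance 1.
assert np.allclose(np.average(X_proc, axis=0, weights=w), 0.0)
assert np.allclose(np.average(X_proc ** 2, axis=0, weights=w), 1.0)
```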
In examples, you will find usage examples.
In benchmarks, we compare cross-validation using the fast algorithms in cvmatrix
against the baseline algorithms implemented in NaiveCVMatrix.
Left: Benchmarking cross-validation with the CVMatrix implementation versus the baseline implementation using three common combinations of (column-wise) centering and scaling. Right: Benchmarking cross-validation with the CVMatrix implementation for all possible combinations of (column-wise) centering and scaling. Here, most of the graphs lie on top of each other. In general, no preprocessing is faster than centering which, in turn, is faster than scaling.
To contribute, please read the Contribution Guidelines.
- Engstrøm, O.-C. G. and Jensen, M. H. (2025). Fast partition-based cross-validation with centering and scaling for $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$. Journal of Chemometrics, 39(3).
- Dayal, B. S. and MacGregor, J. F. (1997). Improved PLS algorithms. Journal of Chemometrics, 11(1), 73-85.
- Engstrøm, O.-C. G. and Dreier, E. S. and Jespersen, B. M. and Pedersen, K. S. (2024). IKPLS: Improved Kernel Partial Least Squares and Fast Cross-Validation Algorithms for Python with CPU and GPU Implementations Using NumPy and JAX. Journal of Open Source Software, 9(99).
- Until May 31st, 2025, this work was carried out as part of an industrial Ph.D. project receiving funding from FOSS Analytical A/S and The Innovation Fund Denmark. Grant number 1044-00108B.
- From June 1st, 2025, onward, this work is sponsored by FOSS Analytical A/S.