Skip to content

Commit 70b2a0e

Browse files
authored
Weighted matrix products (#3)
Added support for computation of weighted matrix products including all 16 (12 unique) combinations of centering and scaling of XTWX and XTWY. All combinations have the same asymptotic time and space complexity as the original, unweighted alternatives.
1 parent e027330 commit 70b2a0e

File tree

8 files changed

+1259
-508
lines changed

8 files changed

+1259
-508
lines changed

README.md

Lines changed: 25 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -14,9 +14,12 @@
1414

1515
[![Package Status](https://github.com/Sm00thix/CVMatrix/actions/workflows/package_workflow.yml/badge.svg)](https://github.com/Sm00thix/CVMatrix/actions/workflows/package_workflow.yml)
1616

17-
The [`cvmatrix`](https://pypi.org/project/cvmatrix/) package implements the fast cross-validation algorithms by Engstrøm [[1]](#references) for computation of training set $\mathbf{X}^{\mathbf{T}}\mathbf{X}$ and $\mathbf{X}^{\mathbf{T}}\mathbf{Y}$ in a cross-validation setting. In addition to correctly handling arbitrary row-wise pre-processing, the algorithms allow for and efficiently and correctly handle any combination of column-wise centering and scaling of `X` and `Y` based on training set statistical moments.
17+
The [`cvmatrix`](https://pypi.org/project/cvmatrix/) package implements the fast cross-validation algorithms by Engstrøm and Jensen [[1]](#references) for computation of training set $\mathbf{X}^{\mathbf{T}}\mathbf{X}$ and $\mathbf{X}^{\mathbf{T}}\mathbf{Y}$ in a cross-validation setting. In addition to correctly handling arbitrary row-wise pre-processing, the algorithms allow for and efficiently and correctly handle any combination of column-wise centering and scaling of `X` and `Y` based on training set statistical moments.
1818

19-
For an implementation of the fast cross-validation algorithms combined with Improved Kernel Partial Least Squares [[2]](#references), see the Python package [`ikpls`](https://pypi.org/project/ikpls/).
19+
For an implementation of the fast cross-validation algorithms combined with Improved Kernel Partial Least Squares [[2]](#references), see the Python package [`ikpls`](https://pypi.org/project/ikpls/) by Engstrøm et al. [[3]](#references).
20+
21+
## NEW IN 2.0.0: Weighted CVMatrix
22+
The `cvmatrix` software package now also features **weigthed matrix produts** $\mathbf{X}^{\mathbf{T}}\mathbf{W}\mathbf{Y}$ **without increasing time or space complexity compared to the unweighted case**. This is due to a generalization of the algorithms by Engstrøm and Jensen [[1]](#references). A new article formally describing the generalization is to be announced.
2023

2124
## Installation
2225

@@ -46,23 +49,27 @@ For an implementation of the fast cross-validation algorithms combined with Impr
4649
> Y = np.random.uniform(size=(N, M)) # Random Y data
4750
> folds = np.arange(100) % 5 # 5-fold cross-validation
4851
>
52+
> # Weights must be non-negative and the sum of weights for any training partition must
53+
> # be greater than zero.
54+
> weights = np.random.uniform(size=(N,)) + 0.1
55+
>
4956
> # Instantiate CVMatrix
5057
> cvm = CVMatrix(
5158
> folds=folds,
52-
> center_X=True,
53-
> center_Y=True,
54-
> scale_X=True,
55-
> scale_Y=True,
59+
> center_X=True, # Cemter around the weighted mean of X.
60+
> center_Y=True, # Cemter around the weighted mean of Y.
61+
> scale_X=True, # Scale by the weighted standard deviation of X.
62+
> scale_Y=True, # Scale by the weighted standard deviation of Y.
5663
> )
5764
> # Fit on X and Y
58-
> cvm.fit(X=X, Y=Y)
59-
> # Compute training set XTX and/or XTY for each fold
65+
> cvm.fit(X=X, Y=Y, weights=weights)
66+
> # Compute training set XTWX and/or XTWY for each fold
6067
> for fold in cvm.folds_dict.keys():
61-
> # Get both XTX and XTY
68+
> # Get both XTWX and XTWY
6269
> training_XTX, training_XTY = cvm.training_XTX_XTY(fold)
63-
> # Get only XTX
70+
> # Get only XTWX
6471
> training_XTX = cvm.training_XTX(fold)
65-
> # Get only XTY
72+
> # Get only XTWY
6673
> training_XTY = cvm.training_XTY(fold)
6774
6875
### Examples
@@ -87,5 +94,10 @@ Guidelines](https://github.com/Sm00thix/CVMatrix/blob/main/CONTRIBUTING.md).
8794
8895
## References
8996
90-
1. [Engstrøm, O.-C. G. (2024). Shortcutting Cross-Validation: Efficiently Deriving Column-Wise Centered and Scaled Training Set $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$ Without Full Recomputation of Matrix Products or Statistical Moments](https://arxiv.org/abs/2401.13185)
91-
2. [Dayal, B. S., & MacGregor, J. F. (1997). Improved PLS algorithms. *Journal of Chemometrics*, 11(1), 73-85.](https://doi.org/10.1002/(SICI)1099-128X(199701)11:1%3C73::AID-CEM435%3E3.0.CO;2-%23?)
97+
1. [Engstrøm, O.-C. G. and Jensen, M. H. (2025). Fast partition-based cross-validation with centering and scaling for $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$. *Journal of Chemometrics*, 39(3).](https://doi.org/10.1002/cem.70008)
98+
2. [Dayal, B. S. and MacGregor, J. F. (1997). Improved PLS algorithms. *Journal of Chemometrics*, 11(1), 73-85.](https://doi.org/10.1002/(SICI)1099-128X(199701)11:1%3C73::AID-CEM435%3E3.0.CO;2-%23?)
99+
3. [Engstrøm, O.-C. G. and Dreier, E. S. and Jespersen, B. M. and Pedersen, K. S. IKPLS: Improved Kernel Partial Least Squares and Fast Cross-Validation Algorithms for Python with CPU and GPU Implementations Using NumPy and JAX. *Journal of Open Source Software*, 9(99).](https://doi.org/10.21105/joss.06533)
100+
101+
## Funding
102+
- Up until May 31st 2025, this work has been carried out as part of an industrial Ph. D. project receiving funding from [FOSS Analytical A/S](https://www.fossanalytics.com/) and [The Innovation Fund Denmark](https://innovationsfonden.dk/en). Grant number 1044-00108B.
103+
- From June 1st 2025 and onward, this work is sponsored by [FOSS Analytical A/S](https://www.fossanalytics.com/).

benchmarks/benchmark.py

Lines changed: 48 additions & 48 deletions
Original file line numberDiff line numberDiff line change
@@ -10,6 +10,7 @@
1010
Author: Ole-Christian Galbo Engstrøm
1111
E-mail: ole.e@di.ku.dk
1212
"""
13+
1314
import os
1415
import sys
1516

@@ -31,42 +32,32 @@
3132

3233

3334
def save_result_to_csv(
34-
model,
35-
P,
36-
N,
37-
K,
38-
M,
39-
center_X,
40-
center_Y,
41-
scale_X,
42-
scale_Y,
43-
time,
44-
version
45-
):
35+
model, P, N, K, M, center_X, center_Y, scale_X, scale_Y, time, version
36+
):
4637
try:
4738
with open("benchmark_results.csv", "x") as f:
48-
f.write(
49-
"model,P,N,K,M,"
50-
"center_X,center_Y,scale_X,scale_Y,time,version\n"
51-
)
39+
f.write("model,P,N,K,M," "center_X,center_Y,scale_X,scale_Y,time,version\n")
5240
except FileExistsError:
5341
pass
5442
with open("benchmark_results.csv", "a") as f:
5543
f.write(
5644
f"{model},{P},{N},{K},{M},"
5745
f"{center_X},{center_Y},{scale_X},{scale_Y},"
58-
f"{time},{version}\n")
46+
f"{time},{version}\n"
47+
)
48+
5949

6050
def execute_algorithm(
61-
model_class: Union[NaiveCVMatrix, CVMatrix],
62-
cv_splits: Iterable[Hashable],
63-
center_X: bool,
64-
center_Y: bool,
65-
scale_X: bool,
66-
scale_Y: bool,
67-
X: np.ndarray,
68-
Y: np.ndarray,
69-
):
51+
model_class: Union[NaiveCVMatrix, CVMatrix],
52+
cv_splits: Iterable[Hashable],
53+
center_X: bool,
54+
center_Y: bool,
55+
scale_X: bool,
56+
scale_Y: bool,
57+
X: np.ndarray,
58+
Y: np.ndarray,
59+
weights: Union[None, np.ndarray],
60+
):
7061
"""
7162
Execute the computation of the training set matrices
7263
:math:`\mathbf{X}^{\mathbf{T}}\mathbf{X}`
@@ -80,7 +71,7 @@ def execute_algorithm(
8071
8172
cv_splits : Iterable[Hashable]
8273
The cross-validation splits.
83-
74+
8475
center_X : bool
8576
Whether to center `X`.
8677
@@ -98,11 +89,14 @@ def execute_algorithm(
9889
9990
Y : np.ndarray
10091
The target matrix with shape (N, M).
92+
93+
weights : Union[None, np.ndarray]
94+
The weights for the samples, if any. If None, no weights are used.
10195
"""
10296

10397
# Create the model
10498
model = model_class(
105-
cv_splits=cv_splits,
99+
folds=cv_splits,
106100
center_X=center_X,
107101
center_Y=center_Y,
108102
scale_X=scale_X,
@@ -112,31 +106,34 @@ def execute_algorithm(
112106
)
113107

114108
# Fit the model
115-
model.fit(X, Y)
109+
model.fit(X, Y, weights)
116110

117111
# Compute the training set matrices
118-
for fold in model.val_folds_dict.keys():
112+
for fold in model.folds_dict.keys():
119113
model.training_XTX_XTY(fold)
120114

121-
if __name__ == '__main__':
122-
seed = 42 # Seed for reproducibility
115+
116+
if __name__ == "__main__":
117+
seed = 42 # Seed for reproducibility
123118
rng = np.random.default_rng(seed=seed)
124-
N = 100000 # 100k samples
125-
K = 500 # 500 features
126-
M = 10 # 10 targets
127-
dtype = np.float64 # Data type
128-
X = rng.random((N, K), dtype=dtype) # Random X matrix
129-
Y = rng.random((N, M), dtype=dtype) # Random Y matrix
130-
cv_splits = np.arange(N) # We can use mod P for P-fold cross-validation
119+
N = 100000 # 100k samples
120+
K = 500 # 500 features
121+
M = 10 # 10 targets
122+
dtype = np.float64 # Data type
123+
X = rng.random((N, K), dtype=dtype) # Random X matrix
124+
Y = rng.random((N, M), dtype=dtype) # Random Y matrix
125+
# weights = rng.random((N,), dtype=dtype) # Random weights
126+
weights = None
127+
cv_splits = np.arange(N) # We can use mod P for P-fold cross-validation
131128
center_Xs = [True, False]
132129
center_Ys = [True, False]
133130
scale_Xs = [True, False]
134131
scale_Ys = [True, False]
135132
Ps = [3, 5, 10, 100, 1000, 10000, 100000]
136133

137134
for center_X, center_Y, scale_X, scale_Y, P in product(
138-
center_Xs, center_Ys, scale_Xs, scale_Ys, Ps
139-
):
135+
center_Xs, center_Ys, scale_Xs, scale_Ys, Ps
136+
):
140137
print(
141138
f"P={P}, "
142139
f"center_X={center_X}, center_Y={center_Y}, "
@@ -152,8 +149,9 @@ def execute_algorithm(
152149
scale_Y=scale_Y,
153150
X=X,
154151
Y=Y,
152+
weights=weights,
155153
),
156-
number=1
154+
number=1,
157155
)
158156
print(f"CVMatrix, Time: {time:.2f} seconds")
159157
save_result_to_csv(
@@ -167,12 +165,13 @@ def execute_algorithm(
167165
scale_X,
168166
scale_Y,
169167
time,
170-
__version__
168+
__version__,
171169
)
172170

173-
if (center_X == center_Y == scale_X == scale_Y or
174-
center_X == center_Y == True and
175-
scale_X == scale_Y == False
171+
if (
172+
center_X == center_Y == scale_X == scale_Y
173+
or center_X == center_Y == True
174+
and scale_X == scale_Y == False
176175
):
177176
time = timeit(
178177
stmt=lambda: execute_algorithm(
@@ -184,8 +183,9 @@ def execute_algorithm(
184183
scale_Y=scale_Y,
185184
X=X,
186185
Y=Y,
186+
weights=weights,
187187
),
188-
number=1
188+
number=1,
189189
)
190190
print(f"NaiveCVMatrix, Time: {time:.2f} seconds")
191191
print()
@@ -200,5 +200,5 @@ def execute_algorithm(
200200
scale_X,
201201
scale_Y,
202202
time,
203-
__version__
203+
__version__,
204204
)

cvmatrix/__init__.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1 +1 @@
1-
__version__ = '1.0.2.post2'
1+
__version__ = "2.0.0"

0 commit comments

Comments
 (0)