Skip to content

Conversation

@muhammadalhroob
Copy link
Contributor

This PR enhances the matrix multiplication by rearranging loops and blocking to enhance the speed (a factor of 6 for very large matrices)

This Pull request:

Changes or fixes:

Checklist:

  • [ *] tested changes locally
  • updated the docs (if necessary)

This PR fixes #
Speeds up the code

…e blocking to enhance the speed (a factor of 6 for very large matrices)
@dpiparo
Copy link
Member

dpiparo commented Nov 10, 2025

Thanks! Could we have some tests in addition to the algorithm to ensure it works?

@muhammadalhroob
Copy link
Contributor Author

muhammadalhroob commented Nov 10, 2025

I generated two 20×20 matrices using TRandom3 r(1) and verified the correctness of the resulting matrix C = A * B by printing the output. The attached slide includes the verification results, while the attached macro shows how the matrices were created and multiplied. I have also included a performance chart that highlights the observed improvements achieved through this implementation.
CompareAMultB.pdf
MatrixMultiplication
testMatrixMultiplication.C

@github-actions
Copy link

github-actions bot commented Nov 25, 2025

Test Results

    22 files      22 suites   3d 21h 3m 44s ⏱️
 3 785 tests  3 783 ✅ 0 💤 2 ❌
81 249 runs  81 247 ✅ 0 💤 2 ❌

For more details on these failures, see this check.

Results for commit ebf8d63.

♻️ This comment has been updated with latest results.

@guitargeek
Copy link
Contributor

Hi @muhammadalhroob, thank you for your contribution!

No need to write more tests, because the TMatrix class is already well tested in the CI, and some tests are failing by the way 🙂

TEST FAILURES:
416:gtest-roofit-roofitcore-testSumW2Error
454:test-stresslinear
455:test-stresslinear-interpreted
2792:roottest-root-math-smatrix-Kalman
2793:roottest-root-math-smatrix-Operations

Could you please make sure that tests are passing before we can proceed? Thanks!

@muhammadalhroob
Copy link
Contributor Author

Hi @guitargeek,
I was actually about to write to you — you were faster!
From the failing tests, it looks like the issue is coming from symmetric matrix multiplication.
Test 11 : Symmetric Matrix Multiplications.......................FAILED
I have run many local tests on systematic matrix multiplication, and everything seemed fine, so I am still trying to understand why the CI reports failures. It might be related to numerical precision, but I am not sure.

@guitargeek
Copy link
Contributor

I don't think it's related to numerics. If you look at the output of roottest-root-math-smatrix-Operations for example, you see completely different values in the test:

2025-11-25T15:06:55.4764218Z -- BEGIN ERRDIFF OUTPUT --
2025-11-25T15:06:55.4791591Z --- /github/home/ROOT-CI/src/roottest/root/math/smatrix/testOperations.ref	Tue Nov 25 13:12:10 2025
2025-11-25T15:06:55.4862367Z +++ /github/home/ROOT-CI/build/roottest/root/math/smatrix/Operations.err	Tue Nov 25 16:06:55 2025
2025-11-25T15:06:55.4862711Z @@ -1,4 +1,4 @@
2025-11-25T15:06:55.4862878Z  SMatrix:     r1 = 299.766 r2 = -17486.4
2025-11-25T15:06:55.4863072Z -TMatrix:     r1 = 299.766 r2 = -17486.4
2025-11-25T15:06:55.4863260Z +TMatrix:     r1 = 299.766 r2 = 25523.4
2025-11-25T15:06:55.4863443Z  SMatrixSym:  r1 = 6972.11
2025-11-25T15:06:55.4863613Z  TMatrixSym:  r1 = 6972.11
2025-11-25T15:06:55.4863713Z 
2025-11-25T15:06:55.4863787Z -- END ERRDIFF OUTPUT --

I would appreciate if you could look into this more, as we don't want to miss out on this performance improvement 🙂

@muhammadalhroob
Copy link
Contributor Author

muhammadalhroob commented Nov 27, 2025

Hi @guitargeek,
It turns out the smatrix test was failing because of an uninitialized variable. I have fixed that and added further optimisations for large matrices. The original ROOT implementation still uses an if block for small matrices, but I removed all pointer arithmetic. Performance for large matrices is now significantly improved (24x for 2000*2000).

@muhammadalhroob
Copy link
Contributor Author

Hi @guitargeek,

I see that 19 CI checks have passed and 7 have failed. I am not sure whether the failing checks are related to the new matrix multiplication algorithm.

Cheers,
Muhammad Alhroob

@guitargeek
Copy link
Contributor

Some of the failures are related, in particular on debian125 and debian13, where compiler warnings are treated as errors. You can see in the logs that there are now errors about unused variables:

2025-11-28T14:43:10.7387749Z /github/home/ROOT-CI/src/math/matrix/src/TMatrixT.cxx: In instantiation of ‘void TMatrixTAutoloadOps::AMultB(const Element*, Int_t, Int_t, const Element*, Int_t, Int_t, Element*) [with Element = float; Int_t = int]’:
2025-11-28T14:43:10.7522482Z /github/home/ROOT-CI/src/math/matrix/src/TMatrixT.cxx:3308:112:   required from here
2025-11-28T14:43:10.7668391Z ##[error]/github/home/ROOT-CI/src/math/matrix/src/TMatrixT.cxx:3079:114: error: unused parameter ‘nb’ [-Werror=unused-parameter]
2025-11-28T14:43:10.7676986Z  3079 | void TMatrixTAutoloadOps::AMultB(const Element *const ap, Int_t na, Int_t ncolsa, const Element *const bp, Int_t nb,
2025-11-28T14:43:10.7761705Z       |                                                                                                            ~~~~~~^~
2025-11-28T14:43:10.7764475Z /github/home/ROOT-CI/src/math/matrix/src/TMatrixT.cxx: In instantiation of ‘void TMatrixTAutoloadOps::AMultB(const Element*, Int_t, Int_t, const Element*, Int_t, Int_t, Element*) [with Element = double; Int_t = int]’:
2025-11-28T14:43:10.7766043Z /github/home/ROOT-CI/src/math/matrix/src/TMatrixT.cxx:3365:115:   required from here
2025-11-28T14:43:10.7768859Z ##[error]/github/home/ROOT-CI/src/math/matrix/src/TMatrixT.cxx:3079:114: error: unused parameter ‘nb’ [-Werror=unused-parameter]
2025-11-28T14:43:10.7770822Z cc1plus: all warnings being treated as errors
2025-11-28T14:43:10.7771659Z gmake[2]: *** [math/matrix/CMakeFiles/Matrix.dir/build.make:202: math/matrix/CMakeFiles/Matrix.dir/src/TMatrixT.cxx.o] Error 1
2025-11-28T14:43:10.7772626Z gmake[1]: *** [CMakeFiles/Makefile2:53360: math/matrix/CMakeFiles/Matrix.dir/all] Error 2
2025-11-28T14:43:10.7773263Z gmake[1]: *** Waiting for unfinished jobs....

It nb is really unused now, you can comment out its name in the function signature to mark it as unused:

void TMatrixTAutoloadOps::AMultB(const Element *const ap, Int_t na, Int_t ncolsa, const Element *const bp, Int_t /*nb*/,
                                 Int_t ncolsb, Element *cp)

@muhammadalhroob
Copy link
Contributor Author

muhammadalhroob commented Dec 2, 2025

Thanks a lot @guitargeek,
I have commented out nb and pushed again. All CI have passed except two.
Cheers

Copy link
Contributor

@guitargeek guitargeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for this nice optimization!

@guitargeek guitargeek merged commit baee12d into root-project:master Dec 3, 2025
28 of 30 checks passed
@ferdymercury
Copy link
Collaborator

ferdymercury commented Dec 3, 2025

Thanks! @muhammadalhroob could you clarify if AI did the parallel openmp implementation or if it was just used for the benchmark tests?

fyi @silverweed

@muhammadalhroob
Copy link
Contributor Author

Hi @ferdymercury,
AI suggested adding the OpenMP implementation, but when I ran the benchmarks I did not compile with the OpenMP option enabled.

@guitargeek
Copy link
Contributor

Yeah the OpenMP might be indeed problematic, because it implies multithreading that can be surprising to the user. But since it's done for a minimum entry size of 10000, I expect this not to trigger often and the risk of problems is low. So I thought it was worth to keep it.

@hahnjo
Copy link
Member

hahnjo commented Dec 4, 2025

Some comments:

  1. The OpenMP implementation is also problematic because ROOT (and other frameworks) use TBB and the two do not interoperate. This means that you will get one OpenMP thread pool per TBB thread, which will heavily oversubscribe the machine and be bad for performance. Not that anyone should be using TMatrix for performance, see also point 3.
  2. The OpenMP implementation is likely not even compiled in since we build without -fopenmp and I'm not sure how many users would add the flag themselves when compiling ROOT.
  3. Finally, optimizing general matrix multiplication (gemm) is really, really hard. Up to the point that processor companies pay teams of engineers to hand-write and optimize assembly code. Anybody caring for performance should use BLAS for standard linear algebra operations.

My 2 cents: the OpenMP implementation should likely be removed since it's not compiled by default and does not work well with other parallelization options in ROOT.

@muhammadalhroob
Copy link
Contributor Author

Hi @guitargeek,
Do you propose removing OpenMP and making a new PR?

@guitargeek
Copy link
Contributor

Hi! Yes, given the discussion here, it would be good if you open a new PR, removing the OpenMP pragmas.

@muhammadalhroob muhammadalhroob deleted the SpeedUpMatrixMultiplication branch December 4, 2025 09:12
@muhammadalhroob
Copy link
Contributor Author

Hi @guitargeek,
Thanks, here is #20639

guitargeek pushed a commit that referenced this pull request Dec 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants