From 15b426c9b1ffea02dc99cfd97373b74bbcd5251d Mon Sep 17 00:00:00 2001 From: hizv <18361766+hizv@users.noreply.github.com> Date: Wed, 14 Feb 2024 15:56:24 +0530 Subject: [PATCH 1/2] Use ++n instead of +p in charmrun examples --- README.md | 4 ++-- doc/ampi/02-building.rst | 4 ++-- doc/ampi/04-extensions.rst | 4 ++-- doc/ampi/05-examples.rst | 34 ++++++++++++++++----------------- doc/charisma/manual.rst | 4 ++-- doc/charm++/manual.rst | 28 +++++++++++++-------------- doc/faq/manual.rst | 2 +- doc/libraries/manual.rst | 2 +- doc/pose/manual.rst | 4 ++-- examples/ParFUM/simple2D/README | 2 +- 10 files changed, 44 insertions(+), 44 deletions(-) diff --git a/README.md b/README.md index 6ce3080e32..01b96ef151 100644 --- a/README.md +++ b/README.md @@ -272,7 +272,7 @@ executable named `nqueen`. Following the previous example, to run the program on two processors, type - $ ./charmrun +p2 ./nqueen 12 6 + $ ./charmrun ++n 2 ./nqueen 12 6 This should run for a few seconds, and print out: `There are 14200 Solutions to 12 queens. Time=0.109440 End time=0.112752` @@ -307,7 +307,7 @@ want to run program on only one machine, for example, your laptop. This can save you all the hassle of setting up ssh daemons. To use this option, just type: - $ ./charmrun ++local ./nqueen 12 100 +p2 + $ ./charmrun ++local ./nqueen 12 100 ++n 2 However, for best performance, you should launch one node program per processor. diff --git a/doc/ampi/02-building.rst b/doc/ampi/02-building.rst index 1f9374f194..d3d6b964de 100644 --- a/doc/ampi/02-building.rst +++ b/doc/ampi/02-building.rst @@ -175,7 +175,7 @@ arguments. A typical invocation of an AMPI program ``pgm`` with .. code-block:: bash - $ ./charmrun +p16 ./pgm +vp64 + $ ./charmrun ++n 16 ./pgm +vp64 Here, the AMPI program ``pgm`` is run on 16 physical processors with 64 total virtual ranks (which will be mapped 4 per processor initially). @@ -189,7 +189,7 @@ example: .. code-block:: bash - $ ./charmrun +p16 ./pgm +vp128 +tcharm_stacksize 32K +balancer RefineLB + $ ./charmrun ++n 16 ./pgm +vp128 +tcharm_stacksize 32K +balancer RefineLB Running with ampirun ~~~~~~~~~~~~~~~~~~~~ diff --git a/doc/ampi/04-extensions.rst b/doc/ampi/04-extensions.rst index a69a38e4d7..aaae798648 100644 --- a/doc/ampi/04-extensions.rst +++ b/doc/ampi/04-extensions.rst @@ -566,7 +566,7 @@ of the AMPI program with some additional command line options. .. code-block:: bash - $ ./charmrun ./pgm +p4 +vp4 +msgLogWrite +msgLogRank 2 +msgLogFilename "msg2.log" + $ ./charmrun ./pgm ++n 4 +vp4 +msgLogWrite +msgLogRank 2 +msgLogFilename "msg2.log" In the above example, a parallel run with 4 worker threads and 4 AMPI ranks will be executed, and the changes in the MPI environment of worker @@ -574,7 +574,7 @@ thread 2 (also rank 2, starting from 0) will get logged into diskfile "msg2.log". Unlike the first run, the re-run is a sequential program, so it is not -invoked by charmrun (and omitting charmrun options like +p4 and +vp4), +invoked by charmrun (and omitting charmrun options like ++n 4 and +vp4), and additional command line options are required as well. .. code-block:: bash diff --git a/doc/ampi/05-examples.rst b/doc/ampi/05-examples.rst index 811da30a77..969bde55b7 100644 --- a/doc/ampi/05-examples.rst +++ b/doc/ampi/05-examples.rst @@ -31,7 +31,7 @@ MiniFE program. - Refer to the ``README`` file on how to run the program. 
For example: - ``./charmrun +p4 ./miniFE.x nx=30 ny=30 nz=30 +vp32`` + ``./charmrun ++n 4 ./miniFE.x nx=30 ny=30 nz=30 +vp32`` MiniMD v2.0 ~~~~~~~~~~~ @@ -44,7 +44,7 @@ MiniMD v2.0 execute ``make ampi`` to build the program. - Refer to the ``README`` file on how to run the program. For example: - ``./charmrun +p4 ./miniMD_ampi +vp32`` + ``./charmrun ++n 4 ./miniMD_ampi +vp32`` CoMD v1.1 ~~~~~~~~~ @@ -72,7 +72,7 @@ MiniXYCE v1.0 ``test/``. - Example run command: - ``./charmrun +p3 ./miniXyce.x +vp3 -circuit ../tests/cir1.net -t_start 1e-6 -pf params.txt`` + ``./charmrun ++n 3 ./miniXyce.x +vp3 -circuit ../tests/cir1.net -t_start 1e-6 -pf params.txt`` HPCCG v1.0 ~~~~~~~~~~ @@ -84,7 +84,7 @@ HPCCG v1.0 AMPI compilers. - Run with a command such as: - ``./charmrun +p2 ./test_HPCCG 20 30 10 +vp16`` + ``./charmrun ++n 2 ./test_HPCCG 20 30 10 +vp16`` MiniAMR v1.0 ~~~~~~~~~~~~ @@ -140,7 +140,7 @@ Lassen v1.0 - No changes necessary to enable AMPI virtualization. Requires some C++11 support. Set ``AMPIDIR`` in Makefile and ``make``. Run with: - ``./charmrun +p4 ./lassen_mpi +vp8 default 2 2 2 50 50 50`` + ``./charmrun ++n 4 ./lassen_mpi +vp8 default 2 2 2 50 50 50`` Kripke v1.1 ~~~~~~~~~~~ @@ -167,7 +167,7 @@ Kripke v1.1 .. code-block:: bash - $ ./charmrun +p8 ./src/tools/kripke +vp8 --zones 64,64,64 --procs 2,2,2 --nest ZDG + $ ./charmrun ++n 8 ./src/tools/kripke +vp8 --zones 64,64,64 --procs 2,2,2 --nest ZDG MCB v1.0.3 (2013) ~~~~~~~~~~~~~~~~~ @@ -181,7 +181,7 @@ MCB v1.0.3 (2013) .. code-block:: bash - $ OMP_NUM_THREADS=1 ./charmrun +p4 ./../src/MCBenchmark.exe --weakScaling + $ OMP_NUM_THREADS=1 ./charmrun ++n 4 ./../src/MCBenchmark.exe --weakScaling --distributedSource --nCores=1 --numParticles=20000 --multiSigma --nThreadCore=1 +vp16 .. _not-yet-ampi-zed-reason-1: @@ -228,7 +228,7 @@ SNAP v1.01 (C version) while the C version works out of the box on all platforms. - Edit the Makefile for AMPI compiler paths and run with: - ``./charmrun +p4 ./snap +vp4 --fi center_src/fin01 --fo center_src/fout01`` + ``./charmrun ++n 4 ./snap +vp4 --fi center_src/fin01 --fo center_src/fout01`` Sweep3D ~~~~~~~ @@ -248,7 +248,7 @@ Sweep3D - Modify file ``input`` to set the different parameters. Refer to file ``README`` on how to change those parameters. Run with: - ``./charmrun ./sweep3d.mpi +p8 +vp16`` + ``./charmrun ./sweep3d.mpi ++n 8 +vp16`` PENNANT v0.8 ~~~~~~~~~~~~ @@ -264,7 +264,7 @@ PENNANT v0.8 - For PENNANT-v0.8, point CC in Makefile to AMPICC and just ’make’. Run with the provided input files, such as: - ``./charmrun +p2 ./build/pennant +vp8 test/noh/noh.pnt`` + ``./charmrun ++n 2 ./build/pennant +vp8 test/noh/noh.pnt`` Benchmarks ---------- @@ -307,7 +307,7 @@ NAS Parallel Benchmarks (NPB 3.3) *cg.256.C* will appear in the *CG* and ``bin/`` directories. To run the particular benchmark, you must follow the standard procedure of running AMPI programs: - ``./charmrun ./cg.C.256 +p64 +vp256 ++nodelist nodelist`` + ``./charmrun ./cg.C.256 ++n 64 +vp256 ++nodelist nodelist`` NAS PB Multi-Zone Version (NPB-MZ 3.3) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -340,7 +340,7 @@ NAS PB Multi-Zone Version (NPB-MZ 3.3) directory. In the previous example, a file *bt-mz.256.C* will be created in the ``bin`` directory. To run the particular benchmark, you must follow the standard procedure of running AMPI programs: - ``./charmrun ./bt-mz.C.256 +p64 +vp256 ++nodelist nodelist`` + ``./charmrun ./bt-mz.C.256 ++n 64 +vp256 ++nodelist nodelist`` HPCG v3.0 ~~~~~~~~~ @@ -352,7 +352,7 @@ HPCG v3.0 - No AMPI-ization needed. 
To build, modify ``setup/Make.AMPI`` for compiler paths, do ``mkdir build && cd build && configure ../setup/Make.AMPI && make``. - To run, do ``./charmrun +p16 ./bin/xhpcg +vp64`` + To run, do ``./charmrun ++n 16 ./bin/xhpcg +vp64`` Intel Parallel Research Kernels (PRK) v2.16 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -408,7 +408,7 @@ HYPRE-2.11.1 ``LIBFLAGS``. Then run ``make``. - To run the ``new_ij`` test, run: - ``./charmrun +p64 ./new_ij -n 128 128 128 -P 4 4 4 -intertype 6 -tol 1e-8 -CF 0 -solver 61 -agg_nl 1 27pt -Pmx 6 -ns 4 -mu 1 -hmis -rlx 13 +vp64`` + ``./charmrun ++n 64 ./new_ij -n 128 128 128 -P 4 4 4 -intertype 6 -tol 1e-8 -CF 0 -solver 61 -agg_nl 1 27pt -Pmx 6 -ns 4 -mu 1 -hmis -rlx 13 +vp64`` MFEM-3.2 ~~~~~~~~ @@ -440,7 +440,7 @@ MFEM-3.2 - ``make parallel MFEM_USE_MPI=YES MPICXX=~/charm/bin/ampicxx HYPRE_DIR=~/hypre-2.11.1/src/hypre METIS_DIR=~/metis-4.0.3`` - To run an example, do - ``./charmrun +p4 ./ex15p -m ../data/amr-quad.mesh +vp16``. You may + ``./charmrun ++n 4 ./ex15p -m ../data/amr-quad.mesh +vp16``. You may want to add the runtime options ``-no-vis`` and ``-no-visit`` to speed things up. @@ -464,10 +464,10 @@ XBraid-1.1 HYPRE in their Makefiles and ``make``. - To run an example, do - ``./charmrun +p2 ./ex-02 -pgrid 1 1 8 -ml 15 -nt 128 -nx 33 33 -mi 100 +vp8 ++local``. + ``./charmrun ++n 2 ./ex-02 -pgrid 1 1 8 -ml 15 -nt 128 -nx 33 33 -mi 100 +vp8 ++local``. - To run a driver, do - ``./charmrun +p4 ./drive-03 -pgrid 2 2 2 2 -nl 32 32 32 -nt 16 -ml 15 +vp16 ++local`` + ``./charmrun ++n 4 ./drive-03 -pgrid 2 2 2 2 -nl 32 32 32 -nt 16 -ml 15 +vp16 ++local`` Other AMPI codes ---------------- diff --git a/doc/charisma/manual.rst b/doc/charisma/manual.rst index 53bb4d3a20..0bf8ea4024 100644 --- a/doc/charisma/manual.rst +++ b/doc/charisma/manual.rst @@ -483,7 +483,7 @@ Turing Cluster, use the customized job launcher ``rjq`` or ``rj``). .. code-block:: bash - $ charmrun pgm +p4 + $ charmrun pgm ++n 4 Please refer to Charm++'s manual and tutorial for more details of building and running a Charm++ program. @@ -619,7 +619,7 @@ instance, the following command uses ``RefineLB``. .. code-block:: bash - $ ./charmrun ./pgm +p16 +balancer RefineLB + $ ./charmrun ./pgm ++n 16 +balancer RefineLB .. _secsparse: diff --git a/doc/charm++/manual.rst b/doc/charm++/manual.rst index e78eaf4194..335583c312 100644 --- a/doc/charm++/manual.rst +++ b/doc/charm++/manual.rst @@ -8452,7 +8452,7 @@ mode. For example: .. code-block:: bash - $ ./charmrun hello +p4 +restart log + $ ./charmrun hello ++n 4 +restart log Restarting is the reverse process of checkpointing. Charm++ allows restarting the old checkpoint on a different number of physical @@ -8481,7 +8481,7 @@ After a failure, the system may contain fewer or more processors. Once the failed components have been repaired, some processors may become available again. Therefore, the user may need the flexibility to restart on a different number of processors than in the checkpointing phase. -This is allowable by giving a different ``+pN`` option at runtime. One +This is allowable by giving a different ``++n N`` option at runtime. One thing to note is that the new load distribution might differ from the previous one at checkpoint time, so running a load balancer (see Section :numref:`loadbalancing`) after restart is suggested. @@ -8618,9 +8618,9 @@ it stores them in the local disk. The checkpoint files are named Users can pass the runtime option ``+ftc_disk`` to activate this mode. For example: -.. code-block:: c++ +.. 
code-block:: bash - ./charmrun hello +p8 +ftc_disk + ./charmrun hello ++n 8 +ftc_disk Building Instructions ^^^^^^^^^^^^^^^^^^^^^ @@ -8629,7 +8629,7 @@ In order to have the double local-storage checkpoint/restart functionality available, the parameter ``syncft`` must be provided at build time: -.. code-block:: c++ +.. code-block:: bash ./build charm++ netlrts-linux-x86_64 syncft @@ -8656,7 +8656,7 @@ name: .. code-block:: bash - $ ./charmrun hello +p8 +kill_file + $ ./charmrun hello ++n 8 +kill_file An example of this usage can be found in the ``syncfttest`` targets in ``tests/charm++/jacobi3d``. @@ -9967,7 +9967,7 @@ program .. code-block:: bash - $ ./charmrun pgm +p1000 +balancer RandCentLB +LBDump 2 +LBDumpSteps 4 +LBDumpFile lbsim.dat + $ ./charmrun pgm ++n 1000 +balancer RandCentLB +LBDump 2 +LBDumpSteps 4 +LBDumpFile lbsim.dat This will collect data on files lbsim.dat.2,3,4,5. We can use this data to analyze the performance of various centralized strategies using: @@ -11330,7 +11330,7 @@ used, and a port number to listen the shrink/expand commands: .. code-block:: bash - $ ./charmrun +p4 ./jacobi2d 200 20 +balancer GreedyLB ++nodelist ./mynodelistfile ++server ++server-port 1234 + $ ./charmrun ++n 4 ./jacobi2d 200 20 +balancer GreedyLB ++nodelist ./mynodelistfile ++server ++server-port 1234 The CCS client to send shrink/expand commands needs to specify the hostname, port number, the old(current) number of processor and the @@ -11988,7 +11988,7 @@ To run a Charm++ program named “pgm” on four processors, type: .. code-block:: bash - $ charmrun pgm +p4 + $ charmrun pgm ++n 4 Execution on platforms which use platform specific launchers, (i.e., **aprun**, **ibrun**), can proceed without charmrun, or charmrun can be @@ -12122,7 +12122,7 @@ advanced options are available: ``++p N`` Total number of processing elements to create. In SMP mode, this refers to worker threads (where - :math:`\texttt{n} * \texttt{ppn} = \texttt{p}`), otherwise it refers + :math:`\texttt{n} \times \texttt{ppn} = \texttt{p}`), otherwise it refers to processes (:math:`\texttt{n} = \texttt{p}`). The default is 1. Use of ``++p`` is discouraged in favor of ``++processPer*`` (and ``++oneWthPer*`` in SMP mode) where desirable, or ``++n`` (and @@ -12230,7 +12230,7 @@ The remaining options cover details of process launch and connectivity: .. code-block:: bash - $ ./charmrun +p4 ./pgm 100 2 3 ++runscript ./set_env_script + $ ./charmrun ++n 4 ./pgm 100 2 3 ++runscript ./set_env_script In this case, ``set_env_script`` is invoked on each node. **Note:** When this is provided, ``charmrun`` will not invoke the program directly, instead only @@ -12526,7 +12526,7 @@ nodes than there are hosts in the group, it will reuse hosts. Thus, .. code-block:: bash - $ charmrun pgm ++nodegroup kale-sun +p6 + $ charmrun pgm ++nodegroup kale-sun ++n 6 uses hosts (charm, dp, grace, dagger, charm, dp) respectively as nodes (0, 1, 2, 3, 4, 5). @@ -12536,7 +12536,7 @@ Thus, if one specifies .. code-block:: bash - $ charmrun pgm +p4 + $ charmrun pgm ++n 4 it will use “localhost” four times. “localhost” is a Unix trick; it always find a name for whatever machine you’re on. @@ -13237,7 +13237,7 @@ of the above incantation, for various kinds of process launchers: .. code-block:: bash - $ ./charmrun +p2 `which valgrind` --log-file=VG.out.%p --trace-children=yes ./application_name ...application arguments... + $ ./charmrun ++n 2 `which valgrind` --log-file=VG.out.%p --trace-children=yes ./application_name ...application arguments... 
$ aprun -n 2 `which valgrind` --log-file=VG.out.%p --trace-children=yes ./application_name ...application arguments... The first adaptation is to use :literal:`\`which valgrind\`` to obtain a diff --git a/doc/faq/manual.rst b/doc/faq/manual.rst index 8f3bd0287c..9a0f2c1013 100644 --- a/doc/faq/manual.rst +++ b/doc/faq/manual.rst @@ -204,7 +204,7 @@ following command: .. code-block:: bash - ./charmrun +p14 ./pgm ++ppn 7 +commap 0 +pemap 1-7 + ./charmrun ++n 2 ./pgm ++ppn 7 +commap 0 +pemap 1-7 See :ref:`sec-smpopts` of the Charm++ manual for more information. diff --git a/doc/libraries/manual.rst b/doc/libraries/manual.rst index 4862ad0104..5ae7dd3464 100644 --- a/doc/libraries/manual.rst +++ b/doc/libraries/manual.rst @@ -36,7 +36,7 @@ client is a small Java program. A typical use of this is: cd charm/examples/charm++/wave2d make - ./charmrun ./wave2d +p2 ++server ++server-port 1234 + ./charmrun ./wave2d ++n 2 ++server ++server-port 1234 ~/ccs_tools/bin/liveViz localhost 1234 Use git to obtain a copy of ccs_tools (prior to using liveViz) and build diff --git a/doc/pose/manual.rst b/doc/pose/manual.rst index a9c3aef887..0d14a7bb20 100644 --- a/doc/pose/manual.rst +++ b/doc/pose/manual.rst @@ -128,12 +128,12 @@ Running ------- To run the program in parallel, a ``charmrun`` executable was created by -``charmc``. The flag ``+p`` is used to specify a number of processors to +``charmc``. The flag ``++n`` is used to specify a number of processors to run the program on. For example: .. code-block:: bash - $ ./charmrun pgm +p4 + $ ./charmrun pgm ++n 4 This runs the executable ``pgm`` on 4 processors. For more information on how to use ``charmrun`` and set up your environment for parallel diff --git a/examples/ParFUM/simple2D/README b/examples/ParFUM/simple2D/README index 4070a1318f..6069859541 100644 --- a/examples/ParFUM/simple2D/README +++ b/examples/ParFUM/simple2D/README @@ -34,7 +34,7 @@ OUTPUT This program exports its solution data via NetFEM. You can run the program so NetFEM will connect to it like: - ./charmrun ./pgm ++server ++server-port 1234 +p4 + ./charmrun ./pgm ++server ++server-port 1234 ++n 4 You'd then connect the NetFEM client to yourhostname:1234. From 44476f9361cc8470dc8a80bcc80a0b9a52169c0a Mon Sep 17 00:00:00 2001 From: hizv <18361766+hizv@users.noreply.github.com> Date: Wed, 14 Feb 2024 15:58:45 +0530 Subject: [PATCH 2/2] Discuss ++n in SMP options section --- doc/charm++/manual.rst | 33 +++++++++++++++++++++------------ 1 file changed, 21 insertions(+), 12 deletions(-) diff --git a/doc/charm++/manual.rst b/doc/charm++/manual.rst index 335583c312..75f348efe5 100644 --- a/doc/charm++/manual.rst +++ b/doc/charm++/manual.rst @@ -12400,20 +12400,29 @@ like: $ ./charmrun ++ppn 3 +p6 +pemap 1-3,5-7 +commap 0,4 ./app -This will create two logical nodes/OS processes (2 = 6 PEs/3 PEs per -node), each with three worker threads/PEs (``++ppn 3``). The worker -threads/PEs will be mapped thusly: PE 0 to core 1, PE 1 to core 2, PE 2 -to core 3 and PE 4 to core 5, PE 5 to core 6, and PE 6 to core 7. -PEs/worker threads 0-2 compromise the first logical node and 3-5 are the -second logical node. Additionally, the communication threads will be -mapped to core 0, for the communication thread of the first logical -node, and to core 4, for the communication thread of the second logical -node. - Please keep in mind that ``+p`` always specifies the total number of PEs created by Charm++, regardless of mode (the same number as returned by -``CkNumPes()``). 
The ``+p`` option does not include the communication
-thread, there will always be exactly one of those per logical node.
+``CkNumPes()``). So this will create two logical nodes/OS processes
+(2 = 6 PEs/3 PEs per node), each with three worker threads/PEs
+(``++ppn 3``).
+
+We recommend using ``++n``, especially with ``++ppn``. Recall
+that :math:`\texttt{n} \times \texttt{ppn} = \texttt{p}`. So the example becomes:
+
+.. code-block:: bash
+
+   $ ./charmrun ++ppn 3 ++n 2 +pemap 1-3,5-7 +commap 0,4 ./app
+
+The worker threads/PEs will be mapped as follows: PE 0 to
+core 1, PE 1 to core 2, PE 2 to core 3, and PE 3 to core 5, PE 4 to
+core 6, and PE 5 to core 7 (``+pemap``). PEs/worker threads 0-2
+comprise the first logical node and 3-5 are the second logical node.
+Additionally, the communication threads will be mapped to core 0, for
+the communication thread of the first logical node, and to core 4,
+for the communication thread of the second logical node (``+commap``).
+
+Note that the ``+p`` option does not include the communication
+thread. There will always be exactly one of those per logical node.
 
 Multicore Options
 ^^^^^^^^^^^^^^^^^