Enable Pipeline Parallelism on jax worker #1043
base: main
Conversation
Force-pushed from e066aa3 to 4cac52f
```python
multihost_backend = os.environ.get("TPU_MULTIHOST_BACKEND", "").lower()
if multihost_backend != "ray" and self.parallel_config.pipeline_parallel_size > 1:
    # Note: Below is the setting for v6e8 host (8 chips of v6e)
    # There are 2 ways of subslicing a v6e:
```
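To make the two subslicing options concrete, here is an illustrative sketch of what the process-bounds settings could look like for a v6e8 host (8 chips in a 2x4 layout) split into two pipeline stages. The specific values are assumptions for illustration, not taken from this PR:

```python
# Illustrative only: two possible ways to subslice a v6e8 host (8 chips,
# 2x4 physical topology) into two pipeline-parallel processes. The exact
# values are assumed for illustration, not taken from this PR.
SUBSLICE_ALONG_X = {
    "TPU_PROCESS_BOUNDS": "2,1,1",           # 2 processes along x
    "TPU_CHIPS_PER_PROCESS_BOUNDS": "1,4,1", # each process sees a 1x4 block
}
SUBSLICE_ALONG_Y = {
    "TPU_PROCESS_BOUNDS": "1,2,1",           # 2 processes along y
    "TPU_CHIPS_PER_PROCESS_BOUNDS": "2,2,1", # each process sees a 2x2 block
}
```

Either way, the process grid times the per-process chip grid must cover all 8 chips of the host.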
Do we need to report errors if the settings don't match either of these 2 ways?
I use v6e8 as an example to show 2 ways to subslice the chips. My thinking was that if the customer is using other chips, they should replace this with their own topology. Do you have a better idea for how to present this?
Can the topology be passed as config variables (with the default being one of v6e's supported topologies), or at least as a parameter of init_device(), so people can find the needed changes more easily? And please move lines 136-141 to line 152.
There are 6 vars in total:
- TPU_PROCESS_ADDRESSES: all workers are the same
- TPU_PROCESS_PORT: different in each worker
- CLOUD_TPU_TASK_ID: different in each worker
- TPU_PROCESS_BOUNDS: all workers are the same
- TPU_CHIPS_PER_PROCESS_BOUNDS: all workers are the same
- TPU_VISIBLE_CHIPS: different in each worker
It might be difficult to pass these from configs, so I made them parameters of the init_device() method.
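The per-worker split described above can be sketched as a small helper. This is an illustrative stand-in, not the actual PR code; the function name, port scheme, and the single-axis chip split are assumptions made for the example:

```python
import os

def set_tpu_pp_env(rank: int, world_size: int, host_ip: str, base_port: int = 8476) -> None:
    """Illustrative sketch: derive the six TPU multi-process env vars for one
    pipeline-parallel worker. The helper name and the layout (8 chips split
    evenly along one axis) are assumptions, not taken from this PR."""
    # Same on all workers: the full list of process addresses.
    os.environ["TPU_PROCESS_ADDRESSES"] = ",".join(
        f"{host_ip}:{base_port + r}" for r in range(world_size)
    )
    # Same on all workers: process grid and chips per process
    # (example: split 8 chips into `world_size` processes along one axis).
    chips_per_proc = 8 // world_size
    os.environ["TPU_PROCESS_BOUNDS"] = f"1,{world_size},1"
    os.environ["TPU_CHIPS_PER_PROCESS_BOUNDS"] = f"1,{chips_per_proc},1"
    # Different on each worker.
    os.environ["TPU_PROCESS_PORT"] = str(base_port + rank)
    os.environ["CLOUD_TPU_TASK_ID"] = str(rank)
    os.environ["TPU_VISIBLE_CHIPS"] = ",".join(
        str(c) for c in range(rank * chips_per_proc, (rank + 1) * chips_per_proc)
    )
```

For example, worker 1 of 2 would get TPU_PROCESS_PORT=8477, CLOUD_TPU_TASK_ID=1, and TPU_VISIBLE_CHIPS=4,5,6,7, while the three shared variables are identical on both workers.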
@sixiang-google Hi Xiang, can you help me take a look?
yixinshi left a comment:
Good work! A general comment: shall we use a more specific PR title for each PR?
Signed-off-by: Chenyaaang <chenyangli@google.com>
Force-pushed from 103a581 to 2b74070
Description
The implementation of Pipeline Parallelism is split into the following small PRs.
This PR is to modify Jax worker to support PP.
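As a rough sketch of the send/receive pattern this PR adds, consider the following. The class shape, the recv_from_prev/send_to_next callables, and run_stage are hypothetical stand-ins for illustration, not the actual jax worker API:

```python
# Hypothetical sketch of the pipeline-parallel worker pattern: non-first
# ranks receive the intermediate tensor from the previous worker, non-last
# ranks send it to the next worker. Names are illustrative, not the PR's API.
class PPWorker:
    def __init__(self, rank: int, world_size: int, ip: str, prev_ip):
        # The current worker's IP and the previous worker's IP are kept so
        # the transfer server and connection can be started later.
        self.rank, self.world_size = rank, world_size
        self.ip, self.prev_ip = ip, prev_ip

    @property
    def is_first_rank(self) -> bool:
        return self.rank == 0

    @property
    def is_last_rank(self) -> bool:
        return self.rank == self.world_size - 1

    def execute_model(self, inputs, recv_from_prev, send_to_next):
        # Non-first ranks receive the intermediate tensor from the previous worker.
        hidden = inputs if self.is_first_rank else recv_from_prev()
        output = self.run_stage(hidden)
        # Non-last ranks forward the intermediate tensor to the next worker.
        if not self.is_last_rank:
            send_to_next(output)
            return None
        return output

    def run_stage(self, hidden):
        return hidden  # placeholder for this worker's model-shard computation
```

In a two-stage setup, rank 0 runs its shard on the model inputs and sends the result onward, while rank 1 blocks on the receive, runs its shard, and returns the final output.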
- __init__ takes in the current worker's IP and its previous worker's IP, to start the transfer server and connection later.
- In execute_model, PP workers that are not in the first rank need to receive the intermediate tensor from the previous worker, and PP workers that are not in the last rank need to send the intermediate tensor to their next worker.

Tests
E2E test has verified the whole PP implementation works properly.
Checklist
Before submitting this PR, please make sure: