vllm.v1.worker.dp_utils ¶
_get_device_and_group ¶
_get_device_and_group(parallel_config: ParallelConfig)
Source code in vllm/v1/worker/dp_utils.py
_post_process_cudagraph_mode ¶
Synchronize cudagraph_mode across DP ranks by taking the minimum. If any rank has NONE (0), all ranks use NONE. This ensures all ranks send consistent values (all padded or all unpadded).
Source code in vllm/v1/worker/dp_utils.py
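The min rule described above can be shown with plain torch. The values below are made up; only the 0=NONE / 1=PIECEWISE / 2=FULL encoding documented under coordinate_batch_across_dp is assumed:

```python
import torch

# Hypothetical cudagraph_mode values reported by four DP ranks
# (0 = NONE, 1 = PIECEWISE, 2 = FULL).
modes = torch.tensor([2, 1, 0, 2])

# Taking the minimum: any rank reporting NONE (0) forces NONE everywhere,
# so all ranks agree on whether their sends are padded or unpadded.
synced_mode = int(modes.min().item())
assert synced_mode == 0
```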
_post_process_dp_padding ¶
Source code in vllm/v1/worker/dp_utils.py
_post_process_ubatch ¶
Source code in vllm/v1/worker/dp_utils.py
_run_ar ¶
_run_ar(
should_ubatch: bool,
should_dp_pad: bool,
orig_num_tokens_per_ubatch: int,
padded_num_tokens_per_ubatch: int,
cudagraph_mode: int,
parallel_config: ParallelConfig,
) -> Tensor
Source code in vllm/v1/worker/dp_utils.py
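A rough single-process sketch of what such a cross-rank exchange produces. This is an illustration, not the actual implementation; the row layout is an assumption based on _run_ar's scalar arguments:

```python
import torch

# Hypothetical per-rank rows mirroring _run_ar's scalar arguments:
# [should_ubatch, should_dp_pad, orig_tokens, padded_tokens, cudagraph_mode]
rank_rows = [
    [1, 1, 4096, 4224, 2],  # DP rank 0
    [1, 1, 3000, 3072, 1],  # DP rank 1
]

# Simulated collective result: every rank ends up holding the same
# (dp_size, 5) tensor and can post-process it identically.
gathered = torch.tensor(rank_rows, dtype=torch.int32)

# Microbatching only proceeds if *all* ranks asked for it ...
should_ubatch = bool(gathered[:, 0].min().item())
# ... and cudagraph_mode is synchronized by taking the minimum.
synced_mode = int(gathered[:, 4].min().item())
```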
_synchronize_dp_ranks ¶
_synchronize_dp_ranks(
num_tokens_unpadded: int,
num_tokens_padded: int,
should_attempt_ubatching: bool,
should_attempt_dp_padding: bool,
cudagraph_mode: int,
parallel_config: ParallelConfig,
) -> tuple[bool, Tensor | None, int]
1. Decides if each DP rank is going to microbatch. Either all ranks run with microbatching or none of them do.
2. Determines the total number of tokens that each rank will run. When running microbatched, or if should_attempt_dp_padding is True, all ranks will be padded out so that they run with the same number of tokens (see the sketch below).
3. Synchronizes cudagraph_mode across ranks by taking the minimum.
Returns:

tuple[bool, Tensor | None, int]

| Name | Type | Description |
|---|---|---|
| should_ubatch | bool | Are all DP ranks going to microbatch. |
| num_tokens_after_padding | Tensor \| None | A tensor containing the total number of tokens per-microbatch for each DP rank, including any DP padding. |
| synced_cudagraph_mode | int | The synchronized cudagraph mode (min across ranks). |
Source code in vllm/v1/worker/dp_utils.py
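The DP-padding rule in point 2 can be sketched with plain torch. The token counts are made up, and the variable name num_tokens_after_padding simply mirrors the return value above:

```python
import torch

# Hypothetical unpadded token counts reported by four DP ranks.
tokens_per_rank = torch.tensor([4096, 3000, 512, 4000])

# When microbatching, or when should_attempt_dp_padding is True, every
# rank is padded up to the max across ranks so all ranks run the same
# number of tokens.
num_tokens_after_padding = torch.full_like(
    tokens_per_rank, tokens_per_rank.max().item()
)
# -> tensor([4096, 4096, 4096, 4096])
```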
coordinate_batch_across_dp ¶
coordinate_batch_across_dp(
num_tokens_unpadded: int,
allow_microbatching: bool,
allow_dp_padding: bool,
parallel_config: ParallelConfig,
num_tokens_padded: int | None = None,
uniform_decode: bool | None = None,
num_scheduled_tokens_per_request: ndarray | None = None,
cudagraph_mode: int = 0,
) -> tuple[bool, Tensor | None, int]
Coordinates amongst all DP ranks to determine if and how the full batch should be split into microbatches.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| num_tokens_unpadded | int | Number of tokens without accounting for padding. | required |
| allow_microbatching | bool | If microbatching should be attempted. | required |
| allow_dp_padding | bool | If all DP ranks should be padded up to the same value. | required |
| parallel_config | ParallelConfig | The parallel config. | required |
| num_tokens_padded | int \| None | Number of tokens including any non-DP padding (CUDA graphs, TP, etc.). | None |
| uniform_decode | bool \| None | Only used if allow_microbatching is True. True if the batch only contains single-token decodes. | None |
| num_scheduled_tokens_per_request | ndarray \| None | Only used if allow_microbatching is True. The number of tokens per request. | None |
| cudagraph_mode | int | The cudagraph mode for this rank (0=NONE, 1=PIECEWISE, 2=FULL). | 0 |
Returns:

tuple[bool, Tensor | None, int]

| Name | Type | Description |
|---|---|---|
| ubatch_slices | bool | If this is set then all DP ranks have agreed to microbatch. |
| num_tokens_after_padding | Tensor \| None | A tensor containing the total number of tokens per-microbatch for each DP rank, including padding. Will be padded up to the max value across all DP ranks when allow_dp_padding is True. |
| synced_cudagraph_mode | int | The synchronized cudagraph mode (min across ranks). |
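A hypothetical call site, shown only to illustrate the argument and return shapes. In vLLM the worker passes its live ParallelConfig; the default-constructed one here (data_parallel_size=1) is a stand-in, and all numeric values are invented:

```python
from vllm.config import ParallelConfig
from vllm.v1.worker.dp_utils import coordinate_batch_across_dp

parallel_config = ParallelConfig()  # stand-in; normally comes from the worker

should_ubatch, num_tokens_after_padding, synced_cudagraph_mode = (
    coordinate_batch_across_dp(
        num_tokens_unpadded=3000,
        allow_microbatching=False,  # invented values throughout
        allow_dp_padding=True,
        parallel_config=parallel_config,
        num_tokens_padded=3072,     # e.g. after CUDA-graph padding
        cudagraph_mode=2,           # FULL on this rank
    )
)
```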