base
optimus_dl.modules.distributed.base
Collective
Bases: ABC
Abstract base class for distributed communication.
This class defines the interface for all collective operations and distributed topology information. It allows the framework to switch between real distributed training (MeshCollective) and single-device/CPU execution (FakeCollective) without changing the training logic.
Attributes:
| Name | Type | Description |
|---|---|---|
| rank | int | Global rank of the current process. |
| world_size | int | Total number of processes in the global gang. |
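The idea of swapping a real and a fake collective behind one interface can be sketched as follows. This is a reduced, hypothetical illustration, not the library's code: `CollectiveSketch`, `SingleProcessCollective`, and `mean_loss` are invented names, and the assumption that `all_gather_objects` returns a list with one entry per rank is not confirmed by this page.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class CollectiveSketch(ABC):
    """Hypothetical, reduced stand-in for the Collective interface."""

    def __init__(self, rank: int, world_size: int) -> None:
        self.rank = rank              # global rank of this process
        self.world_size = world_size  # total number of processes

    @abstractmethod
    def all_gather_objects(self, obj: Any) -> List[Any]:
        """Assumed contract: return one object per rank, in rank order."""


class SingleProcessCollective(CollectiveSketch):
    """Hypothetical single-process backend: every gather is trivial."""

    def __init__(self) -> None:
        super().__init__(rank=0, world_size=1)

    def all_gather_objects(self, obj: Any) -> List[Any]:
        # With one process, the gathered list holds only our own object.
        return [obj]


def mean_loss(collective: CollectiveSketch, local_loss: float) -> float:
    # Training logic depends only on the interface, not on the backend.
    losses = collective.all_gather_objects(local_loss)
    return sum(losses) / len(losses)


collective = SingleProcessCollective()
averaged = mean_loss(collective, 2.5)
```

Because `mean_loss` only touches the abstract interface, the same function would work unchanged with a backend that gathers across real processes.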
Source code in optimus_dl/modules/distributed/base.py
default_device
abstractmethod
property
Get the default PyTorch device for this process.
dp_rank
abstractmethod
property
Rank within the Data Parallelism group.
dp_world_size
abstractmethod
property
Size of the Data Parallelism group.
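A common use of a data-parallel rank and group size is to give each replica a disjoint slice of the dataset. The sketch below shows one way to do this with strided slicing; `shard_for_rank` is a hypothetical helper, and the page does not say whether the framework shards data this way.

```python
from typing import List, Sequence


def shard_for_rank(samples: Sequence[int], dp_rank: int, dp_world_size: int) -> List[int]:
    """Hypothetical helper: strided shard of `samples` for one data-parallel rank."""
    if not 0 <= dp_rank < dp_world_size:
        raise ValueError("dp_rank must be in [0, dp_world_size)")
    # Rank r takes elements r, r + dp_world_size, r + 2 * dp_world_size, ...
    return list(samples[dp_rank::dp_world_size])


samples = list(range(10))
shards = [shard_for_rank(samples, r, 4) for r in range(4)]
# shards == [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

Strided slicing keeps shard sizes within one element of each other, but real training loops often pad or drop samples so that every rank runs the same number of steps.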
is_local_master
property
True if the current process is the master of its node (local rank 0).
is_master
property
True if the current process is the master (rank 0).
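The two master flags answer different questions: `is_master` picks exactly one process in the whole job, while `is_local_master` picks one process per node. The sketch below uses a hypothetical `RankInfo` class and an assumed layout of two nodes with two processes each to show the difference; it is not the library's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RankInfo:
    """Hypothetical holder for the two ranks the flags are derived from."""

    rank: int        # global rank
    local_rank: int  # rank within the current node

    @property
    def is_master(self) -> bool:
        return self.rank == 0

    @property
    def is_local_master(self) -> bool:
        return self.local_rank == 0


# Assumed layout: 2 nodes x 2 processes, global ranks 0..3.
procs = [RankInfo(rank=r, local_rank=r % 2) for r in range(4)]
masters = [p.rank for p in procs if p.is_master]
local_masters = [p.rank for p in procs if p.is_local_master]
# masters == [0], local_masters == [0, 2]
```

Typical uses would be guarding job-wide side effects (logging, writing a checkpoint) with `is_master` and per-node side effects (populating a node-local cache) with `is_local_master`.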
local
abstractmethod
property
Get a collective limited to the current node (local ranks).
local_rank
abstractmethod
property
Rank within the current node.
process_group
abstractmethod
property
The underlying PyTorch ProcessGroup, if available.
tp_rank
abstractmethod
property
Rank within the Tensor Parallelism group.
tp_world
abstractmethod
property
Get a collective for the current Tensor Parallelism group.
tp_world_size
abstractmethod
property
Size of the Tensor Parallelism group.
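To make the relationship between the global rank and the per-group ranks concrete, the sketch below maps a global rank to `(dp_rank, tp_rank)` coordinates. It assumes a two-dimensional mesh with only data and tensor parallelism, laid out row-major with tensor parallelism as the innermost dimension. Both the layout and the helper name `mesh_coords` are assumptions; this page does not state how the mesh is arranged.

```python
from typing import Tuple


def mesh_coords(rank: int, dp_world_size: int, tp_world_size: int) -> Tuple[int, int]:
    """Hypothetical mapping from global rank to (dp_rank, tp_rank)."""
    world_size = dp_world_size * tp_world_size  # assumes no other parallel dims
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, dp_world_size * tp_world_size)")
    # Row-major layout: consecutive global ranks share a dp_rank
    # and differ in tp_rank.
    dp_rank, tp_rank = divmod(rank, tp_world_size)
    return dp_rank, tp_rank


# 2 data-parallel replicas, each split across 4 tensor-parallel ranks.
coords = [mesh_coords(r, dp_world_size=2, tp_world_size=4) for r in range(8)]
# coords[5] == (1, 1)
```

Under this assumed layout, ranks 0 to 3 would form one tensor-parallel group and ranks 4 to 7 the other.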
all_gather(output_tensor, input_tensor)
abstractmethod
all_gather_objects(object)
abstractmethod
all_gather_to_list(output_tensors, input_tensor)
abstractmethod
Perform an all-gather operation into a list of tensors.
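The data movement of an all-gather into a list can be illustrated without any distributed runtime. The sketch below simulates all ranks in a single process with plain Python lists standing in for tensors. It assumes the usual all-gather convention in which slot `i` of every rank's output list receives rank `i`'s input; `simulate_all_gather_to_list` is a hypothetical helper, not part of the documented API.

```python
from typing import List


def simulate_all_gather_to_list(inputs_per_rank: List[List[float]]) -> List[List[List[float]]]:
    """Simulate an all-gather: returns the output list seen by each rank."""
    world_size = len(inputs_per_rank)
    outputs_per_rank = []
    for _dst in range(world_size):
        # Each destination rank fills one output slot per source rank,
        # copying the data so the slots do not alias the inputs.
        outputs = [list(inputs_per_rank[src]) for src in range(world_size)]
        outputs_per_rank.append(outputs)
    return outputs_per_rank


inputs = [[0.0, 0.5], [1.0, 1.5], [2.0, 2.5]]  # one "tensor" per rank
gathered = simulate_all_gather_to_list(inputs)
# Every rank ends up with the same list, ordered by source rank.
```

After the operation every rank holds an identical list of `world_size` entries, which is why the output list must be preallocated with one tensor per rank in a real call.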