rope

`optimus_dl.modules.model.blocks.rope`

Rotary Positional Embeddings (RoPE) implementation.

This module provides utilities for computing and applying Rotary Positional Embeddings, as used in models like Llama and Qwen.

`apply_rotary_emb(q, k, freqs_cis, position_ids=None)`

Apply Rotary Positional Embeddings to Query and Key tensors.

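Accepts both regular `torch.Tensor` inputs and distributed `DTensor` inputs. `DTensor` inputs are converted to their local shards for the computation, and the outputs are wrapped back into `DTensor`s with the original device mesh and placements. The rotation is computed in float32, and both outputs are cast back to the dtype of `q`.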

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `q` | `Tensor` | Query tensor of shape (B, T, nh, hs). | required |
| `k` | `Tensor` | Key tensor of shape (B, T, n_kv_h, hs). | required |
| `freqs_cis` | `Tensor` | Precomputed frequency tensor of shape (max_T, hs // 2, 2). | required |
| `position_ids` | `Tensor \| None` | Optional tensor of shape (B, T) specifying the position of each token. If omitted, positions 0..T-1 are assumed. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `tuple[Tensor, Tensor]` | Tuple of (q, k) with rotary embeddings applied. |
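
As a rough illustration of how `position_ids` changes the frequency lookup, the pure-Python sketch below mimics the two indexing paths with nested lists instead of tensors. The helper `select_rows` and the toy table are hypothetical and exist only for this example; the real function indexes a `torch.Tensor`. The packed-sequence scenario is one plausible use of explicit positions, not something this module documents.

```python
def select_rows(table, seq_len, position_ids=None):
    """Mimic the frequency lookup: slice by default, gather when positions are given.

    `table` stands in for freqs_cis (one entry per position). Hypothetical helper.
    """
    if position_ids is None:
        # Default path: every sequence uses positions 0..seq_len-1.
        return table[:seq_len]
    # Explicit path: one entry per (batch, token), i.e. an outer shape of (B, T).
    return [[table[p] for p in row] for row in position_ids]


# Toy table: entry i stands for the (hs // 2, 2) slice stored for position i.
table = [f"pos{i}" for i in range(8)]

default_rows = select_rows(table, seq_len=4)

# Two short sequences packed into one row: positions restart at 0 for the second.
packed_rows = select_rows(table, seq_len=4, position_ids=[[0, 1, 0, 1]])
```

In the default path the result has no batch dimension and is shared across the batch; with explicit positions each batch row gets its own lookup.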

Source code in `optimus_dl/modules/model/blocks/rope.py`

def apply_rotary_emb(
    q: torch.Tensor,
    k: torch.Tensor,
    freqs_cis: torch.Tensor,
    position_ids: torch.Tensor | None = None,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Apply Rotary Positional Embeddings to Query and Key tensors.

    Handles both standard Tensors and distributed DTensors.

    Args:
        q: Query tensor of shape (B, T, nh, hs).
        k: Key tensor of shape (B, T, n_kv_h, hs).
        freqs_cis: Precomputed frequency tensor of shape (max_T, hs // 2, 2).
        position_ids: Optional tensor of shape (B, T) specifying the positions for each token.

    Returns:
        Tuple of (q, k) with rotary embeddings applied.
    """
    is_q_dtensor = isinstance(q, DTensor)
    is_k_dtensor = isinstance(k, DTensor)
    is_freqs_cis_dtensor = isinstance(freqs_cis, DTensor)

    q_in = q.to_local() if is_q_dtensor else q
    k_in = k.to_local() if is_k_dtensor else k
    freqs_cis_in = freqs_cis.to_local() if is_freqs_cis_dtensor else freqs_cis

    # Remember the input dtype so the outputs can be cast back after the float32 math
    input_dtype = q_in.dtype

    T = q_in.shape[1]  # sequence length

    if position_ids is not None:
        # freqs_cis_in: (max_T, hs//2, 2)
        # position_ids: (B, T)
        # Result: (B, T, hs//2, 2)
        freqs_cis_in = freqs_cis_in[position_ids]
    else:
        # Assume positions 0..T-1
        freqs_cis_in = freqs_cis_in[:T]

    q_in = q_in.float().reshape(*q_in.shape[:-1], -1, 2)
    k_in = k_in.float().reshape(*k_in.shape[:-1], -1, 2)

    freqs_cis_res = _reshape_for_broadcast(freqs_cis_in, q_in)

    # Perform manual "complex" multiplication
    q_cos = q_in[..., 0] * freqs_cis_res[..., 0] - q_in[..., 1] * freqs_cis_res[..., 1]
    q_sin = q_in[..., 0] * freqs_cis_res[..., 1] + q_in[..., 1] * freqs_cis_res[..., 0]
    k_cos = k_in[..., 0] * freqs_cis_res[..., 0] - k_in[..., 1] * freqs_cis_res[..., 1]
    k_sin = k_in[..., 0] * freqs_cis_res[..., 1] + k_in[..., 1] * freqs_cis_res[..., 0]

    # Combine the results back into the interleaved format expected by q and k
    q_out = (
        torch.stack((q_cos, q_sin), dim=-1)
        .reshape(q_in.shape)
        .flatten(3)
        .to(input_dtype)
    )
    k_out = (
        torch.stack((k_cos, k_sin), dim=-1)
        .reshape(k_in.shape)
        .flatten(3)
        .to(input_dtype)
    )

    # Wrap back to DTensor if inputs were DTensor
    if is_q_dtensor:
        q_out = DTensor.from_local(q_out, q.device_mesh, q.placements)
    if is_k_dtensor:
        k_out = DTensor.from_local(k_out, k.device_mesh, k.placements)

    return q_out, k_out
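
The manual "complex" multiplication in the listing above can be checked on plain Python lists. The sketch below is not the module's code: `rotate_pairs`, `angles_at`, and `dot` are hypothetical helpers, and it assumes that the last axis of `freqs_cis` holds (cos, sin) and that angles follow the common RoPE schedule `pos * theta^(-2j / head_dim)`. It compares the real-valued arithmetic against a true complex multiplication and illustrates that attention scores depend only on the relative offset between positions.

```python
import cmath
import math


def rotate_pairs(x, angles):
    """Rotate consecutive (even, odd) pairs of x by the given angles.

    Same real-valued arithmetic as (a + ib) * (cos + i sin).
    """
    out = []
    for j, angle in enumerate(angles):
        a, b = x[2 * j], x[2 * j + 1]
        cos, sin = math.cos(angle), math.sin(angle)
        out.append(a * cos - b * sin)  # real part
        out.append(a * sin + b * cos)  # imaginary part
    return out


def angles_at(pos, head_dim, theta=10000.0):
    """Angles pos * theta^(-2j / head_dim); assumed schedule for this sketch."""
    return [pos * theta ** (-2.0 * j / head_dim) for j in range(head_dim // 2)]


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.75, -0.5, 1.0]

# 1) The real arithmetic matches a true complex multiplication.
rotated = rotate_pairs(q, angles_at(3, 4))
reference = []
for j, angle in enumerate(angles_at(3, 4)):
    z = complex(q[2 * j], q[2 * j + 1]) * cmath.exp(1j * angle)
    reference += [z.real, z.imag]

# 2) Scores depend only on the offset between positions (3 in both cases).
score_a = dot(rotate_pairs(q, angles_at(5, 4)), rotate_pairs(k, angles_at(2, 4)))
score_b = dot(rotate_pairs(q, angles_at(13, 4)), rotate_pairs(k, angles_at(10, 4)))
```

Because each pair is rotated rather than scaled, the vector norm is unchanged, and `score_a` and `score_b` agree up to floating-point error.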

`precompute_freqs_cis(dim, end, theta=10000.0, scaling_config=None)`

Precompute the frequency tensor of complex exponentials (cis, i.e. cos + i·sin), with optional scaling.
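
A minimal sketch of what such a table could look like, written in pure Python rather than PyTorch. The function name `precompute_freqs_sketch` is hypothetical, `scaling_config` is ignored entirely, and the (cos, sin) layout of the last axis is an assumption chosen to match the (max_T, hs // 2, 2) shape that `apply_rotary_emb` expects. The frequency schedule `theta^(-2j / dim)` is the common RoPE choice and may differ from what this module actually computes.

```python
import math


def precompute_freqs_sketch(dim, end, theta=10000.0):
    """Build an (end, dim // 2, 2) nested list of (cos, sin) pairs.

    Hypothetical stand-in for illustration; no frequency scaling is applied.
    """
    # One inverse frequency per pair of channels.
    inv_freq = [theta ** (-2.0 * j / dim) for j in range(dim // 2)]
    # Angle for position t and channel pair j is t * inv_freq[j].
    return [
        [(math.cos(t * f), math.sin(t * f)) for f in inv_freq]
        for t in range(end)
    ]


table = precompute_freqs_sketch(dim=8, end=16)
```

Here `dim` plays the role of the head size `hs` and `end` the role of `max_T`, so `table[t][j]` holds the rotation applied to channel pair `j` at position `t`.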