Upload folder using huggingface_hub

Browse files

Files changed (13) hide show

README.md +161 -0
config.json +24 -0
generation_config.json +4 -0
inference.py +37 -0
merges.txt +0 -0
model.safetensors +3 -0
models/__init__.py +1 -0
models/bitlinear.py +51 -0
models/model_v1_earlyexit.py +450 -0
special_tokens_map.json +6 -0
tokenizer.json +0 -0
tokenizer_config.json +20 -0
vocab.json +0 -0

README.md ADDED Viewed

	@@ -0,0 +1,161 @@

+---
+license: mit
+language:
+- en
+pipeline_tag: text-generation
+tags:
+- bitnet
+- quantization
+- early-exit
+- layer-skipping
+- efficient-transformers
+datasets:
+- roneneldan/TinyStories
+---
+# bitskip-v1-earlyexit
+BitSkip v1 with 8-bit activation quantization and ternary weights (no Hadamard transform)
+## Model Description
+This model implements a 24-layer transformer with early exit loss and quadratic layer dropout for efficient inference. It was trained on the TinyStories dataset with layer-wise auxiliary supervision to enable flexible speed-quality tradeoffs during inference.
+## Architecture Details
+- **Layers**: 24
+- **Hidden dimension**: 2048
+- **Attention heads**: 32 (64-dimensional each)
+- **Key-Value heads**: 8 (Grouped Query Attention with 4:1 ratio)
+- **FFN intermediate size**: 4096
+- **Position embeddings**: Rotary Position Embeddings (RoPE)
+- **Normalization**: RMSNorm
+- **Activation**: SwiGLU (for MLP)
+- **Parameters**: ~1.06B
+### Quantization Scheme
+- **Weights**: Ternary {-1, 0, 1}
+- **Activations**: 8-bit quantization
+- **Hadamard**: No
+## Training Details
+### Dataset
+- **Source**: TinyStories (2.1M stories)
+- **Tokenizer**: GPT-2 BPE (vocab size: 50,257)
+- **Sequence length**: 512 tokens
+### Training Techniques
+**Quadratic Layer Dropout:**
+- Progressive dropout: p_l = 0.5 × (l/L)²
+- Normalized so Σp_l = 1.0
+- Never drops final layer
+- Makes earlier layers more accurate
+**Early Exit Loss:**
+- All layers share the same LM head
+- Loss = main_loss + 0.3 × early_exit_loss
+- Layer-proportional weighting: w_i = (i+1)/L
+- Enables flexible early exit at inference
+### Hyperparameters
+- **Optimizer**: AdamW
+- **Learning rate**: 6e-4
+- **Warmup steps**: 1000
+- **Batch size**: 16 (effective: 64)
+- **Training steps**: 50000
+- **Gradient clipping**: 1.0
+## Performance
+### Perplexity (TinyStories validation)
+| Exit Layer | Perplexity | Speed (tok/s) |
+|------------|------------|---------------|
+| All layers | TBD | TBD |
+| Layer 18 | TBD | TBD |
+| Layer 12 | TBD | TBD |
+| Layer 6 | TBD | TBD |
+### Training Stability
+- **Gradient norms**: 2-5
+- **Final loss**: TBD
+## Usage
+### Installation
+```bash
+pip install transformers torch
+```
+### Basic Inference
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+# Load model
+model = AutoModelForCausalLM.from_pretrained("your-username/bitskip-v1-earlyexit")
+tokenizer = AutoTokenizer.from_pretrained("your-username/bitskip-v1-earlyexit")
+# Generate text
+inputs = tokenizer("Once upon a time", return_tensors="pt")
+outputs = model.generate(**inputs, max_length=100)
+print(tokenizer.decode(outputs[0]))
+```
+### Early Exit Inference
+```python
+# Exit at layer 12 for faster inference
+model.set_exit_layer(12)
+outputs = model.generate(**inputs, max_length=100)
+# 1.5-2x faster with minimal quality loss
+```
+### Benchmark Different Exit Layers
+```python
+for exit_layer in [6, 12, 18, 24]:
+    model.set_exit_layer(exit_layer)
+    outputs = model.generate(**inputs, max_length=100)
+    print(f"Layer {exit_layer}: {tokenizer.decode(outputs[0])}")
+```
+## Limitations
+- **Inference speed**: Quantized models use fake quantization (QAT) without specialized kernels, resulting in slower inference than full-precision despite lower bit-width
+- **Training instability**: 4-bit models (v2) exhibit gradient explosion (norms 50-110) requiring careful hyperparameter tuning
+- **Dataset scope**: Trained only on TinyStories; may not generalize to other domains without fine-tuning
+## Citation
+If you use this model, please cite:
+```bibtex
+@article{bitnet,
+  title={BitNet: Scaling 1-bit Transformers for Large Language Models},
+  author={Wang, Hongyu and Ma, Shuming and Dong, Li and others},
+  journal={arXiv preprint arXiv:2310.11453},
+  year={2023}
+}
+@article{layerskip,
+  title={LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding},
+  author={Elhoushi, Mostafa and Shrivastava, Akshat and Liskovich, Diana and others},
+  journal={arXiv preprint arXiv:2404.16710},
+  year={2024}
+}
+```
+## License
+MIT License
+## Contact
+For questions or issues, please open an issue on the model repository.

config.json ADDED Viewed

	@@ -0,0 +1,24 @@

+{
+  "architectures": [
+    "BitSkipV1ForCausalLMWithEarlyExit"
+  ],
+  "auto_map": {
+    "AutoConfig": "model_v1_earlyexit.BitSkipV1EarlyExitConfig",
+    "AutoModelForCausalLM": "model_v1_earlyexit.BitSkipV1ForCausalLMWithEarlyExit"
+  },
+  "early_exit_loss_weight": 0.3,
+  "hidden_size": 2048,
+  "inference_exit_layer": null,
+  "intermediate_size": 4096,
+  "max_dropout_prob": 0.5,
+  "max_position_embeddings": 2048,
+  "model_type": "bitskip_v1_earlyexit",
+  "num_attention_heads": 32,
+  "num_hidden_layers": 24,
+  "num_key_value_heads": 8,
+  "rms_norm_eps": 1e-05,
+  "rope_theta": 10000.0,
+  "torch_dtype": "float32",
+  "transformers_version": "4.45.2",
+  "vocab_size": 50257
+}

generation_config.json ADDED Viewed

	@@ -0,0 +1,4 @@

+{
+  "_from_model_config": true,
+  "transformers_version": "4.45.2"
+}

inference.py ADDED Viewed

	@@ -0,0 +1,37 @@

+"""
+Inference script for bitskip-v1-earlyexit
+"""
+import torch
+from transformers import AutoTokenizer, AutoModelForCausalLM
+def main():
+    # Load from HuggingFace Hub or local path
+    model_path = "."  # Current directory or specify repo_id
+    print("Loading model...")
+    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
+    tokenizer = AutoTokenizer.from_pretrained(model_path)
+    model.eval()
+    print("Model loaded!")
+    # Example generation
+    prompt = "Once upon a time"
+    inputs = tokenizer(prompt, return_tensors="pt")
+    print(f"\nPrompt: {prompt}\n")
+    # Full model
+    print("Generating with all layers...")
+    outputs = model.generate(**inputs, max_length=100, pad_token_id=tokenizer.eos_token_id)
+    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+    # Early exit at layer 12
+    print("\nGenerating with early exit at layer 12...")
+    model.set_exit_layer(12)
+    outputs = model.generate(**inputs, max_length=100, pad_token_id=tokenizer.eos_token_id)
+    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+if __name__ == "__main__":
+    main()

merges.txt ADDED Viewed

The diff for this file is too large to render. See raw diff

model.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:643aa06158d00a0975ba01c1b8abd6b35d32df0c1051cb20b1f9dae331d9e958
+size 3834689608

models/__init__.py ADDED Viewed

	@@ -0,0 +1 @@


1	+ """Model files for bitskip-v1-earlyexit"""

models/bitlinear.py ADDED Viewed

	@@ -0,0 +1,51 @@

+"""
+Standard BitLinear layer for BitSkip v1 (8-bit activations, NO Hadamard transform)
+"""
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+class BitLinear(nn.Module):
+    """
+    Standard BitLinear: Ternary weights + 8-bit activations.
+    NO Hadamard transform - direct quantization.
+    """
+    def __init__(self, in_features, out_features, bias=False):
+        super().__init__()
+        self.in_features = in_features
+        self.out_features = out_features
+        # Standard weight initialization
+        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
+        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None
+    def forward(self, x):
+        """
+        Forward with 8-bit activation quantization and ternary weights.
+        Uses STE (Straight-Through Estimator) for gradients.
+        """
+        # 8-bit activation quantization
+        x_scale = x.abs().max(dim=-1, keepdim=True)[0].clamp(min=1e-5)
+        x_quant = (x / x_scale * 127).round().clamp(-128, 127)
+        x_quant = x_quant / 127 * x_scale
+        # STE: quantized forward, full precision backward
+        if self.training:
+            x_quant = x + (x_quant - x).detach()
+        # Ternary weight quantization
+        w_scale = self.weight.abs().mean().clamp(min=1e-5)
+        w_quant = torch.zeros_like(self.weight)
+        w_quant[self.weight > 0.5 * w_scale] = 1.0
+        w_quant[self.weight < -0.5 * w_scale] = -1.0
+        w_quant = w_quant * w_scale
+        # STE for weights
+        if self.training:
+            w_quant = self.weight + (w_quant - self.weight).detach()
+        # Standard linear operation
+        return F.linear(x_quant, w_quant, self.bias)

models/model_v1_earlyexit.py ADDED Viewed

	@@ -0,0 +1,450 @@

+"""
+BitSkip v1 with Early Exit Loss and Quadratic Dropout
+- BitLinear quantization (8-bit)
+- Quadratic layer dropout (0 to 0.5 progression, sum=1 constraint)
+- Early exit loss from all layers
+- HuggingFace compatible
+"""
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import math
+from transformers import PreTrainedModel, PretrainedConfig, GenerationMixin
+from transformers.modeling_outputs import CausalLMOutputWithPast
+from typing import Optional, Tuple
+from .bitlinear import BitLinear
+class BitSkipV1EarlyExitConfig(PretrainedConfig):
+    model_type = "bitskip_v1_earlyexit"
+    def __init__(
+        self,
+        vocab_size=50257,
+        hidden_size=2048,
+        num_hidden_layers=24,
+        num_attention_heads=32,
+        num_key_value_heads=8,
+        intermediate_size=4096,
+        max_position_embeddings=2048,
+        rms_norm_eps=1e-5,
+        rope_theta=10000.0,
+        # Early exit parameters
+        early_exit_loss_weight=0.3,
+        # Quadratic dropout parameters
+        max_dropout_prob=0.5,
+        # Inference
+        inference_exit_layer=None,
+        **kwargs
+    ):
+        self.vocab_size = vocab_size
+        self.hidden_size = hidden_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.num_key_value_heads = num_key_value_heads
+        self.intermediate_size = intermediate_size
+        self.max_position_embeddings = max_position_embeddings
+        self.rms_norm_eps = rms_norm_eps
+        self.rope_theta = rope_theta
+        self.early_exit_loss_weight = early_exit_loss_weight
+        self.max_dropout_prob = max_dropout_prob
+        self.inference_exit_layer = inference_exit_layer
+        super().__init__(**kwargs)
+class QuadraticLayerDropout(nn.Module):
+    """
+    Quadratic layer dropout: p_l = p_max * (l/L)^2
+    Normalized so sum of probabilities = 1
+    """
+    def __init__(self, num_layers, max_dropout_prob=0.5):
+        super().__init__()
+        self.num_layers = num_layers
+        self.max_dropout_prob = max_dropout_prob
+        # Compute quadratic dropout probabilities
+        dropout_probs = []
+        for i in range(num_layers):
+            # Quadratic: p_l = p_max * (l/L)^2
+            prob = max_dropout_prob * ((i / max(num_layers - 1, 1)) ** 2)
+            dropout_probs.append(prob)
+        # Normalize so sum = 1 (as per requirement)
+        total_prob = sum(dropout_probs)
+        if total_prob > 0:
+            dropout_probs = [p / total_prob for p in dropout_probs]
+        self.dropout_probs = dropout_probs
+    def should_drop_layer(self, layer_idx):
+        """Returns True if layer should be dropped during training."""
+        if not self.training or layer_idx >= self.num_layers - 1:  # Never drop last layer
+            return False
+        return torch.rand(1).item() < self.dropout_probs[layer_idx]
+class RMSNorm(nn.Module):
+    def __init__(self, hidden_size, eps=1e-6):
+        super().__init__()
+        self.weight = nn.Parameter(torch.ones(hidden_size))
+        self.variance_epsilon = eps
+    def forward(self, hidden_states):
+        input_dtype = hidden_states.dtype
+        hidden_states = hidden_states.to(torch.float32)
+        variance = hidden_states.pow(2).mean(-1, keepdim=True)
+        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+        return self.weight * hidden_states.to(input_dtype)
+class RotaryEmbedding(nn.Module):
+    def __init__(self, dim, max_position_embeddings=2048, base=10000):
+        super().__init__()
+        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
+        self.register_buffer("inv_freq", inv_freq)
+    def forward(self, x, position_ids):
+        inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
+        position_ids_expanded = position_ids[:, None, :].float()
+        freqs = (inv_freq_expanded @ position_ids_expanded).transpose(1, 2)
+        emb = torch.cat((freqs, freqs), dim=-1)
+        return emb.cos().to(x.dtype), emb.sin().to(x.dtype)
+def rotate_half(x):
+    x1, x2 = x[..., :x.shape[-1]//2], x[..., x.shape[-1]//2:]
+    return torch.cat((-x2, x1), dim=-1)
+def apply_rotary_pos_emb(q, k, cos, sin):
+    q_embed = (q * cos) + (rotate_half(q) * sin)
+    k_embed = (k * cos) + (rotate_half(k) * sin)
+    return q_embed, k_embed
+class BitSkipV1Attention(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.hidden_size = config.hidden_size
+        self.num_heads = config.num_attention_heads
+        self.head_dim = self.hidden_size // self.num_heads
+        self.num_key_value_heads = config.num_key_value_heads
+        self.num_key_value_groups = self.num_heads // self.num_key_value_heads
+        self.q_proj = BitLinear(self.hidden_size, self.num_heads * self.head_dim)
+        self.k_proj = BitLinear(self.hidden_size, self.num_key_value_heads * self.head_dim)
+        self.v_proj = BitLinear(self.hidden_size, self.num_key_value_heads * self.head_dim)
+        self.o_proj = BitLinear(self.hidden_size, self.hidden_size)
+        self.rotary_emb = RotaryEmbedding(self.head_dim, config.max_position_embeddings, config.rope_theta)
+    def forward(self, hidden_states, attention_mask=None, position_ids=None, past_key_value=None, use_cache=False):
+        bsz, q_len, _ = hidden_states.size()
+        query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+        key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+        value_states = self.v_proj(hidden_states).view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+        cos, sin = self.rotary_emb(value_states, position_ids)
+        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
+        if past_key_value is not None:
+            key_states = torch.cat([past_key_value[0], key_states], dim=2)
+            value_states = torch.cat([past_key_value[1], value_states], dim=2)
+        past_key_value = (key_states, value_states) if use_cache else None
+        key_states = key_states.repeat_interleave(self.num_key_value_groups, dim=1)
+        value_states = value_states.repeat_interleave(self.num_key_value_groups, dim=1)
+        attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
+        if attention_mask is not None:
+            attn_weights = attn_weights + attention_mask
+        attn_weights = nn.functional.softmax(attn_weights, dim=-1)
+        attn_output = torch.matmul(attn_weights, value_states)
+        attn_output = attn_output.transpose(1, 2).contiguous().reshape(bsz, q_len, self.hidden_size)
+        attn_output = self.o_proj(attn_output)
+        return attn_output, None, past_key_value
+class BitSkipV1MLP(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.gate_proj = BitLinear(config.hidden_size, config.intermediate_size)
+        self.up_proj = BitLinear(config.hidden_size, config.intermediate_size)
+        self.down_proj = BitLinear(config.intermediate_size, config.hidden_size)
+    def forward(self, x):
+        return self.down_proj(nn.functional.silu(self.gate_proj(x)) * self.up_proj(x))
+class BitSkipV1DecoderLayer(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.self_attn = BitSkipV1Attention(config)
+        self.mlp = BitSkipV1MLP(config)
+        self.input_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.post_attention_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+    def forward(self, hidden_states, attention_mask=None, position_ids=None, past_key_value=None, use_cache=False):
+        residual = hidden_states
+        hidden_states = self.input_layernorm(hidden_states)
+        hidden_states, _, present_key_value = self.self_attn(
+            hidden_states, attention_mask, position_ids, past_key_value, use_cache
+        )
+        hidden_states = residual + hidden_states
+        residual = hidden_states
+        hidden_states = self.post_attention_layernorm(hidden_states)
+        hidden_states = self.mlp(hidden_states)
+        hidden_states = residual + hidden_states
+        return (hidden_states,) + ((present_key_value,) if use_cache else ())
+class BitSkipV1PreTrainedModel(PreTrainedModel):
+    config_class = BitSkipV1EarlyExitConfig
+    base_model_prefix = "model"
+    supports_gradient_checkpointing = True
+    def _init_weights(self, module):
+        if isinstance(module, (nn.Linear, BitLinear)):
+            if hasattr(module, 'weight'):
+                module.weight.data.normal_(mean=0.0, std=0.02)
+            if hasattr(module, 'bias') and module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, nn.Embedding):
+            module.weight.data.normal_(mean=0.0, std=0.02)
+class BitSkipV1Model(BitSkipV1PreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
+        self.layers = nn.ModuleList([BitSkipV1DecoderLayer(config) for _ in range(config.num_hidden_layers)])
+        self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.gradient_checkpointing = False
+        # Quadratic dropout module
+        self.layer_dropout = QuadraticLayerDropout(config.num_hidden_layers, config.max_dropout_prob)
+        self.post_init()
+    def forward(self, input_ids, attention_mask=None, position_ids=None, past_key_values=None, use_cache=False, output_hidden_states=False, return_all_layer_outputs=False):
+        hidden_states = self.embed_tokens(input_ids)
+        if position_ids is None:
+            position_ids = torch.arange(input_ids.shape[1], dtype=torch.long, device=input_ids.device)
+            position_ids = position_ids.unsqueeze(0)
+        next_decoder_cache = () if use_cache else None
+        all_layer_hidden_states = []  # Store ALL layer outputs for early exit loss
+        # Determine layers to run
+        num_layers_to_run = self.config.inference_exit_layer if self.config.inference_exit_layer else len(self.layers)
+        num_layers_to_run = min(num_layers_to_run, len(self.layers))
+        for idx in range(num_layers_to_run):
+            layer = self.layers[idx]
+            past_key_value = past_key_values[idx] if past_key_values else None
+            # Apply quadratic dropout during training
+            if self.training and self.layer_dropout.should_drop_layer(idx):
+                # Skip this layer - hidden_states pass through unchanged
+                all_layer_hidden_states.append(hidden_states)
+                continue
+            if self.gradient_checkpointing and self.training:
+                layer_outputs = self._gradient_checkpointing_func(
+                    layer.__call__,
+                    hidden_states,
+                    attention_mask,
+                    position_ids,
+                    past_key_value,
+                    use_cache,
+                )
+            else:
+                layer_outputs = layer(hidden_states, attention_mask, position_ids, past_key_value, use_cache)
+            hidden_states = layer_outputs[0]
+            all_layer_hidden_states.append(hidden_states)
+            if use_cache:
+                next_decoder_cache += (layer_outputs[1],)
+        hidden_states = self.norm(hidden_states)
+        all_layer_hidden_states.append(hidden_states)  # Add final normed output
+        if return_all_layer_outputs:
+            return hidden_states, next_decoder_cache, all_layer_hidden_states
+        else:
+            return hidden_states, next_decoder_cache, None
+class BitSkipV1ForCausalLMWithEarlyExit(BitSkipV1PreTrainedModel, GenerationMixin):
+    """BitSkip v1 with early exit loss."""
+    _tied_weights_keys = ["lm_head.weight"]
+    def __init__(self, config):
+        super().__init__(config)
+        self.model = BitSkipV1Model(config)
+        self.vocab_size = config.vocab_size
+        # Shared LM head for all layers (LayerSkip approach)
+        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        self.post_init()
+    def get_input_embeddings(self):
+        return self.model.embed_tokens
+    def set_input_embeddings(self, value):
+        self.model.embed_tokens = value
+    def get_output_embeddings(self):
+        return self.lm_head
+    def set_output_embeddings(self, new_embeddings):
+        self.lm_head = new_embeddings
+    def compute_early_exit_loss(self, all_layer_hidden_states, labels):
+        """
+        Compute early exit loss from all layers.
+        Uses layer-proportional weighting: w_i = i/N
+        """
+        num_layers = len(all_layer_hidden_states)
+        # Compute weights: layer-proportional (deeper layers get more weight)
+        weights = [(i + 1) / num_layers for i in range(num_layers)]
+        weight_sum = sum(weights)
+        weights = [w / weight_sum for w in weights]  # Normalize
+        total_exit_loss = 0.0
+        for i, hidden_states in enumerate(all_layer_hidden_states):
+            # Get logits for this layer
+            logits = self.lm_head(hidden_states)
+            # Compute cross-entropy loss
+            shift_logits = logits[..., :-1, :].contiguous()
+            shift_labels = labels[..., 1:].contiguous()
+            loss_fct = nn.CrossEntropyLoss()
+            layer_loss = loss_fct(shift_logits.view(-1, self.vocab_size), shift_labels.view(-1))
+            # Weight and accumulate
+            total_exit_loss += weights[i] * layer_loss
+        return total_exit_loss
+    def forward(
+        self,
+        input_ids=None,
+        attention_mask=None,
+        position_ids=None,
+        past_key_values=None,
+        inputs_embeds=None,
+        labels=None,
+        use_cache=None,
+        output_attentions=None,
+        output_hidden_states=None,
+        return_dict=None,
+    ):
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        # Get all layer outputs if training with labels (for early exit loss)
+        return_all = self.training and labels is not None
+        hidden_states, past_key_values_output, all_layer_hidden_states = self.model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            use_cache=use_cache,
+            output_hidden_states=output_hidden_states,
+            return_all_layer_outputs=return_all,
+        )
+        logits = self.lm_head(hidden_states)
+        logits = logits.float()
+        loss = None
+        if labels is not None:
+            # Main loss (final layer)
+            shift_logits = logits[..., :-1, :].contiguous()
+            shift_labels = labels[..., 1:].contiguous()
+            loss_fct = nn.CrossEntropyLoss()
+            main_loss = loss_fct(shift_logits.view(-1, self.vocab_size), shift_labels.view(-1))
+            # Early exit loss (all intermediate layers)
+            if all_layer_hidden_states is not None and len(all_layer_hidden_states) > 0:
+                early_exit_loss = self.compute_early_exit_loss(all_layer_hidden_states[:-1], labels)  # Exclude final layer
+                # Combine: main + weighted early exit
+                loss = main_loss + self.config.early_exit_loss_weight * early_exit_loss
+            else:
+                loss = main_loss
+        if not return_dict:
+            output = (logits,) + (past_key_values_output,)
+            return (loss,) + output if loss is not None else output
+        return CausalLMOutputWithPast(
+            loss=loss,
+            logits=logits,
+            past_key_values=past_key_values_output,
+            hidden_states=None,
+            attentions=None,
+        )
+    def prepare_inputs_for_generation(self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs):
+        if past_key_values is not None:
+            past_length = past_key_values[0][0].shape[2]
+            if input_ids.shape[1] > past_length:
+                remove_prefix_length = past_length
+            else:
+                remove_prefix_length = input_ids.shape[1] - 1
+            input_ids = input_ids[:, remove_prefix_length:]
+        position_ids = kwargs.get("position_ids", None)
+        if attention_mask is not None and position_ids is None:
+            position_ids = attention_mask.long().cumsum(-1) - 1
+            position_ids.masked_fill_(attention_mask == 0, 1)
+            if past_key_values:
+                position_ids = position_ids[:, -input_ids.shape[1] :]
+        if inputs_embeds is not None and past_key_values is None:
+            model_inputs = {"inputs_embeds": inputs_embeds}
+        else:
+            model_inputs = {"input_ids": input_ids}
+        model_inputs.update({
+            "position_ids": position_ids,
+            "past_key_values": past_key_values,
+            "use_cache": kwargs.get("use_cache"),
+            "attention_mask": attention_mask,
+        })
+        return model_inputs
+    @staticmethod
+    def _reorder_cache(past_key_values, beam_idx):
+        reordered_past = ()
+        for layer_past in past_key_values:
+            reordered_past += (
+                tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
+            )
+        return reordered_past
+    def set_exit_layer(self, exit_layer):
+        """Set early exit layer for inference."""
+        self.config.inference_exit_layer = exit_layer
+        self.model.config.inference_exit_layer = exit_layer
+BitSkipV1EarlyExitConfig.register_for_auto_class()
+BitSkipV1ForCausalLMWithEarlyExit.register_for_auto_class("AutoModelForCausalLM")

special_tokens_map.json ADDED Viewed

	@@ -0,0 +1,6 @@

+{
+  "bos_token": "<|endoftext|>",
+  "eos_token": "<|endoftext|>",
+  "pad_token": "<|endoftext|>",
+  "unk_token": "<|endoftext|>"
+}

tokenizer.json ADDED Viewed

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED Viewed

	@@ -0,0 +1,20 @@

+{
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "50256": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<|endoftext|>",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|endoftext|>",
+  "model_max_length": 1024,
+  "pad_token": "<|endoftext|>",
+  "tokenizer_class": "GPT2Tokenizer",
+  "unk_token": "<|endoftext|>"
+}

vocab.json ADDED Viewed

The diff for this file is too large to render. See raw diff