Collections including paper arxiv:2312.04985

- QLoRA: Efficient Finetuning of Quantized LLMs
  Paper • 2305.14314 • Published • 57
- Training Transformers with 4-bit Integers
  Paper • 2306.11987 • Published • 22
- FasterViT: Fast Vision Transformers with Hierarchical Attention
  Paper • 2306.06189 • Published • 31
- DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
  Paper • 2309.14509 • Published • 19

- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
  Paper • 2312.00752 • Published • 148
- SparQ Attention: Bandwidth-Efficient LLM Inference
  Paper • 2312.04985 • Published • 40
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
  Paper • 2402.00159 • Published • 65
- Neural Network Diffusion
  Paper • 2402.13144 • Published • 99

- MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
  Paper • 2311.07689 • Published • 9
- DiLoCo: Distributed Low-Communication Training of Language Models
  Paper • 2311.08105 • Published • 16
- SparQ Attention: Bandwidth-Efficient LLM Inference
  Paper • 2312.04985 • Published • 40
- Aligning Large Language Models with Counterfactual DPO
  Paper • 2401.09566 • Published • 2

- Ziya2: Data-centric Learning is All LLMs Need
  Paper • 2311.03301 • Published • 20
- Co-training and Co-distillation for Quality Improvement and Compression of Language Models
  Paper • 2311.02849 • Published • 8
- MFTCoder: Boosting Code LLMs with Multitask Fine-Tuning
  Paper • 2311.02303 • Published • 12
- ADaPT: As-Needed Decomposition and Planning with Language Models
  Paper • 2311.05772 • Published • 15

- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
  Paper • 2312.00752 • Published • 148
- SparQ Attention: Bandwidth-Efficient LLM Inference
  Paper • 2312.04985 • Published • 40
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models
  Paper • 2401.04658 • Published • 27
- E^2-LLM: Efficient and Extreme Length Extension of Large Language Models
  Paper • 2401.06951 • Published • 26

- FlashDecoding++: Faster Large Language Model Inference on GPUs
  Paper • 2311.01282 • Published • 37
- Co-training and Co-distillation for Quality Improvement and Compression of Language Models
  Paper • 2311.02849 • Published • 8
- Prompt Cache: Modular Attention Reuse for Low-Latency Inference
  Paper • 2311.04934 • Published • 34
- Exponentially Faster Language Modelling
  Paper • 2311.10770 • Published • 119

- Prompt Cache: Modular Attention Reuse for Low-Latency Inference
  Paper • 2311.04934 • Published • 34
- Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models
  Paper • 2311.08692 • Published • 13
- Exponentially Faster Language Modelling
  Paper • 2311.10770 • Published • 119
- Memory Augmented Language Models through Mixture of Word Experts
  Paper • 2311.10768 • Published • 19

- FlashDecoding++: Faster Large Language Model Inference on GPUs
  Paper • 2311.01282 • Published • 37
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters
  Paper • 2311.03285 • Published • 32
- Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization
  Paper • 2311.06243 • Published • 22
- FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores
  Paper • 2311.05908 • Published • 16