Claude Can Now Run AI Research Experiments with These AI Research Engineering Skills

Community Article · Published December 4, 2025

Modularized Knowledge for Various AI Research Experiments

By Zechen Zhang and Amber Liu



The rise of frontier models like ChatGPT was driven by scaling, not just algorithmic breakthroughs. Scaling laws revealed that more compute, more data, and larger models unlock emergent capabilities that smaller systems lack. That scaling is inseparable from large-scale, efficient LLM training systems: without strong AI research engineering, we could not iterate on ideas fast enough or train trillion-parameter models within feasible time and cost. In short, better AI research engineering skills make faster and better science possible.

Modern AI research is no longer just about designing algorithms; it is also about mastering the AI research engineering skills that span the entire experimental workflow: how to prepare high-quality datasets; how to configure, train, and deploy models efficiently; how to evaluate and align them safely; and how to document results with scientific rigor. Today's researchers must navigate a complex ecosystem of interdependent tools that collectively make experiments more efficient, reliable, and credible, bridging the gap between having an excellent research idea and realizing it.

Yet this ecosystem evolves too quickly for any individual to keep up. APIs change, arguments are deprecated, and configuration formats shift. While LLM-based coding agents are proficient at general-purpose programming, they often lack domain-specific engineering knowledge. For example, researchers may struggle to configure the optimal training parallelism, to pick the right options among a framework's overwhelming number of configuration arguments, or to implement a specialized model against a framework's interfaces; all of this demands specialized research engineering skills.

That's the motivation behind our AI Research Engineering Skills Library—a unified, modular knowledge base designed to help both humans and AI agents conduct AI research experimentation with various AI research engineering skills and tools.

Introducing the AI Research Engineering Skills Library

Our AI Research Engineering Skills Library is the most comprehensive open-source library of AI research engineering skills, designed to empower AI agents to autonomously conduct end-to-end scientific experimentation: preparing datasets, executing training pipelines, deploying models, and analyzing results, just like a real researcher. We believe this is one of the most important puzzle pieces on the path to a real AI Research Agent.

[Figure: system diagram of the AI Research Engineering Skills Library]

Our focus is quality over quantity: 70 skills across 19 categories, each offering comprehensive, expert-level guidance complete with real-world code examples, troubleshooting guides, and production-ready workflows.

Skill Categories

Model Architecture (5 skills)

  • LitGPT: Lightning AI's library of over 20 clean, optimized LLM implementations with production-grade training recipes.
  • Mamba: State-space model with O(n) complexity, up to 5× faster than Transformers.
  • RWKV: RNN+Transformer hybrid with infinite context length, now a Linux Foundation project.
  • NanoGPT: Andrej Karpathy's educational GPT implementation in ~300 lines of clean code.
  • Gemma-2B: Google's lightweight 2B parameter open model with strong performance on common benchmarks.

Tokenization (2 skills)

  • HuggingFace Tokenizers: Rust-based tokenization processing at <20s/GB, supporting BPE, WordPiece, and Unigram algorithms (see the sketch after this list).
  • SentencePiece: Language-independent tokenization at 50k sentences/sec, used by T5 and ALBERT.
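
To make this concrete, here is a minimal sketch of training a BPE tokenizer with HuggingFace Tokenizers; the corpus path, vocabulary size, and special tokens are illustrative placeholders, not values from the skill itself:

```python
# Minimal BPE training sketch with HuggingFace Tokenizers.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=32000,  # illustrative; chosen per corpus and model budget
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # corpus.txt is a placeholder
tokenizer.save("tokenizer.json")

print(tokenizer.encode("Scaling laws unlock emergent capabilities.").tokens)
```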

Fine-Tuning (4 skills)

  • Axolotl: YAML-based fine-tuning framework supporting 100+ models.
  • LLaMA-Factory: WebUI-based no-code fine-tuning platform.
  • Unsloth: Optimized library for 2× faster QLoRA fine-tuning.
  • PEFT: Hugging Face's Parameter-Efficient Fine-Tuning library for LoRA, QLoRA, and adapters.
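
As a flavor of what the fine-tuning skills cover, here is a minimal PEFT sketch that wraps a causal LM with a LoRA adapter; the base model and hyperparameters are illustrative:

```python
# Minimal LoRA sketch with PEFT: only the low-rank adapter weights are trained.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # illustrative base model
lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update
    lora_alpha=16,              # scaling factor applied to the update
    target_modules=["c_attn"],  # GPT-2's fused attention projection; varies per architecture
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total weights
```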

Data Processing (2 skills)

  • Ray Data: Distributed ML data processing with streaming execution and GPU support.
  • NeMo Curator: GPU-accelerated data curation with 16× faster deduplication.

Distributed Training (5 skills)

  • Megatron-Core: NVIDIA's framework for training 2B-462B parameter models with 47% MFU on H100s.
  • DeepSpeed: Microsoft's ZeRO optimization suite for memory-efficient training.
  • PyTorch FSDP: Native PyTorch Fully Sharded Data Parallelism for large-scale distributed training.
  • Accelerate: Hugging Face's 4-line API for managing distributed training with minimal code changes (see the sketch after this list).
  • PyTorch Lightning: High-level training framework with structured Trainer class.
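
The "4-line" Accelerate integration mentioned above looks roughly like this; the toy model and data are illustrative stand-ins for any PyTorch training loop:

```python
# Minimal Accelerate sketch: wrap an existing PyTorch loop; Accelerate handles
# device placement and distributed setup (DDP, FSDP, ...) from its config.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

model = torch.nn.Linear(16, 1)  # toy model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = DataLoader(
    TensorDataset(torch.randn(64, 16), torch.randn(64, 1)), batch_size=8
)
loss_fn = torch.nn.MSELoss()

accelerator = Accelerator()                          # 1. create the accelerator
model, optimizer, dataloader = accelerator.prepare(  # 2. wrap model/optimizer/data
    model, optimizer, dataloader
)
for x, y in dataloader:
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    accelerator.backward(loss)                       # 3. replaces loss.backward()
    optimizer.step()                                 # 4. unchanged
```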

Post-Training (4 skills)

  • TRL Fine-Tuning: Hugging Face's Transformer Reinforcement Learning library for advanced fine-tuning.
  • GRPO-RL-Training: Gold-standard Group Relative Policy Optimization implementation with TRL (see the sketch after this list).
  • OpenRLHF: Complete RLHF pipeline built on Ray and vLLM.
  • SimPO: Simple Preference Optimization algorithm without requiring a reference model.
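
For the GRPO skill referenced above, training with TRL's GRPOTrainer looks roughly like this; the model, prompts, and toy length-based reward are illustrative, and the exact interface may differ across TRL versions:

```python
# Minimal GRPO sketch with TRL: sample groups of completions per prompt and
# optimize relative to the group's average reward (no value network needed).
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

dataset = Dataset.from_dict({"prompt": ["Write a haiku about scaling laws."] * 64})

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 50 characters.
    return [-abs(50 - len(c)) for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",  # illustrative small instruct model
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-demo"),
    train_dataset=dataset,
)
trainer.train()
```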

Inference & Serving (4 skills)

  • vLLM: Production-ready high-throughput LLM serving engine with PagedAttention (see the sketch after this list).
  • TensorRT-LLM: NVIDIA's fastest inference engine achieving 24k tokens/sec with FP8/INT4 quantization.
  • llama.cpp: CPU and Apple Silicon inference with GGUF quantization support.
  • SGLang: Structured generation engine with RadixAttention, 5-10× faster for agent workloads.
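
As an example of the serving skills, offline batch generation with vLLM takes only a few lines; the model name and sampling parameters are illustrative:

```python
# Minimal vLLM sketch: PagedAttention manages KV-cache memory under the hood.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative model id
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

outputs = llm.generate(["Explain scaling laws in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```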

Safety & Alignment (3 skills)

  • Constitutional AI: AI-driven self-improvement framework guided by core principles.
  • LlamaGuard: Specialized safety classifier for filtering LLM inputs and outputs.
  • NeMo Guardrails: Programmable guardrails toolkit with Colang for conversational AI.

Optimization (6 skills)

  • Flash Attention: 2-4× faster attention mechanism with improved memory efficiency.
  • bitsandbytes: 8-bit and 4-bit quantization library for 50-75% memory reduction (see the sketch after this list).
  • GPTQ: 4-bit post-training quantization achieving 4× memory reduction with <2% accuracy loss.
  • AWQ: Activation-aware Weight Quantization for efficient 4-bit model compression.
  • HQQ: Half-Quadratic Quantization for fast and accurate weight quantization.
  • GGUF: GPT-Generated Unified Format for CPU/GPU inference with llama.cpp.
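
For instance, the bitsandbytes skill boils down to a quantization config passed at load time; the model id and settings below are illustrative:

```python
# Minimal 4-bit loading sketch with bitsandbytes via transformers.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed and stability
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",  # illustrative; any causal LM on the Hub
    quantization_config=bnb_config,
    device_map="auto",
)
```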

Evaluation (1 skill)

  • lm-evaluation-harness: EleutherAI's benchmark standard for evaluating LLMs across 60+ tasks.
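
A minimal run through the harness's Python API looks like this; the model and task choices are illustrative:

```python
# Minimal lm-evaluation-harness sketch via its Python entry point.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                    # HuggingFace transformers backend
    model_args="pretrained=gpt2",  # any Hub model id
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task metrics such as accuracy and stderr
```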

Infrastructure (3 skills)

  • Modal: Serverless cloud for ML with instant GPU provisioning and simple Python SDK.
  • SkyPilot: Multi-cloud orchestration for ML workloads with automatic spot instance management.
  • Lambda Labs: Cloud GPU platform optimized for deep learning with pre-configured environments.

Agents (4 skills)

  • LangChain: Most popular agent framework with 500+ integrations, ReAct pattern implementation.
  • LlamaIndex: Data framework for LLM apps with 300+ connectors, RAG-focused architecture.
  • CrewAI: Multi-agent orchestration framework for collaborative AI workflows.
  • AutoGPT: Autonomous AI agent framework for goal-driven task completion.

RAG (5 skills)

  • Chroma: Open-source embedding database, local/cloud deployment, 24k GitHub stars.
  • FAISS: Facebook's similarity search library, billion-scale vectors, GPU acceleration (paired with Sentence Transformers in the sketch after this list).
  • Sentence Transformers: 5000+ embedding models, multilingual support, 15k GitHub stars.
  • Pinecone: Managed vector database with auto-scaling and <100ms query latency.
  • Qdrant: High-performance vector database with filtering, cloud-native architecture.
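
To show how these pieces compose, here is a minimal retrieval sketch pairing Sentence Transformers with FAISS; the documents and model name are illustrative:

```python
# Minimal dense-retrieval sketch: embed documents, index them, search a query.
import faiss
from sentence_transformers import SentenceTransformer

docs = [
    "Mamba is a state-space model with linear-time inference.",
    "vLLM serves LLMs with PagedAttention.",
    "LoRA fine-tunes models with low-rank adapters.",
]
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(docs, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine on unit vectors
index.add(embeddings)

query = encoder.encode(["How does vLLM manage memory?"], normalize_embeddings=True)
scores, ids = index.search(query, 2)            # top-2 nearest documents
print([docs[i] for i in ids[0]])
```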

Multimodal (7 skills)

  • CLIP: OpenAI's vision-language model with zero-shot classification, 25k GitHub stars (see the sketch after this list).
  • Whisper: Robust speech recognition supporting 99 languages, 73k GitHub stars.
  • LLaVA: Vision-language assistant with image chat capabilities at GPT-4V level.
  • Stable Diffusion: State-of-the-art text-to-image generation with ControlNet support.
  • Segment Anything: Meta's foundation model for image segmentation tasks.
  • BLIP-2: Bootstrapped language-image pre-training for vision-language understanding.
  • AudioCraft: Meta's audio generation library for music and sound effects.
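
As a taste of the multimodal skills, zero-shot image classification with CLIP takes a handful of lines; the image path and labels are illustrative placeholders:

```python
# Minimal CLIP zero-shot classification sketch via transformers.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # placeholder image path
labels = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity scores
print(dict(zip(labels, logits.softmax(dim=-1)[0].tolist())))
```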

Prompt Engineering (4 skills)

  • DSPy: Programming framework for optimizing LM prompts and weights declaratively.
  • Guidance: Structured output generation with handlebars templates and role-based parsing.
  • Outlines: Constrained generation library with regex and JSON schema support.
  • LangSmith: LLM observability platform with prompt versioning and debugging tools.

MLOps (3 skills)

  • MLflow: Complete ML lifecycle management with experiment tracking and model registry (see the sketch after this list).
  • Weights & Biases: ML experiment tracking with real-time visualization and collaborative dashboards.
  • TensorBoard: TensorFlow's visualization toolkit for tracking metrics and model graphs.
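
Experiment tracking with MLflow, for example, reduces to a few calls; the experiment name, parameters, and loss values below are illustrative:

```python
# Minimal MLflow tracking sketch: log parameters and a metric over steps.
import mlflow

mlflow.set_experiment("lora-ablation")  # illustrative experiment name
with mlflow.start_run(run_name="rank-8"):
    mlflow.log_params({"lora_rank": 8, "lr": 2e-4})
    for step, loss in enumerate([2.1, 1.7, 1.4]):  # stand-in training curve
        mlflow.log_metric("train_loss", loss, step=step)
```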

Observability (2 skills)

  • LangSmith: LLM observability platform with tracing, debugging, and prompt versioning.
  • Phoenix: Open-source LLM observability with tracing, evaluation, and dataset management.

Emerging Techniques (6 skills)

  • MoE (Mixture of Experts): Sparse model architecture activating only relevant experts per token.
  • Test-Time Training: Dynamic adaptation during inference for improved accuracy.
  • Multi-Agent Systems: Coordination frameworks for multiple specialized AI agents.
  • Constitutional AI: Self-improving AI systems guided by human-defined principles.
  • Reasoning Models: Advanced inference techniques like chain-of-thought and tree search.
  • Knowledge Distillation: Transferring knowledge from large models to smaller, efficient ones.

The Philosophy Behind Skills

Think of skills as invocable knowledge packages: modular units that contain structured instructions, references, and resources for a specific domain. This concept, pioneered in Claude's transition from a monolithic CLAUDE.md file to skills, marks a new era of modular intelligence. Regardless of future model architectures or context window sizes, this design principle will remain a cornerstone of AGI: separating lightweight reasoning from heavy resource retrieval.

Key ideas:

  • Instant invocation mechanism: Instead of loading all tool descriptions into the model at once, a dedicated tool retriever dynamically searches and loads the right one when needed, just like browsing functions in a directory (see the sketch after this list).
  • Reduced context burden: Tool information is injected only when relevant, avoiding context overflow and resource waste while improving reasoning focus and response efficiency.
  • Higher precision and quality: On-demand loading ensures the model only uses the most relevant tools, reducing interference and miscalls, leading to more focused and interpretable outputs.
  • Simplified version management: Tools can be updated or replaced directly without complex version control, keeping the system lightweight and flexible.
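
To illustrate the on-demand loading idea in the list above, here is a minimal sketch; the SkillStub class, directory layout, and keyword matching are our own illustration, not the library's actual implementation:

```python
# Sketch of on-demand skill loading: only lightweight stubs live in context;
# a skill's full instructions are read from disk when it is actually invoked.
from dataclasses import dataclass
from pathlib import Path

@dataclass
class SkillStub:
    name: str         # skill identifier the agent sees up front
    description: str  # one-line blurb used for retrieval
    path: Path        # full instructions stay on disk until needed

def index_skills(root: Path) -> list[SkillStub]:
    # Assume each skill is a folder containing a SKILL.md with detailed guidance.
    return [
        SkillStub(p.parent.name, p.read_text().splitlines()[0], p)
        for p in root.glob("*/SKILL.md")
    ]

def invoke(stubs: list[SkillStub], query: str) -> str:
    # Toy retriever: keyword match; a real system would use embedding search.
    hit = next(s for s in stubs if query.lower() in s.name.lower())
    return hit.path.read_text()  # the full skill enters the context only now

skills = index_skills(Path("skills"))
print(invoke(skills, "vllm"))  # loads just the vLLM skill, nothing else
```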

Imagine giving Claude—or any advanced model—a set of skills encapsulating a financial analyst's tools, experience, and workflows. You'd essentially create an AI financial analyst. Extend this to every knowledge domain, and a Skills Marketplace becomes a true digital labor market, capable of simulating any professional role.

As Dario Amodei once emphasized, the simplest solutions often scale the furthest. Skills are precisely that: a simple but transformative step toward general intelligence.

Research Skills in Production

Case Study: Reproducing Cutting-Edge LoRA Research

Want to see these research skills in action? I used the GRPO-RL-Training and PPTX skills from this library to reproduce Thinking Machines Lab's cutting-edge LoRA research—complete with GPU provisioning, experiment tracking, and automated presentation generation.

In just 2 days (instead of the typical 2-3 weeks), I validated their findings on both supervised fine-tuning and reinforcement learning tasks, generated publication-ready plots, and delivered a comprehensive analysis—all through natural language conversations with Orchestra.

Read the full case study →

Democratizing AI Research

We open-sourced this collection of AI Research Engineering Skills because we believe AI research should no longer be a privilege. Our goal is to democratize the entire AI research workflow, making it accessible to anyone with curiosity, not just those with compute clusters or elite research engineering teams.

By encapsulating AI research engineering knowledge into reusable, modular, and invocable skills, a student, a startup founder, or a cross-disciplinary researcher in a developing region can now wield the same experimental capabilities as top AI labs: launching large-scale training, evaluating models, and running reproducible experiments with the help of AI agents.

A Future Without Gatekeepers

We envision a future where every individual can contribute to AI research, where the frontier of science is no longer gated by hardware or institutional privilege, but powered by collective intelligence and creativity.

The AI Research Engineering Skills Library is our contribution to this vision—a step toward a world where groundbreaking research is limited only by imagination, not resources.


Explore the AI Research Engineering Skills Library on GitHub, and start conducting AI research experiments by just prompting Orchestra Research.

