// arxiv
The foundational papers that define modern large language models. Read these in order.
2017
Attention Is All You Need
Vaswani et al. (Google Brain)
THE Transformer paper. Nearly every modern LLM builds on this architecture.
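The core operation of the paper, scaled dot-product attention, fits in a few lines. This is a minimal single-head sketch in plain Python with no masking, batching, or learned projections; the function names are illustrative and not taken from any library.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs
```

When all keys are identical, the weights are uniform and the output is simply the mean of the value vectors.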
2018
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Devlin et al. (Google)
Established masked language modeling pre-training. Still widely used.
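Masked language modeling corrupts a fraction of the input tokens and trains the model to recover the originals. A rough sketch of the corruption step, using the 15% selection rate and 80/10/10 split described in the paper; `mask_tokens`, the `vocab` list, and the `rng` argument are hypothetical names for illustration.

```python
import random

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted_inputs, labels); labels are None where no prediction is required."""
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < select_prob:
            labels.append(tok)  # the model must predict the original token here
            roll = rng.random()
            if roll < 0.8:
                inputs.append("[MASK]")           # 80%: replace with the mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)                # 10%: keep the token unchanged
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels
```

Usage: `mask_tokens("the cat sat on the mat".split(), vocab, random.Random(0))`, where `vocab` is any list of candidate tokens.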
2020
Language Models are Few-Shot Learners (GPT-3)
Brown et al. (OpenAI)
175B-parameter model that showed strong few-shot, in-context learning at scale.
2022
Constitutional AI: Harmlessness from AI Feedback
Bai et al. (Anthropic)
Trains a harmless assistant from AI-generated feedback guided by a written set of principles. How Claude is trained to be helpful and safe; a core Anthropic technique.
2020
Scaling Laws for Neural Language Models
Kaplan et al. (OpenAI)
Predicts how model loss scales with compute, data, and parameters.
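The scaling relations in this paper are power laws, so a quick calculation shows what one doubling buys. A small sketch, assuming the loss follows L(N) proportional to N^(-alpha) when data and compute are not the bottleneck; the default exponent 0.076 is approximately the value the paper reports for non-embedding parameter count and should be treated as an assumption here.

```python
def loss_ratio(param_multiplier, alpha=0.076):
    """Factor by which loss changes when the parameter count is multiplied by param_multiplier.

    Assumes L(N) is proportional to N ** (-alpha); alpha is an assumed, approximate exponent.
    """
    return param_multiplier ** (-alpha)

# Doubling the parameter count under this assumed exponent:
doubling = loss_ratio(2.0)
# A 10x increase in parameter count:
tenfold = loss_ratio(10.0)
```

Under this exponent each doubling of parameters trims the loss by only about five percent, which is why the paper's gains come from scaling over many orders of magnitude.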
2022
Training Language Models to Follow Instructions with Human Feedback (InstructGPT)
Ouyang et al. (OpenAI)
Applied the RLHF pipeline (supervised fine-tuning, reward model, PPO) to instruction following. The paper that popularized the 'assistant' paradigm.
2023
LLaMA: Open and Efficient Foundation Language Models
Touvron et al. (Meta AI)
Well-documented open-weight LLM family (7B to 65B parameters). Great for studying architecture.
2022
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Dao et al.
IO-aware exact attention. Critical for training long-context models.
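The "exact" part rests on an online softmax: keys and values are processed block by block while a running maximum and normalizer are rescaled, so the full score matrix is never stored. The sketch below shows only that numerical trick for a single query in plain Python; it does not model GPU memory tiers or the paper's kernel, and `streaming_attention` is a hypothetical name.

```python
import math

def streaming_attention(q, keys, values, block_size=2):
    """Exact attention output for one query, visiting keys/values one block at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")   # largest score seen so far
    denom = 0.0                   # running softmax normalizer
    acc = [0.0] * len(values[0])  # running unnormalized weighted sum of values
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (factor is 0.0 on the first block).
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]
```

The result matches ordinary softmax attention up to floating-point rounding, whatever `block_size` is chosen.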
2023
Direct Preference Optimization: Your Language Model is Secretly a Reward Model (DPO)
Rafailov et al.
Simplifies RLHF by optimizing directly on preference pairs, without a separate reward model. Widely adopted.
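The DPO objective for one preference pair is short enough to write out. A minimal sketch, assuming the four inputs are summed token log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model; the argument names and the default `beta=0.1` are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)) for one pair."""
    # Implicit rewards: how far the policy has moved from the reference on each response.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))
```

When the policy equals the reference the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response relative to the rejected one.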
2021
LoRA: Low-Rank Adaptation of Large Language Models
Hu et al. (Microsoft)
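Fine-tunes large models by training small low-rank update matrices while the base weights stay frozen, cutting trainable parameters and memory sharply. Most useful paper for practitioners.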
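The LoRA idea is to leave a weight matrix W frozen and learn a low-rank correction B·A on top of it. A plain-Python sketch of the adapted forward pass for one layer, assuming W is d_out × d_in, A is r × d_in, and B is d_out × r; the helper names are hypothetical.

```python
def matvec(matrix, vector):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); only A and B would be trained."""
    base = matvec(W, x)             # frozen pretrained path
    low = matvec(B, matvec(A, x))   # project down to rank r, then back up
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low)]

def trainable_params(d_in, d_out, r):
    # Parameters in A and B, versus d_in * d_out for full fine-tuning of W.
    return r * (d_in + d_out)
```

With B initialized to zeros, as in the paper, the adapted layer starts out identical to the pretrained one. For a 4096 × 4096 layer at rank 8, `trainable_params(4096, 4096, 8)` is 65,536 against roughly 16.8 million weights in W.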
2017
Proximal Policy Optimization Algorithms
Schulman et al. (OpenAI)
The RL algorithm that powers RLHF training. Must understand.
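The heart of PPO is its clipped surrogate objective, which stops rewarding a policy change once the probability ratio leaves a small band around 1. A per-sample sketch, where `ratio` is the new-to-old action probability ratio and `advantage` is an advantage estimate; the function name and the default `epsilon=0.2` are assumptions for illustration.

```python
def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """min(ratio * A, clip(ratio, 1 - epsilon, 1 + epsilon) * A), to be maximized."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum gives a pessimistic bound on the unclipped objective.
    return min(ratio * advantage, clipped_ratio * advantage)
```

For a positive advantage the objective stops growing once `ratio` exceeds `1 + epsilon`; for a negative advantage it stops improving once `ratio` drops below `1 - epsilon`.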
2023
Sparks of Artificial General Intelligence: Early experiments with GPT-4
Bubeck et al. (Microsoft)
Fascinating analysis of GPT-4's emergent capabilities and reasoning.
2023
GPT-4 Technical Report
OpenAI
State of the art at its 2023 release. Shows what the frontier looked like, though the report discloses few architecture or training details.
2019
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Rajbhandari et al. (Microsoft)
Distributed training memory optimization. You need this to train large models.
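The memory argument in ZeRO is simple arithmetic. A sketch of per-GPU model-state memory, assuming mixed-precision Adam at 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 weights, momentum, and variance) and ignoring activations and buffers; `zero_memory_gb` is a hypothetical helper.

```python
def zero_memory_gb(params, gpus, stage):
    """Approximate per-GPU model-state memory in GB for ZeRO stage 0 (none) to 3."""
    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= gpus    # stage 1: partition optimizer states
    if stage >= 2:
        grads /= gpus    # stage 2: also partition gradients
    if stage >= 3:
        weights /= gpus  # stage 3: also partition the fp16 weights
    return params * (weights + grads + optim) / 1e9

# The paper's running example: a 7.5B-parameter model on 64 GPUs.
per_stage = [zero_memory_gb(7.5e9, 64, stage) for stage in range(4)]
```

For that example the four stages come to about 120 GB, 31.4 GB, 16.6 GB, and 1.9 GB per GPU.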
2014
Adam: A Method for Stochastic Optimization
Kingma & Ba
The optimizer behind almost every LLM. Understand it at a deep level.
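One Adam update for a single scalar parameter, written out as a sketch. The default hyperparameters are the ones suggested in the paper; `adam_step` is an illustrative name, and a real training loop would apply this elementwise to every parameter.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return (new_param, new_m, new_v) after step t, where t starts at 1."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `lr` regardless of the gradient's magnitude.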
2021
RoFormer: Enhanced Transformer with Rotary Position Embedding (RoPE)
Su et al.
Position encoding used in LLaMA and Mistral, among others. Rotates query and key vectors so attention scores depend on relative position.
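Rotary embeddings rotate each pair of dimensions in a query or key by an angle proportional to the token position. A plain-Python sketch that pairs adjacent dimensions, as in the paper's formulation (some implementations pair the two halves of the vector instead); `rope` and `dot` are hypothetical helper names and the vector length is assumed to be even.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (vec[i], vec[i+1]) by position * base ** (-i / d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower dimensions rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
```

Because each pair is rotated, `dot(rope(q, m), rope(k, n))` depends only on the offset `n - m`, so positions 5 and 3 give the same score as positions 12 and 10.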
2020
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Gao et al. (EleutherAI)
How to build training datasets at scale. Data quality matters most.
2024
Mixtral of Experts
Jiang et al. (Mistral AI)
Strong open-weight MoE model that routes each token to 2 of 8 experts. Good for understanding sparse expert routing.
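Sparse routing can be sketched in a few lines: a router scores every expert, only the top k run, and their outputs are mixed with a softmax taken over the selected logits. The version below is a toy for one token with scalar-valued experts; `route_top_k` and `moe_forward` are hypothetical names, and the router logits are passed in rather than computed by a learned layer.

```python
import math

def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, router_logits, experts, k=2):
    # Only the selected experts are evaluated; the rest cost nothing for this token.
    return sum(weight * experts[i](x) for i, weight in route_top_k(router_logits, k))
```

With 8 experts and k=2, each token touches only a quarter of the expert parameters, which is how a sparse model keeps per-token compute well below its total parameter count.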