Groundbreaking language-vision architectures like CLIP and DALL-E demonstrated the utility of training
on large amounts of noisy image-text data, without relying on the expensive, accurate labels used in
standard unimodal supervised learning for vision. The resulting models showed strong capabilities in
text-guided image generation and transfer to downstream tasks, while performing remarkably well at
zero-shot classification with noteworthy out-of-distribution robustness. Since then, large-scale
language-vision models such as ALIGN, BASIC, GLIDE, Flamingo, and Imagen have made further improvements.
Studying the training and capabilities of such models requires datasets containing billions of
image-text pairs. Until now, no datasets of this size have been made openly available for the
broader research community. To address this problem and democratize research on large-scale
multi-modal models, we present LAION-5B, a dataset consisting of 5.85 billion CLIP-filtered
image-text pairs, of which 2.32 billion contain English text. We show successful replication and
fine-tuning of foundational models like CLIP, GLIDE, and Stable Diffusion using the dataset, and
discuss further experiments enabled by an openly available dataset of this scale. Additionally,
we provide several nearest-neighbor indices, an improved web interface for dataset exploration
and subset generation, and detection scores for watermarks, NSFW, and toxic content.
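
As context for the "CLIP-filtered" pairs mentioned above, the sketch below illustrates how a single image-text pair can be scored by CLIP cosine similarity and kept or dropped against a cutoff. It is a minimal illustration assuming the open_clip library; the model choice ("ViT-B-32"), the keep_pair helper, and the 0.28 threshold are assumptions for the example, not a description of the exact LAION-5B filtering pipeline.

```python
# Minimal sketch of CLIP-similarity filtering for one image-text pair.
# The model name and threshold are illustrative assumptions, not the
# exact configuration used to build LAION-5B.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def keep_pair(image_path: str, caption: str, threshold: float = 0.28) -> bool:
    """Return True if the image and its caption are similar enough under CLIP."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([caption])
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
    # Cosine similarity between L2-normalized embeddings.
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    similarity = (img_feat * txt_feat).sum(dim=-1).item()
    return similarity >= threshold

# Example usage: keep_pair("cat.jpg", "a photo of a cat")
```

In a web-scale crawl, a check of this kind is applied to every candidate pair, so only pairs whose caption plausibly describes the image survive into the released dataset.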