Groundbreaking language-vision architectures like CLIP and DALL-E demonstrated the utility of training on large amounts of noisy image-text data, without relying on the expensive, accurate labels used in standard unimodal supervised vision learning. The resulting models showed strong text-guided image generation and transfer to downstream tasks, while performing remarkably well at zero-shot classification with noteworthy out-of-distribution robustness. Since then, large-scale language-vision models like ALIGN, BASIC, GLIDE, Flamingo and Imagen have made further improvements.
Studying the training and capabilities of such models requires datasets containing billions of image-text pairs. Until now, no datasets of this size have been made openly available to the broader research community. To address this problem and democratize research on large-scale
multi-modal models, we present LAION-5B, a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, 2.32 billion of which contain English text. We show successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset, and discuss further experiments enabled by an openly available dataset of this scale. Additionally, we provide several nearest-neighbor indices, an improved web interface for dataset exploration and subset generation, and detection scores for watermark, NSFW, and toxic content.
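To make the CLIP-filtering criterion concrete, the sketch below scores a single image-text pair with an open CLIP model and keeps the pair only if the image-text cosine similarity clears a threshold. The ViT-B-32 checkpoint, the open_clip usage, and the 0.28 cutoff are illustrative assumptions rather than values stated in this abstract.

```python
# Minimal sketch of CLIP-similarity filtering for one image-text pair.
# Model choice and the 0.28 threshold are assumptions for illustration.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def keep_pair(image_path: str, caption: str, threshold: float = 0.28) -> bool:
    """Return True if CLIP cosine similarity of image and caption exceeds the threshold."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([caption])
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text)
        # Normalize embeddings so the dot product equals cosine similarity.
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
        similarity = (img_emb @ txt_emb.T).item()
    return similarity >= threshold
```

In a crawl-scale pipeline this check would run in large batches over candidate pairs, with pairs below the threshold discarded before the dataset is assembled.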