You already know how LLMs work — text into tokens, tokens into math, predict the next one. Image generation uses the same broad ideas but flips the training game: instead of predicting the next token, the model learns to predict and remove noise. Starting from pure static, it chips away — step by step — until a coherent image emerges. What does Michelangelo have to do with any of this? More than you’d think. This is how image diffusion models work, in 20 minutes.
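The "start from static and chip away at the noise" loop can be sketched in a few lines of NumPy. This is an illustrative toy, not the episode's material: a real diffusion model uses a trained neural network to predict the noise, so here a hand-written "oracle" that already knows the clean image stands in for it, just to show the shape of the DDPM-style reverse loop.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "clean image": a 4x4 gradient standing in for real pixels.
clean = np.linspace(0.0, 1.0, 16).reshape(4, 4)

T = 50                                       # number of diffusion steps
alphas = 1.0 - np.linspace(1e-4, 0.02, T)    # per-step noise schedule
alpha_bar = np.cumprod(alphas)               # cumulative signal fraction

def predict_noise(x_t, t):
    # In a real model this is a trained network; this oracle computes the
    # exact noise implied by the clean image (for illustration only).
    return (x_t - np.sqrt(alpha_bar[t]) * clean) / np.sqrt(1.0 - alpha_bar[t])

# Start from pure static and denoise step by step.
x = rng.standard_normal(clean.shape)
for t in reversed(range(T)):
    eps = predict_noise(x, t)
    # Remove the predicted noise (DDPM-style posterior mean update).
    x = (x - (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
    if t > 0:
        # Re-inject a little noise at every step except the last.
        x += np.sqrt(1.0 - alphas[t]) * rng.standard_normal(clean.shape)

print(np.abs(x - clean).max())  # tiny: the loop recovers the clean image
```

Because the noise predictor here is exact, the loop recovers the clean image almost perfectly; with a learned predictor the same loop produces novel images instead.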
Full shownotes at fragmentedpodcast.com.
Show Notes #
- Episode 303 - How LLMs work in 20 minutes - text generation
- VAE - Variational Autoencoder
- RGB color model - Wikipedia
- Word2Vec technique - Wikipedia
- Efficient Estimation of Word Representations in Vector Space - original Word2Vec paper by Mikolov et al.
- High-Resolution Image Synthesis with Latent Diffusion Models - Rombach et al. (2022) — the paper behind Stable Diffusion
- Image Training data
- Michelangelo
Get in touch #
We’d love to hear from you. Email is the best way to reach us or you can check our contact page for other ways.
We want to hear all your feedback: what’s working, what’s not, and topics you’d like to hear more about.
Co-hosts #
See Ep. #300 — listen to that episode for the full story behind our new direction.