Training A Tiny Diffusion Model For Pixel Art Tilemaps With Denoising Autoencoder Loss

The weird problem I stumbled into

I wanted to generate pixel-art tilemaps (think: 16×16 tiles arranged into a small scene), but I kept running into a very specific failure mode: my outputs looked “smudged” and washed out. The same happened when I tried common denoisers—noise removal helped, but the color blocks blurred like wet watercolor.

After a few weekend sessions of tinkering, I ended up with a niche approach that actually matched the data: instead of treating pixel art like natural images, I trained a tiny diffusion-style denoising model that uses a denoising autoencoder loss designed for discrete-ish color blocks. The result wasn’t perfect, but it was dramatically less smeary and much more “tilemap-like.”

In this post I’ll walk through a working, end-to-end prototype in PyTorch:

Represent tilemaps as small RGB images
Train a tiny denoiser with a diffusion-noise schedule
Use the denoising objective at random timesteps
Sample by iteratively removing noise

No external datasets are needed: I generate a toy tilemap dataset on the fly so the code runs anywhere.

Background: what I mean by “diffusion-style denoising”

A diffusion model typically learns to reverse a “noising process” that turns a clean image into noise. Here I’m implementing a lightweight variant:

Take a clean image x0.
Choose a timestep t.
Add noise to get a corrupted image xt.
Train a network to predict the original image (or the noise).
At sampling time, start from random noise and iteratively denoise.

This version uses a denoising autoencoder loss: the network sees xt and is trained to reconstruct the clean image x0.

A dataset that actually looks like tilemaps

Pixel art tilemaps have sharp edges and repeated structures. To mimic that, I generate small scenes by:

Choosing a palette of colors
Drawing a few “tile types” (floor, wall, water, sky, plus accent and grass)
Assembling them into an 8×8 grid of tiles and rendering each tile as a flat 4×4 pixel block, which gives a 32×32 image

Below is a toy generator that produces images of shape (3, 32, 32).

Step 1: Build a toy tilemap dataset

import math
import random
import torch
from torch.utils.data import Dataset, DataLoader

class TileMapDataset(Dataset):
    """
    Produces toy 'pixel art tilemaps' as low-res RGB images.
    Output: float tensor in [0, 1] with shape (3, H, W).
    """
    def __init__(self, n_samples=2000, tile_size=4, grid_size=8, seed=0):
        super().__init__()
        self.n_samples = n_samples
        self.tile_size = tile_size
        self.grid_size = grid_size
        self.H = grid_size * tile_size
        self.W = grid_size * tile_size

        rng = random.Random(seed)
        # A small palette: floor, wall, water, sky (+ some accent colors)
        self.palette = [
            (30, 30, 30),     # dark floor
            (210, 210, 210),  # wall
            (40, 120, 220),   # water
            (130, 200, 255),  # sky
            (240, 180, 60),   # accent
            (60, 220, 120),   # grass
        ]
        # Normalize palette to [0,1] tensors later

        self.rng = rng

    def _pick_color(self, idx):
        r, g, b = self.palette[idx]
        return torch.tensor([r, g, b], dtype=torch.float32) / 255.0

    def _generate_one(self):
        ts = self.tile_size
        gs = self.grid_size
        H = self.H
        W = self.W

        img = torch.zeros(3, H, W, dtype=torch.float32)

        # Choose a "scene theme"
        theme = self.rng.choice(["cave", "island", "sky"])
        # Create a base map of tile types
        # 0 floor, 1 wall, 2 water, 3 sky, 4 accent, 5 grass
        grid = [[0 for _ in range(gs)] for _ in range(gs)]

        # Basic layout patterns
        if theme == "cave":
            # Random walls and a watery pool
            for y in range(gs):
                for x in range(gs):
                    p = self.rng.random()
                    if p < 0.18:
                        grid[y][x] = 1
                    else:
                        grid[y][x] = 0
            # carve a small "water" rectangle
            x0 = self.rng.randint(1, gs-3)
            y0 = self.rng.randint(1, gs-3)
            w = self.rng.randint(2, 4)
            h = self.rng.randint(2, 4)
            for y in range(y0, min(gs, y0+h)):
                for x in range(x0, min(gs, x0+w)):
                    grid[y][x] = 2

        elif theme == "island":
            # grass around, water at the center, walls as rocks
            cx = gs // 2 + self.rng.randint(-1, 1)
            cy = gs // 2 + self.rng.randint(-1, 1)
            for y in range(gs):
                for x in range(gs):
                    d = math.sqrt((x-cx)**2 + (y-cy)**2)
                    if d < gs * 0.18:
                        grid[y][x] = 2
                    elif d < gs * 0.38:
                        grid[y][x] = 5
                    else:
                        grid[y][x] = 0
            # add a few rocks (walls)
            for _ in range(gs):
                x = self.rng.randint(0, gs-1)
                y = self.rng.randint(0, gs-1)
                if self.rng.random() < 0.25:
                    grid[y][x] = 1

        else:  # sky
            # sky background with some accent "buildings" and ground
            for y in range(gs):
                for x in range(gs):
                    if y < gs * 0.55:
                        grid[y][x] = 3
                    else:
                        grid[y][x] = 0

            # skyline buildings
            for _ in range(gs // 2):
                w = self.rng.randint(1, 3)
                h = self.rng.randint(2, 4)
                x0 = self.rng.randint(0, gs - w)
                y0 = int(gs * 0.55) - h
                for y in range(y0, max(0, y0 + h)):
                    for x in range(x0, x0 + w):
                        if 0 <= y < gs:
                            grid[y][x] = 1

            # add accents (lights)
            for _ in range(gs):
                x = self.rng.randint(0, gs-1)
                y = self.rng.randint(int(gs*0.45), gs-1)
                if grid[y][x] in (0, 1) and self.rng.random() < 0.2:
                    grid[y][x] = 4

        # Render tile grid into pixels
        for ty in range(gs):
            for tx in range(gs):
                tile_idx = grid[ty][tx]
                color = self._pick_color(tile_idx)  # (3,)
                y_start = ty * ts
                x_start = tx * ts
                img[:, y_start:y_start+ts, x_start:x_start+ts] = color[:, None, None]

        return img

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        # Reseed per index so that sample idx is always the same image.
        # Note: this reseeding overrides the seed passed to __init__.
        self.rng.seed(idx + 12345)
        return self._generate_one()
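
The nested rendering loop at the end of _generate_one is easy to read, but the same step can also be written with tensor indexing. This is a minimal sketch, assuming the grid is stored as a long tensor of tile indices and the palette as a (K, 3) float tensor. render_tiles is a hypothetical helper of my own, not part of the dataset class above.

```python
import torch

def render_tiles(grid, palette, tile_size=4):
    """
    grid:    (gs, gs) long tensor of tile indices
    palette: (K, 3) float tensor with values in [0, 1]
    returns: (3, gs * tile_size, gs * tile_size) image
    """
    colors = palette[grid]            # (gs, gs, 3): look up one color per tile
    img = colors.permute(2, 0, 1)     # (3, gs, gs): channels first
    # Blow each tile up into a flat tile_size x tile_size block.
    img = img.repeat_interleave(tile_size, dim=1)
    img = img.repeat_interleave(tile_size, dim=2)
    return img.contiguous()

# Tiny 2x2 example with three of the palette colors used in the dataset.
palette = torch.tensor(
    [[30, 30, 30], [210, 210, 210], [40, 120, 220]], dtype=torch.float32
) / 255.0
grid = torch.tensor([[0, 1], [2, 0]])
img = render_tiles(grid, palette, tile_size=4)
print(img.shape)  # torch.Size([3, 8, 8])
```

For an 8×8 grid the speed difference is negligible, so this is mostly a readability choice.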

Step 2: A tiny denoiser network that respects “blocky” images

For 32×32 pixel art, I used a small convolutional denoiser. The only special bit is timestep conditioning: the network needs to know how strong the noise is.

I use sinusoidal timestep embeddings, pass them through a small MLP, and add the result as a channel-wise bias right after the downsampling convolution.

Model: Simple timestep-conditioned CNN denoiser

import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class SinusoidalTimeEmbedding(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.emb_dim = emb_dim

    def forward(self, t):
        """
        t: (B,) integer timesteps
        returns: (B, emb_dim)
        """
        half = self.emb_dim // 2
        device = t.device
        t = t.float()

        freqs = torch.exp(
            torch.arange(half, device=device).float() * (-math.log(10000.0) / (half - 1))
        )  # (half,)

        args = t[:, None] * freqs[None, :]  # (B, half)
        emb = torch.cat([torch.sin(args), torch.cos(args)], dim=1)  # (B, 2*half)
        if self.emb_dim % 2 == 1:
            emb = F.pad(emb, (0, 1))
        return emb

class TinyDenoiser(nn.Module):
    def __init__(self, in_ch=3, hidden=64, time_emb=128):
        super().__init__()
        self.time_emb = SinusoidalTimeEmbedding(time_emb)
        self.time_mlp = nn.Sequential(
            nn.Linear(time_emb, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
        )

        # Encoder
        self.conv1 = nn.Conv2d(in_ch, hidden, 3, padding=1)
        self.conv2 = nn.Conv2d(hidden, hidden, 3, padding=1)
        self.down = nn.Conv2d(hidden, hidden, 3, stride=2, padding=1)

        # Bottleneck
        self.bot = nn.Conv2d(hidden, hidden, 3, padding=1)

        # Decoder
        self.up = nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1)
        self.conv3 = nn.Conv2d(hidden, hidden, 3, padding=1)
        self.out = nn.Conv2d(hidden, in_ch, 3, padding=1)

    def forward(self, x_t, t):
        """
        x_t: (B, 3, 32, 32) noisy image
        t: (B,) integer timesteps
        returns: predicted x_0 image (B, 3, 32, 32)
        """
        temb = self.time_mlp(self.time_emb(t))  # (B, hidden)

        h = F.silu(self.conv1(x_t))
        h = F.silu(self.conv2(h))

        h = self.down(h)  # (B, hidden, 16, 16)
        # Inject the time embedding as a channel-wise bias
        h = h + temb[:, :, None, None]

        h = F.silu(self.bot(h))

        h = F.silu(self.up(h))
        h = F.silu(self.conv3(h))
        pred = torch.sigmoid(self.out(h))
        return pred
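
Two details of this model are easy to get wrong: the stride-2 convolution and the transposed convolution have to round-trip 32 → 16 → 32, and the time embedding has to broadcast over height and width. Here is a small standalone shape check, using the same kernel, stride, and padding values as the model. The hidden=8 width and the random tensors are placeholders for illustration only.

```python
import torch
import torch.nn as nn

hidden = 8  # placeholder width; the model above uses 64

# Same kernel/stride/padding as the `down` and `up` layers in TinyDenoiser.
down = nn.Conv2d(hidden, hidden, 3, stride=2, padding=1)
up = nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1)

h = torch.randn(2, hidden, 32, 32)   # stand-in feature map
temb = torch.randn(2, hidden)        # stand-in time embedding, one vector per sample

h_down = down(h)                           # (2, hidden, 16, 16)
h_cond = h_down + temb[:, :, None, None]   # bias broadcast over H and W
h_up = up(h_cond)                          # (2, hidden, 32, 32)

print(h_down.shape, h_cond.shape, h_up.shape)
```

Each channel receives the same offset at every spatial position, so the timestep shifts feature statistics without touching spatial structure.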

Step 3: Diffusion schedule + denoising loss

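Here’s the key: I use a simple linear beta schedule, where alpha_bar at timestep t is the cumulative product of (1 - beta) up to t. Then I:

Corrupt the clean image: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise
Train the network to predict x0 from x_t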

To keep outputs blocky, I want a loss that does not reward hedging between two palette colors. A practical option is L1 loss (absolute error): its minimizer is the median of the plausible targets rather than their mean, so for this kind of data it often preserves edges and flat regions better than MSE.
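
To make the L1 versus MSE intuition concrete, here is a toy calculation that is separate from the training code. Assume a single pixel whose plausible clean values are 0.2 in six cases and 0.8 in four; these numbers are made up for illustration. The sketch searches a grid of constant predictions for the one that minimizes each loss.

```python
import torch

# Hypothetical ambiguous pixel: 6 of 10 plausible targets say 0.2, 4 say 0.8.
targets = torch.tensor([0.2] * 6 + [0.8] * 4)

# Candidate constant predictions on a grid over [0, 1].
candidates = torch.linspace(0.0, 1.0, 101)
diff = candidates[:, None] - targets[None, :]   # (101, 10)

l1_per_candidate = diff.abs().mean(dim=1)
mse_per_candidate = diff.pow(2).mean(dim=1)

l1_best = candidates[l1_per_candidate.argmin()].item()
mse_best = candidates[mse_per_candidate.argmin()].item()

print(f"L1 minimizer:  {l1_best:.2f}")   # 0.20, one of the two "palette" values
print(f"MSE minimizer: {mse_best:.2f}")  # 0.44, a blend that is neither value
```

The L1 optimum lands on the majority color, while the MSE optimum is an in-between value. That in-between value is the kind of washed-out color I was trying to avoid.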

Training code (end-to-end)

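import torch
import torch.nn.functional as F
from torch.optim import AdamW
from torch.utils.data import DataLoader

# TileMapDataset (Step 1) and TinyDenoiser (Step 2) are assumed to be defined above.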

# --------------------------
# Diffusion utilities
# --------------------------
def linear_beta_schedule(T, beta_start=1e-4, beta_end=2e-2):
    return torch.linspace(beta_start, beta_end, T)

def compute_alphas(betas):
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)  # (T,)
    return alphas, alpha_bar

class DiffusionTrainer:
    def __init__(self, T=200, device="cpu"):
        self.T = T
        self.device = device
        betas = linear_beta_schedule(T).to(device)
        alphas, alpha_bar = compute_alphas(betas)
        self.betas = betas
        self.alphas = alphas
        self.alpha_bar = alpha_bar

    def q_sample(self, x0, t, noise=None):
        """
        x0: (B, 3, H, W) in [0,1]
        t: (B,) ints in [0, T-1]
        returns: x_t
        """
        if noise is None:
            noise = torch.randn_like(x0)

        a_bar = self.alpha_bar[t].view(-1, 1, 1, 1)  # (B,1,1,1)
        return (a_bar.sqrt() * x0) + ((1.0 - a_bar).sqrt() * noise)

    def training_step(self, model, x0, t):
        noise = torch.randn_like(x0)
        x_t = self.q_sample(x0, t, noise=noise)

        pred_x0 = model(x_t, t)

        # Denoising autoencoder loss: L1 on reconstructed x0
        loss = F.l1_loss(pred_x0, x0)
        return loss

# --------------------------
# Training loop
# --------------------------
def train_demo():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print("device:", device)

    dataset = TileMapDataset(n_samples=2000, tile_size=4, grid_size=8, seed=0)
    loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=0, drop_last=True)

    model = TinyDenoiser(in_ch=3, hidden=64, time_emb=128).to(device)
    trainer = DiffusionTrainer(T=200, device=device)

    opt = AdamW(model.parameters(), lr=2e-3, weight_decay=1e-4)

    steps = 1500
    model.train()
    it = 0
    while it < steps:
        for x0 in loader:
            x0 = x0.to(device)

            B = x0.size(0)
            t = torch.randint(0, trainer.T, (B,), device=device)

            loss = trainer.training_step(model, x0, t)

            opt.zero_grad(set_to_none=True)
            loss.backward()
            opt.step()

            it += 1
            if it % 200 == 0:
                print(f"step {it}/{steps} loss={loss.item():.4f}")
            if it >= steps:
                break

    return model, trainer

if __name__ == "__main__":
    model, trainer = train_demo()

What’s happening in the training step (why this works better for tilemaps)

I randomly pick a timestep t.
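I corrupt the clean tilemap x0 into x_t with the closed-form forward noising rule (q_sample), so any timestep can be reached in one step without simulating the whole chain.
The CNN learns to “undo” the corruption conditioned on t.
Because pixel art has big constant regions, L1 loss tends to keep those regions flat: it pulls each prediction toward the median of the plausible targets rather than the mean that MSE favors, and the mean of two palette colors is a blend that belongs to neither.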
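
It helps to look at how much signal and noise the model actually sees at each timestep. The sketch below recomputes the schedule with the same settings as the training code (T=200, betas from 1e-4 to 2e-2) and prints the two scale factors from the forward noising rule at a few timesteps. The choice of timesteps to print is arbitrary.

```python
import torch

T = 200
betas = torch.linspace(1e-4, 2e-2, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)   # (T,), decreasing in t

for t in (0, 50, 100, 150, 199):
    signal = alpha_bar[t].sqrt().item()          # multiplies x0
    noise = (1.0 - alpha_bar[t]).sqrt().item()   # multiplies the Gaussian noise
    print(f"t={t:3d}  signal scale={signal:.3f}  noise scale={noise:.3f}")
```

With these settings alpha_bar at the last timestep is roughly 0.13, not zero, so even the noisiest training example still contains a faint copy of the clean image. Starting the sampler from pure Gaussian noise is therefore an approximation; a longer schedule or a larger beta_end would shrink that gap.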

Step 4: Sampling (iterative denoising)

Sampling is the reverse process. With this simplified setup (predicting x0 directly), I use a pragmatic update rule:

Start from x_T ~ N(0, I)
For each t descending:
- Predict x0_hat = model(x_t, t)
- Blend x_t toward x0_hat using the schedule to get x_{t-1}

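This isn’t the most theoretically “pure” reverse diffusion (the intermediate x_t are not guaranteed to carry the same noise level the model saw at that t during training), but it produces coherent images for this toy problem, and it uses the network exactly as it was trained: as an x0 predictor.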

Sampling code

import torch

@torch.no_grad()
def sample(model, trainer, n_samples=8, H=32, W=32, device="cpu"):
    model.eval()
    T = trainer.T

    x = torch.randn(n_samples, 3, H, W, device=device)

    for t in reversed(range(T)):
        t_batch = torch.full((n_samples,), t, device=device, dtype=torch.long)

        x0_hat = model(x, t_batch)  # predicted clean image in [0,1]

        if t > 0:
            a_bar_t = trainer.alpha_bar[t]
            a_bar_prev = trainer.alpha_bar[t-1]

            # Blending factor: the fraction of the current x to keep.
            # alpha_bar decreases with t, so a_bar_t / a_bar_prev = 1 - beta_t
            # lies in (0, 1); a larger beta_t means a bigger step toward x0_hat.
            # It is a scalar here, reshaped to (1,1,1,1) so it broadcasts.
            a = (a_bar_t / a_bar_prev).clamp(0.0, 1.0).view(-1, 1, 1, 1)

            # Move x toward x0_hat with schedule-aware smoothing.
            x = x0_hat * (1.0 - a) + x * a
        else:
            x = x0_hat

    return x.clamp(0, 1)

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # model and trainer come from train_demo() in the training section
    samples = sample(model, trainer, n_samples=8, device=device)

    # Quick visualization: save as a grid
    from torchvision.utils import make_grid
    from PIL import Image

    grid = make_grid(samples, nrow=4)
    # Round before casting so values like 254.9 do not truncate to 254
    img = (grid * 255).round().byte().permute(1, 2, 0).cpu().numpy()
    Image.fromarray(img).save("tile_diffusion_samples.png")
    print("Saved tile_diffusion_samples.png")
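
For comparison, here is a sketch of a deterministic DDIM-style update (eta = 0), which is a commonly used sampler for models that predict x0. It is not the sampler this post uses. It converts x0_hat into an implied noise estimate and then re-noises to the level of the previous timestep, so each intermediate x stays closer to what the model saw in training. ddim_style_sample is a hypothetical name, and the constant stand-in model exists only so the snippet runs on its own.

```python
import torch

@torch.no_grad()
def ddim_style_sample(model, alpha_bar, shape):
    """
    model:     callable (x_t, t_batch) -> predicted x0, same shape as x_t
    alpha_bar: (T,) cumulative products of (1 - beta)
    shape:     (B, C, H, W) of the samples to generate
    """
    device = alpha_bar.device
    T = alpha_bar.shape[0]
    x = torch.randn(shape, device=device)

    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        x0_hat = model(x, t_batch)

        if t > 0:
            a_t, a_prev = alpha_bar[t], alpha_bar[t - 1]
            # Noise implied by the current x and the predicted clean image.
            eps_hat = (x - a_t.sqrt() * x0_hat) / (1.0 - a_t).sqrt()
            # Re-noise x0_hat to the (lower) noise level of step t-1.
            x = a_prev.sqrt() * x0_hat + (1.0 - a_prev).sqrt() * eps_hat
        else:
            x = x0_hat

    return x.clamp(0, 1)

# Stand-in "model" that always predicts a flat gray image (illustration only).
const = torch.full((2, 3, 8, 8), 0.5)
alpha_bar = torch.cumprod(1.0 - torch.linspace(1e-4, 2e-2, 200), dim=0)
out = ddim_style_sample(lambda x, t: const, alpha_bar, (2, 3, 8, 8))
print(out.shape)  # torch.Size([2, 3, 8, 8])
```

With the trained network, the call would look like ddim_style_sample(model, trainer.alpha_bar, (8, 3, 32, 32)). I have not compared the two samplers carefully, so treat this as something to try rather than a recommendation.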

Step 5: Verify the “no smear” behavior

When I first trained with an MSE loss, the model started producing color gradients inside what should be solid-color tiles. Switching to L1 (and keeping the model tiny) made it much more faithful to the blocky palette structure.
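
Eyeballing samples works, but two simple numbers make the comparison less subjective: how much the color varies inside each tile, and how far each pixel sits from its nearest palette color. The helpers below are a sketch under my own assumptions (tile_flatness and palette_distance are hypothetical names, and the blurred image is a synthetic stand-in for a smeary sample), not measurements from my runs.

```python
import torch
import torch.nn.functional as F

def tile_flatness(img, tile_size=4):
    """Mean per-tile standard deviation; 0.0 means every tile is one flat color."""
    C, H, W = img.shape
    # Cut the image into non-overlapping tiles: (C, H/ts, W/ts, ts, ts)
    tiles = img.unfold(1, tile_size, tile_size).unfold(2, tile_size, tile_size)
    tiles = tiles.reshape(C, H // tile_size, W // tile_size, -1)
    return tiles.std(dim=-1, unbiased=False).mean().item()

def palette_distance(img, palette):
    """Mean Euclidean distance from each pixel to its nearest palette color."""
    pixels = img.permute(1, 2, 0).reshape(-1, 1, 3)           # (H*W, 1, 3)
    dists = (pixels - palette[None, :, :]).norm(dim=-1)       # (H*W, K)
    return dists.min(dim=1).values.mean().item()

# Two-tile test image: dark floor on the left, water on the right.
palette = torch.tensor([[30, 30, 30], [40, 120, 220]], dtype=torch.float32) / 255.0
clean = torch.zeros(3, 8, 8)
clean[:, :, :4] = palette[0][:, None, None]
clean[:, :, 4:] = palette[1][:, None, None]

# Synthetic "smear": a 3x3 box blur across the tile boundary.
smeared = F.avg_pool2d(clean[None], kernel_size=3, stride=1, padding=1)[0]

print("clean  :", tile_flatness(clean), palette_distance(clean, palette))
print("smeared:", tile_flatness(smeared), palette_distance(smeared, palette))
```

Both numbers are zero (up to float rounding) for a perfectly blocky image and grow as tiles pick up gradients, so they can be tracked across loss functions or training steps.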

I also noticed timestep conditioning mattered a lot