Training A Tiny Diffusion Model For Pixel Art Tilemaps With Denoising Autoencoder Loss
Written by
Nova Neural
The weird problem I stumbled into
I wanted to generate pixel-art tilemaps (think: 16×16 tiles arranged into a small scene), but I kept running into a very specific failure mode: my outputs looked “smudged” and washed out. The same happened when I tried common denoisers—noise removal helped, but the color blocks blurred like wet watercolor.
After a few weekend sessions of tinkering, I ended up with a niche approach that actually matched the data: instead of treating pixel art like natural images, I trained a tiny diffusion-style denoising model that uses a denoising autoencoder loss designed for discrete-ish color blocks. The result wasn’t perfect, but it was dramatically less smeary and much more “tilemap-like.”
In this post I’ll walk through a working, end-to-end prototype in PyTorch:
- Represent tilemaps as small RGB images
- Train a tiny denoiser with a diffusion-noise schedule
- Use the denoising objective at random timesteps
- Sample by iteratively removing noise
No external datasets are needed: I generate a toy tilemap dataset on the fly so the code runs anywhere.
Background: what I mean by “diffusion-style denoising”
A diffusion model typically learns to reverse a “noising process” that turns a clean image into noise. Here I’m implementing a lightweight variant:
- Take a clean image x0.
- Choose a timestep t.
- Add noise to get a corrupted image xt.
- Train a network to predict the original image (or the noise).
- At sampling time, start from random noise and iteratively denoise.
This version uses a denoising autoencoder loss: the network sees xt and is trained to reconstruct the clean image x0.
A dataset that actually looks like tilemaps
Pixel art tilemaps have sharp edges and repeated structures. To mimic that, I generate small scenes by:
- Choosing a palette of colors
- Drawing a few “tile types” (floor, wall, water, sky)
- Assembling them into an 8×8 grid of tiles, with each tile rendered as a 4×4 block of solid-color pixels
Below is a toy generator that produces images of shape (3, 32, 32).
Step 1: Build a toy tilemap dataset
```python
import math
import random

import torch
from torch.utils.data import Dataset, DataLoader


class TileMapDataset(Dataset):
    """
    Produces toy 'pixel art tilemaps' as low-res RGB images.
    Output: float tensor in [0, 1] with shape (3, H, W).
    """

    def __init__(self, n_samples=2000, tile_size=4, grid_size=8, seed=0):
        super().__init__()
        self.n_samples = n_samples
        self.tile_size = tile_size
        self.grid_size = grid_size
        self.H = grid_size * tile_size
        self.W = grid_size * tile_size
        self.rng = random.Random(seed)

        # A small palette: floor, wall, water, sky (+ some accent colors).
        # Colors are normalized to [0, 1] tensors in _pick_color.
        self.palette = [
            (30, 30, 30),     # dark floor
            (210, 210, 210),  # wall
            (40, 120, 220),   # water
            (130, 200, 255),  # sky
            (240, 180, 60),   # accent
            (60, 220, 120),   # grass
        ]

    def _pick_color(self, idx):
        r, g, b = self.palette[idx]
        return torch.tensor([r, g, b], dtype=torch.float32) / 255.0

    def _generate_one(self):
        ts = self.tile_size
        gs = self.grid_size
        H = self.H
        W = self.W
        img = torch.zeros(3, H, W, dtype=torch.float32)

        # Choose a "scene theme"
        theme = self.rng.choice(["cave", "island", "sky"])

        # Create a base map of tile types
        # 0 floor, 1 wall, 2 water, 3 sky, 4 accent, 5 grass
        grid = [[0 for _ in range(gs)] for _ in range(gs)]

        # Basic layout patterns
        if theme == "cave":
            # Random walls and a watery pool
            for y in range(gs):
                for x in range(gs):
                    p = self.rng.random()
                    if p < 0.18:
                        grid[y][x] = 1
                    else:
                        grid[y][x] = 0
            # carve a small "water" rectangle
            x0 = self.rng.randint(1, gs - 3)
            y0 = self.rng.randint(1, gs - 3)
            w = self.rng.randint(2, 4)
            h = self.rng.randint(2, 4)
            for y in range(y0, min(gs, y0 + h)):
                for x in range(x0, min(gs, x0 + w)):
                    grid[y][x] = 2

        elif theme == "island":
            # grass around, water at the center, walls as rocks
            cx = gs // 2 + self.rng.randint(-1, 1)
            cy = gs // 2 + self.rng.randint(-1, 1)
            for y in range(gs):
                for x in range(gs):
                    d = math.sqrt((x - cx) ** 2 + (y - cy) ** 2)
                    if d < gs * 0.18:
                        grid[y][x] = 2
                    elif d < gs * 0.38:
                        grid[y][x] = 5
                    else:
                        grid[y][x] = 0
            # add a few rocks (walls)
            for _ in range(gs):
                x = self.rng.randint(0, gs - 1)
                y = self.rng.randint(0, gs - 1)
                if self.rng.random() < 0.25:
                    grid[y][x] = 1

        else:  # sky
            # sky background with some accent "buildings" and ground
            for y in range(gs):
                for x in range(gs):
                    if y < gs * 0.55:
                        grid[y][x] = 3
                    else:
                        grid[y][x] = 0
            # skyline buildings
            for _ in range(gs // 2):
                w = self.rng.randint(1, 3)
                h = self.rng.randint(2, 4)
                x0 = self.rng.randint(0, gs - w)
                y0 = int(gs * 0.55) - h
                for y in range(y0, max(0, y0 + h)):
                    for x in range(x0, x0 + w):
                        if 0 <= y < gs:
                            grid[y][x] = 1
            # add accents (lights)
            for _ in range(gs):
                x = self.rng.randint(0, gs - 1)
                y = self.rng.randint(int(gs * 0.45), gs - 1)
                if grid[y][x] in (0, 1) and self.rng.random() < 0.2:
                    grid[y][x] = 4

        # Render tile grid into pixels
        for ty in range(gs):
            for tx in range(gs):
                tile_idx = grid[ty][tx]
                color = self._pick_color(tile_idx)  # (3,)
                y_start = ty * ts
                x_start = tx * ts
                img[:, y_start:y_start + ts, x_start:x_start + ts] = color[:, None, None]

        return img

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        # Deterministic randomness per idx
        self.rng.seed(idx + 12345)
        x = self._generate_one()
        return x
```
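To make the rendering step easy to check by hand, here is a minimal stdlib-only sketch of the same idea on a 2×2 grid. The `render_grid` helper is a hypothetical name introduced for illustration (it is not part of the dataset class), and it uses nested lists of RGB tuples instead of tensors.

```python
def render_grid(grid, palette, tile_size):
    """Expand a grid of tile indices into rows of RGB pixels (hypothetical helper)."""
    height = len(grid) * tile_size
    width = len(grid[0]) * tile_size
    pixels = [[None] * width for _ in range(height)]
    for ty, row in enumerate(grid):
        for tx, tile_idx in enumerate(row):
            color = palette[tile_idx]
            # Every pixel inside the tile gets the same flat color.
            for dy in range(tile_size):
                for dx in range(tile_size):
                    pixels[ty * tile_size + dy][tx * tile_size + dx] = color
    return pixels


palette = [(30, 30, 30), (210, 210, 210)]  # floor and wall colors from the post's palette
grid = [[0, 1],
        [1, 0]]
pixels = render_grid(grid, palette, tile_size=4)

print(len(pixels), len(pixels[0]))  # 8 8
print(pixels[0][3], pixels[0][4])   # (30, 30, 30) (210, 210, 210)
```

The color changes exactly at the tile boundary (column 3 to column 4), with no intermediate values. That hard step is the property the denoiser has to preserve.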
Step 2: A tiny denoiser network that respects “blocky” images
For 32×32 pixel art, I used a small convolutional denoiser. The only special bit is timestep conditioning: the network needs to know how strong the noise is.
I implement timestep embeddings and feed them into the network.
Model: Simple timestep-conditioned CNN denoiser
```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class SinusoidalTimeEmbedding(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.emb_dim = emb_dim

    def forward(self, t):
        """
        t: (B,) integer timesteps
        returns: (B, emb_dim)
        """
        half = self.emb_dim // 2
        device = t.device
        t = t.float()
        freqs = torch.exp(
            torch.arange(half, device=device).float() * (-math.log(10000.0) / (half - 1))
        )  # (half,)
        args = t[:, None] * freqs[None, :]  # (B, half)
        emb = torch.cat([torch.sin(args), torch.cos(args)], dim=1)  # (B, 2*half)
        if self.emb_dim % 2 == 1:
            emb = F.pad(emb, (0, 1))
        return emb


class TinyDenoiser(nn.Module):
    def __init__(self, in_ch=3, hidden=64, time_emb=128):
        super().__init__()
        self.time_emb = SinusoidalTimeEmbedding(time_emb)
        self.time_mlp = nn.Sequential(
            nn.Linear(time_emb, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
        )

        # Encoder
        self.conv1 = nn.Conv2d(in_ch, hidden, 3, padding=1)
        self.conv2 = nn.Conv2d(hidden, hidden, 3, padding=1)
        self.down = nn.Conv2d(hidden, hidden, 3, stride=2, padding=1)

        # Bottleneck
        self.bot = nn.Conv2d(hidden, hidden, 3, padding=1)

        # Decoder
        self.up = nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1)
        self.conv3 = nn.Conv2d(hidden, hidden, 3, padding=1)
        self.out = nn.Conv2d(hidden, in_ch, 3, padding=1)

    def forward(self, x_t, t):
        """
        x_t: (B, 3, 32, 32) noisy image
        t: (B,) integer timesteps
        returns: predicted x_0 image (B, 3, 32, 32)
        """
        temb = self.time_mlp(self.time_emb(t))  # (B, hidden)

        h = F.silu(self.conv1(x_t))
        h = F.silu(self.conv2(h))
        h = self.down(h)  # (B, hidden, 16, 16)

        # Inject time embedding as a channel-wise bias
        h = h + temb[:, :, None, None]

        h = F.silu(self.bot(h))
        h = F.silu(self.up(h))  # back to (B, hidden, 32, 32)
        h = F.silu(self.conv3(h))

        # Sigmoid keeps the predicted x0 in [0, 1], matching the dataset range
        pred = torch.sigmoid(self.out(h))
        return pred
```
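To see what the timestep embedding actually produces, here is a scalar, stdlib-only sketch of the same sin/cos formula. The function name `sinusoidal_embedding` and the small `emb_dim=8` are assumptions chosen for readability; the sketch assumes an even `emb_dim` of at least 4 and skips the padding branch.

```python
import math


def sinusoidal_embedding(t, emb_dim):
    """Scalar version of the sin/cos timestep embedding (assumes even emb_dim >= 4)."""
    half = emb_dim // 2
    # Frequencies fall geometrically from 1 down to 1/10000.
    freqs = [math.exp(-math.log(10000.0) * i / (half - 1)) for i in range(half)]
    # First half: sines. Second half: cosines. Same layout as the module above.
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


e0 = sinusoidal_embedding(0, 8)
e50 = sinusoidal_embedding(50, 8)

print(len(e0))  # 8
print(e0)       # [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
print(e50)      # a different vector, every entry in [-1, 1]
```

Each timestep maps to a distinct bounded vector, which the small MLP then turns into the per-channel bias added after the downsampling layer.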
Step 3: Diffusion schedule + denoising loss
Here’s the key: I use a simple linear beta schedule and compute:
- x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise
- Train the network to predict x0 from x_t
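Here is a small stdlib-only sketch of how the two weights in that formula evolve over the schedule. The helper names `linear_betas` and `cumulative_alpha_bar` are assumptions for illustration; the defaults mirror the values used in this post (T=200, betas from 1e-4 to 2e-2).

```python
import math


def linear_betas(T, beta_start=1e-4, beta_end=2e-2):
    """Evenly spaced betas from beta_start to beta_end."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def cumulative_alpha_bar(betas):
    """alpha_bar[t] = product of (1 - beta[s]) for all s <= t."""
    out, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        out.append(running)
    return out


alpha_bar = cumulative_alpha_bar(linear_betas(200))
for t in (0, 50, 100, 199):
    signal = math.sqrt(alpha_bar[t])       # weight on x0
    noise = math.sqrt(1.0 - alpha_bar[t])  # weight on the Gaussian noise
    print(f"t={t:3d}  signal weight={signal:.3f}  noise weight={noise:.3f}")
```

With these defaults the signal weight at the last step comes out to roughly 0.36 rather than near zero, so the most corrupted training images still carry some of the clean tilemap. That is a property of the short schedule and is worth keeping in mind when sampling starts from pure noise.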
To keep outputs blocky, I use a loss that emphasizes correct pixel values but doesn’t over-penalize tiny deviations. A practical option is L1 loss (absolute error), which often preserves edges better than MSE for this kind of data.
Training code (end-to-end)
```python
import math

import torch
import torch.nn.functional as F
from torch.optim import AdamW


# --------------------------
# Diffusion utilities
# --------------------------
def linear_beta_schedule(T, beta_start=1e-4, beta_end=2e-2):
    return torch.linspace(beta_start, beta_end, T)


def compute_alphas(betas):
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)  # (T,)
    return alphas, alpha_bar


class DiffusionTrainer:
    def __init__(self, T=200, device="cpu"):
        self.T = T
        self.device = device
        betas = linear_beta_schedule(T).to(device)
        alphas, alpha_bar = compute_alphas(betas)
        self.betas = betas
        self.alphas = alphas
        self.alpha_bar = alpha_bar

    def q_sample(self, x0, t, noise=None):
        """
        x0: (B, 3, H, W) in [0,1]
        t: (B,) ints in [0, T-1]
        returns: x_t
        """
        if noise is None:
            noise = torch.randn_like(x0)
        a_bar = self.alpha_bar[t].view(-1, 1, 1, 1)  # (B,1,1,1)
        return (a_bar.sqrt() * x0) + ((1.0 - a_bar).sqrt() * noise)

    def training_step(self, model, x0, t):
        noise = torch.randn_like(x0)
        x_t = self.q_sample(x0, t, noise=noise)
        pred_x0 = model(x_t, t)
        # Denoising autoencoder loss: L1 on reconstructed x0
        loss = F.l1_loss(pred_x0, x0)
        return loss


# --------------------------
# Training loop
# --------------------------
def train_demo():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print("device:", device)

    dataset = TileMapDataset(n_samples=2000, tile_size=4, grid_size=8, seed=0)
    loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=0, drop_last=True)

    model = TinyDenoiser(in_ch=3, hidden=64, time_emb=128).to(device)
    trainer = DiffusionTrainer(T=200, device=device)
    opt = AdamW(model.parameters(), lr=2e-3, weight_decay=1e-4)

    steps = 1500
    model.train()
    it = 0
    while it < steps:
        for x0 in loader:
            x0 = x0.to(device)
            B = x0.size(0)
            t = torch.randint(0, trainer.T, (B,), device=device)

            loss = trainer.training_step(model, x0, t)

            opt.zero_grad(set_to_none=True)
            loss.backward()
            opt.step()

            it += 1
            if it % 200 == 0:
                print(f"step {it}/{steps} loss={loss.item():.4f}")
            if it >= steps:
                break

    return model, trainer


if __name__ == "__main__":
    model, trainer = train_demo()
```
What’s happening in the training step (why this works better for tilemaps)
- I randomly pick a timestep t.
- I corrupt the clean tilemap x0 into x_t using the forward noising rule (q_sample).
- The CNN learns to “undo” the corruption conditioned on t.
- Because pixel art has big constant regions, L1 loss keeps those regions stable rather than averaging colors the way MSE can.
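The L1-versus-MSE point can be checked with a toy calculation. Suppose that, for one heavily corrupted pixel, the clean value is a floor tile (30) six times out of ten and a wall tile (210) four times out of ten. The sketch below brute-forces the best single prediction under each loss; the 60/40 split and the helper name `expected_loss` are assumptions made up for illustration, and a real network is of course not a constant predictor.

```python
def expected_loss(pred, targets, loss):
    """Average loss of one fixed prediction against all possible targets."""
    return sum(loss(pred, y) for y in targets) / len(targets)


# One color channel: 6 floor pixels (30) and 4 wall pixels (210).
targets = [30] * 6 + [210] * 4
candidates = range(256)

best_mse = min(candidates, key=lambda p: expected_loss(p, targets, lambda a, b: (a - b) ** 2))
best_l1 = min(candidates, key=lambda p: expected_loss(p, targets, lambda a, b: abs(a - b)))

print(best_mse)  # 102 (the mean: a gray that is not in the palette)
print(best_l1)   # 30  (the median: an actual palette color)
```

MSE is minimized by the mean, which blends the two colors, while L1 is minimized by the median, which commits to one of them. That is the standard mean-versus-median property of the two losses, and it lines up with the smearing I saw.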
Step 4: Sampling (iterative denoising)
Sampling is the reverse process. With this simplified setup (predicting x0 directly), I use a pragmatic update rule:
- Start from x_T ~ N(0, 1)
- For each t descending:
  - Predict x0_hat = model(x_t, t)
  - Blend x_{t-1} toward x0_hat using the schedule
This isn’t the most theoretically “pure” reverse diffusion, but it produces coherent images for this toy problem, and it matches the training objective directly.
Sampling code
```python
import torch


@torch.no_grad()
def sample(model, trainer, n_samples=8, H=32, W=32, device="cpu"):
    model.eval()
    T = trainer.T
    x = torch.randn(n_samples, 3, H, W, device=device)

    for t in reversed(range(T)):
        t_batch = torch.full((n_samples,), t, device=device, dtype=torch.long)
        x0_hat = model(x, t_batch)  # predicted clean image in [0,1]

        if t > 0:
            a_bar_t = trainer.alpha_bar[t]
            a_bar_prev = trainer.alpha_bar[t - 1]

            # Blending factor: the fraction of the current sample kept this step.
            # alpha_bar shrinks as t grows, so a_bar_t / a_bar_prev (= alpha_t)
            # lies in (0, 1). The inverse ratio would exceed 1, clamp to 1,
            # and leave x unchanged until the final step.
            a = (a_bar_t / a_bar_prev).clamp(0.0, 1.0)  # 0-dim tensor, broadcasts

            # Move x toward x0_hat with schedule-aware smoothing.
            x = x0_hat * (1.0 - a) + x * a
        else:
            x = x0_hat

    return x.clamp(0, 1)


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # model and trainer come from train_demo() in the training code
    samples = sample(model, trainer, n_samples=8, device=device)

    # Quick visualization: save as a grid
    from torchvision.utils import make_grid
    from PIL import Image

    grid = make_grid(samples, nrow=4)
    img = (grid * 255).byte().permute(1, 2, 0).cpu().numpy()
    Image.fromarray(img).save("tile_diffusion_samples.png")
    print("Saved tile_diffusion_samples.png")
```
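The blend is a heuristic. A common alternative for x0-predicting models is the deterministic DDIM-style update (DDIM with no added noise), which re-noises the prediction to the exact noise level of the previous step. This is a swapped-in technique, not what the sampler in this post does. The sketch works on a single pixel value so the arithmetic stays visible; `ddim_step` is a hypothetical name, and the same expressions apply elementwise to tensors.

```python
import math


def ddim_step(x_t, x0_hat, a_bar_t, a_bar_prev):
    """Deterministic DDIM-style move from noise level a_bar_t to a_bar_prev.

    Assumes a_bar_t < 1 so the division below is well defined.
    """
    # Noise implied by the current sample and the predicted clean value.
    eps_hat = (x_t - math.sqrt(a_bar_t) * x0_hat) / math.sqrt(1.0 - a_bar_t)
    # Re-noise the prediction to the lower noise level of the previous step.
    return math.sqrt(a_bar_prev) * x0_hat + math.sqrt(1.0 - a_bar_prev) * eps_hat


# Sanity check with an oracle predictor that returns the true x0.
x0, eps = 0.8, -1.3
a_bar_t, a_bar_prev = 0.5, 0.6
x_t = math.sqrt(a_bar_t) * x0 + math.sqrt(1.0 - a_bar_t) * eps

x_prev = ddim_step(x_t, x0, a_bar_t, a_bar_prev)
expected = math.sqrt(a_bar_prev) * x0 + math.sqrt(1.0 - a_bar_prev) * eps
print(abs(x_prev - expected) < 1e-12)  # True
```

With a perfect predictor, each step lands exactly on the forward-process sample for the previous timestep, so the input the network sees at step t-1 has the noise level it was trained on. I have not benchmarked this against the blend on the tilemap data, so treat it as an option to try rather than a recommendation.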
Step 5: Verify the “no smear” behavior
When I first trained with an MSE loss, the model started producing gradients inside what should be solid-color tiles. Switching to L1 (and keeping the model tiny) made it much more faithful to the blocky palette structure.
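One way to put a number on "smear" is to count how many pixels are far from every palette color. The sketch below is a hypothetical metric, not something I used during the original experiments: `off_palette_fraction` and the tolerance `tol=0.05` are assumptions, and pixels are given as a flat list of RGB tuples in [0, 1].

```python
def off_palette_fraction(pixels, palette, tol=0.05):
    """Fraction of pixels whose max channel difference to every palette color exceeds tol."""
    def near(p, c):
        return max(abs(a - b) for a, b in zip(p, c)) <= tol

    off = sum(1 for p in pixels if not any(near(p, c) for c in palette))
    return off / len(pixels)


floor = (30 / 255, 30 / 255, 30 / 255)
wall = (210 / 255, 210 / 255, 210 / 255)
palette = [floor, wall]

# A crisp row: 8 floor pixels followed by 8 wall pixels.
crisp = [floor] * 8 + [wall] * 8

# A smeared row: 4 pixels at the boundary are a 50/50 blend of the two colors.
blend = tuple((a + b) / 2 for a, b in zip(floor, wall))
smeared = [floor] * 6 + [blend] * 4 + [wall] * 6

print(off_palette_fraction(crisp, palette))    # 0.0
print(off_palette_fraction(smeared, palette))  # 0.25
```

Running the same function over generated samples (flattened to a list of pixels) gives a rough way to compare an MSE-trained model against an L1-trained one.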
I also noticed timestep conditioning mattered a lot