Warehouse RL

Deep Q-Learning Simulation

An interactive Gymnasium environment for reinforcement learning experimentation


Core Capabilities

1

Interactive Training Lab

PyGame UI with live visualization and full parameter control

2

DQN Agent

Deep Q-Network with replay buffer and epsilon decay

3

Real-time Metrics

Live reward graphs, moving averages, and performance tracking

4

Model Management

Save and load models, and transfer learning between configurations

5

Curriculum Learning

4-stage progression from simple to complex tasks

6

Fully Customizable

Grid size, packages, learning rate, and more

Quick Start

1

Clone Repository

git clone https://github.com/ordocaelum/warehouse-rl-demo.git

2

Install Dependencies

pip install -r requirements.txt

3

Launch Training Lab

python ui_dashboard.py

Documentation

Interactive Training Lab

Launch the full-featured interactive UI with all parameter controls.

python ui_dashboard.py

Left Panel: Parameter Controls

Parameter       Range          Default   Description
Grid Size       6-15           8         Warehouse grid dimensions
Packages        1-5            1         Number of packages to deliver
Max Steps       100-500        300       Episode step limit
Timesteps (k)   10k-500k       100k      Total training timesteps
Learning Rate   1e-4 to 1e-2   1e-3      DQN learning rate
Init Epsilon    0.5-1.0        1.0       Starting exploration rate
Final Epsilon   0.01-0.3       0.1       Ending exploration rate
Buffer (k)      1k-50k         10k       Replay buffer size
Step Cost       0.01-0.5       0.05      Per-step reward penalty
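
The ranges and defaults in the table can be captured in a small validated config object. This is a minimal sketch: the class name `TrainingConfig`, the `RANGES` dictionary, and all field names are hypothetical and are not taken from the project's code; only the numeric ranges and defaults come from the table.

```python
from dataclasses import dataclass

# Allowed (min, max) per parameter, copied from the parameter table.
# The key names are illustrative assumptions.
RANGES = {
    "grid_size": (6, 15),
    "packages": (1, 5),
    "max_steps": (100, 500),
    "timesteps": (10_000, 500_000),
    "learning_rate": (1e-4, 1e-2),
    "init_epsilon": (0.5, 1.0),
    "final_epsilon": (0.01, 0.3),
    "buffer_size": (1_000, 50_000),
    "step_cost": (0.01, 0.5),
}


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical container holding the table's default values."""
    grid_size: int = 8
    packages: int = 1
    max_steps: int = 300
    timesteps: int = 100_000
    learning_rate: float = 1e-3
    init_epsilon: float = 1.0
    final_epsilon: float = 0.1
    buffer_size: int = 10_000
    step_cost: float = 0.05

    def __post_init__(self):
        # Reject any value that falls outside the documented range.
        for name, (low, high) in RANGES.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")
```

Validating up front mirrors what the UI sliders do implicitly: a value outside the documented range raises an error before any training starts.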

Environment Details

Action Space
Action   Description
0        Move up
1        Move down
2        Move left
3        Move right
4        Pickup package
5        Dropoff package
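
The six action indices map naturally onto an enum. The sketch below is illustrative only: the names `Action`, `MOVES`, and `apply_move` are hypothetical, and the coordinate convention (y grows downward, as in PyGame) and the clamping at the grid edge are assumptions rather than documented behaviour.

```python
from enum import IntEnum


class Action(IntEnum):
    UP = 0
    DOWN = 1
    LEFT = 2
    RIGHT = 3
    PICKUP = 4
    DROPOFF = 5


# Assumption: (x, y) coordinates with y increasing downward.
MOVES = {
    Action.UP: (0, -1),
    Action.DOWN: (0, 1),
    Action.LEFT: (-1, 0),
    Action.RIGHT: (1, 0),
}


def apply_move(pos, action, grid_size):
    """Return the new (x, y); non-movement actions leave the position unchanged."""
    dx, dy = MOVES.get(Action(action), (0, 0))
    # Assumption: moves that would leave the grid are clamped to the border.
    x = min(max(pos[0] + dx, 0), grid_size - 1)
    y = min(max(pos[1] + dy, 0), grid_size - 1)
    return (x, y)
```
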
Reward Structure
  • +10 for delivering a package to the delivery zone
  • +1 for picking up a package
  • +0.1 bonus for steps reducing distance to nearest package
  • -0.01 × distance per step away from nearest package
  • -0.05 per step (the Step Cost parameter, default 0.05)
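
The listed terms can be combined into one per-step reward. This is a sketch under stated assumptions, not the project's implementation: `step_reward` and its arguments are hypothetical names, the distance metric is left to the caller, and the distance penalty is read here as 0.01 × the new distance, applied only when a step moves the agent farther from the nearest package.

```python
def step_reward(prev_dist, new_dist, picked_up=False, delivered=False,
                step_cost=0.05):
    """Combine the listed reward terms for a single step.

    prev_dist and new_dist are the distances to the nearest package
    before and after the step (metric chosen by the caller).
    """
    reward = -step_cost              # per-step penalty
    if delivered:
        reward += 10.0               # package reached the delivery zone
    if picked_up:
        reward += 1.0                # package picked up
    if new_dist < prev_dist:
        reward += 0.1                # step reduced the distance
    elif new_dist > prev_dist:
        reward -= 0.01 * new_dist    # assumed reading of the distance penalty
    return reward
```

With the default step cost, a step that moves closer nets about +0.05, so the shaping bonus outweighs the step penalty whenever the agent makes progress.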

Curriculum Learning Tutorial

Work through a 4-stage progression from simple to complex tasks, moving to the next stage once the agent performs well on the current one.

Stage 1

Master the Basics

Configuration:

  • Grid: 8×8
  • Packages: 1
  • Timesteps: 100k

Agent learns core mechanics: navigation, pickup, delivery

Expected Reward: +8 to +10

Training Time: 5-10 minutes

Stage 2

Add Complexity

Configuration:

  • Grid: 10×10
  • Packages: 2
  • Timesteps: 150k

Agent learns multi-objective planning and route optimization

Expected Reward: +2 to +6

Training Time: 10-15 minutes

Stage 3

Scale Up

Configuration:

  • Grid: 12×12
  • Packages: 3
  • Timesteps: 200k

Agent learns to handle larger environments and more objectives

Expected Reward: -2 to +2

Training Time: 15-20 minutes

Stage 4

Challenge

Configuration:

  • Grid: 15×15
  • Packages: 3-5
  • Timesteps: 300k+

Agent tackles the largest grid with many objectives

Expected Reward: -12 to -6

Training Time: 30-60 minutes
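
The four stages can be written down as data and run in order. This is a rough sketch: `STAGES`, `run_curriculum`, and the `train_fn` callback are hypothetical names, and Stage 4 uses 3 packages here even though the stage allows 3-5.

```python
# Stage settings copied from the tutorial above; key names are assumptions.
STAGES = [
    {"name": "Master the Basics", "grid_size": 8, "packages": 1, "timesteps": 100_000},
    {"name": "Add Complexity", "grid_size": 10, "packages": 2, "timesteps": 150_000},
    {"name": "Scale Up", "grid_size": 12, "packages": 3, "timesteps": 200_000},
    # Stage 4 allows 3-5 packages and 300k+ timesteps; the low end is used here.
    {"name": "Challenge", "grid_size": 15, "packages": 3, "timesteps": 300_000},
]


def run_curriculum(train_fn, stages=STAGES):
    """Call train_fn(grid_size=..., packages=..., timesteps=...) once per stage.

    train_fn is a caller-supplied placeholder for whatever starts a
    training run; its return value is collected alongside the stage name.
    """
    results = []
    for stage in stages:
        config = {key: value for key, value in stage.items() if key != "name"}
        results.append((stage["name"], train_fn(**config)))
    return results
```
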

Best Practices

Parameter Tuning Guidelines

Learning Rate

1e-4: Stable but slow learning

1e-3: Balanced (recommended)

1e-2: Fast but unstable

Timesteps

50k: Quick experiments

100k: Standard baseline

200k+: For convergence

Common Mistakes to Avoid

Mistake 1: Jumping to Hard Tasks Too Early

Start with 8×8, 1 package. Master it. Then scale up gradually.

Mistake 2: Changing Parameters Mid-Training

Set all parameters first, then click START. Don't adjust during training.

Mistake 3: Expecting Immediate Results

Train for at least 10 minutes before evaluating. Learning takes time.

Mistake 4: Ignoring the Moving Average

Watch the orange trend line, not the noisy blue episode rewards.
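
A moving average of episode rewards is straightforward to compute. The sketch below is illustrative: the class name `MovingAverage` and the default window of 100 episodes are assumptions, since the window the dashboard uses is not stated.

```python
from collections import deque


class MovingAverage:
    """Running mean over the most recent `window` episode rewards."""

    def __init__(self, window=100):
        # A bounded deque silently drops the oldest reward when full.
        self.values = deque(maxlen=window)

    def update(self, reward):
        """Record one episode reward and return the current average."""
        self.values.append(reward)
        return sum(self.values) / len(self.values)
```

Averaging over a window smooths out the episode-to-episode noise caused by random start positions and exploration, which is why the trend line is the better signal of progress.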

Model Management

When to Save

  • When moving average plateaus
  • Before trying new configurations
  • After significant improvements

Transfer Learning

Load a model trained on 8×8, 1 package into 10×10, 2 packages for faster learning.

Important: Only load into SAME environment config, or train fresh on new config.

Configuration Examples

Beginner Setup

  • Grid: 8×8
  • Packages: 1
  • Timesteps: 100k
  • Learning Rate: 1e-3

Expected reward: +8 to +10

Perfect for learning the basics

Intermediate Setup

  • Grid: 10×10
  • Packages: 2
  • Timesteps: 150k
  • Learning Rate: 1e-3

Expected reward: +2 to +6

Good for multi-objective learning

Advanced Setup

  • Grid: 15×15
  • Packages: 5
  • Timesteps: 300k
  • Learning Rate: 1e-3

Expected reward: -12 to -8

Challenge yourself with complex scenarios

Technical Reference

Observation Space

For N packages, the observation is (3 + 3N)-dimensional:

  • Agent X position
  • Agent Y position
  • Carrying status (0 or 1)
  • For each package: X, Y, delivery_status
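
The layout above can be assembled into a flat vector as follows. This is a minimal sketch: `build_observation` is a hypothetical helper, and the element order within the vector and the absence of any normalization are assumptions.

```python
def build_observation(agent_xy, carrying, packages):
    """Flatten the state into a list of 3 + 3N floats.

    packages is a list of (x, y, delivered) tuples, one per package.
    """
    obs = [float(agent_xy[0]), float(agent_xy[1]), float(bool(carrying))]
    for x, y, delivered in packages:
        obs.extend([float(x), float(y), float(bool(delivered))])
    return obs


# One package gives 6 values, two packages give 9.
example = build_observation((2, 3), False, [(5, 5, False), (1, 7, True)])
```

Because the length depends on N, a network trained with one package count has a different input size from one trained with another.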

DQN Parameters

  • Learning rate: 1e-3 (adjustable)
  • Replay buffer size: 10k (adjustable)
  • Target update interval: 500 steps
  • Exploration fraction: 20% of training
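
The exploration fraction and the Init/Final Epsilon settings together define an epsilon schedule. The sketch below assumes a linear decay over the first 20% of training followed by a constant final value; the shape of the decay is an assumption, and `epsilon_at` is a hypothetical helper name.

```python
def epsilon_at(step, total_timesteps, init_eps=1.0, final_eps=0.1,
               exploration_fraction=0.2):
    """Exploration rate at a given training step (linear decay assumed)."""
    decay_steps = exploration_fraction * total_timesteps
    # Progress runs from 0 to 1 over the decay window, then stays at 1.
    progress = min(step / decay_steps, 1.0)
    return init_eps + progress * (final_eps - init_eps)
```

With the defaults and 100k timesteps, epsilon falls from 1.0 to 0.1 over the first 20k steps and stays at 0.1 for the remaining 80k.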

Keyboard Shortcuts

Key        Action
SPACE      Pause/Resume training
S          Save model
R          Reset stats
Q / ESC    Exit