Warehouse RL Training Lab

Deep Q-Learning Simulation

An interactive Gymnasium environment for reinforcement learning experimentation

Core Capabilities

Interactive Training Lab

PyGame UI with live visualization and full parameter control

DQN Agent

Deep Q-Network with replay buffer and epsilon decay

Real-time Metrics

Live reward graphs, moving averages, and performance tracking

Model Management

Save, load, and transfer learning between configurations

Curriculum Learning

4-stage progression from simple to complex tasks

Fully Customizable

Grid size, packages, learning rate, and more

Quick Start

Clone Repository

git clone https://github.com/ordocaelum/warehouse-rl-demo.git

Install Dependencies

pip install -r requirements.txt

Launch Training Lab

python ui_dashboard.py

Documentation

Interactive Training Lab

Launch the full-featured interactive UI with all parameter controls.

python ui_dashboard.py

Left Panel: Parameter Controls

Parameter	Range	Default	Description
Grid Size	6-15	8	Warehouse grid dimensions
Packages	1-5	1	Number of packages to deliver
Max Steps	100-500	300	Episode step limit
Timesteps (k)	10k-500k	100k	Total training timesteps
Learning Rate	1e-4 to 1e-2	1e-3	DQN learning rate
Init Epsilon	0.5-1.0	1.0	Starting exploration rate
Final Epsilon	0.01-0.3	0.1	Ending exploration rate
Buffer (k)	1k-50k	10k	Replay buffer size
Step Cost	0.01-0.5	0.05	Per-step reward penalty
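The ranges and defaults in the table can be captured in a small config object that rejects out-of-range values before training starts. This is a minimal sketch: the class name LabConfig, its field names, and the validate method are hypothetical and not taken from the repository; only the numbers come from the table.

```python
from dataclasses import dataclass

# Allowed (min, max) per parameter, copied from the table above.
RANGES = {
    "grid_size": (6, 15),
    "packages": (1, 5),
    "max_steps": (100, 500),
    "timesteps": (10_000, 500_000),
    "learning_rate": (1e-4, 1e-2),
    "init_epsilon": (0.5, 1.0),
    "final_epsilon": (0.01, 0.3),
    "buffer_size": (1_000, 50_000),
    "step_cost": (0.01, 0.5),
}


@dataclass
class LabConfig:  # hypothetical name, not taken from the repository
    grid_size: int = 8
    packages: int = 1
    max_steps: int = 300
    timesteps: int = 100_000
    learning_rate: float = 1e-3
    init_epsilon: float = 1.0
    final_epsilon: float = 0.1
    buffer_size: int = 10_000
    step_cost: float = 0.05

    def validate(self) -> None:
        """Raise ValueError if any field is outside its documented range."""
        for name, (low, high) in RANGES.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")


config = LabConfig(grid_size=10, packages=2, timesteps=150_000)
config.validate()  # passes: every value is inside the documented range
```

Validating once up front fits the advice to set all parameters before starting a run.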

Environment Details

Action Space

Action	Description
0	Move up
1	Move down
2	Move left
3	Move right
4	Pickup package
5	Dropoff package
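For scripts that drive the environment directly, the six action IDs read more clearly as named constants. The sketch below is illustrative only: the Action enum, the MOVES table, and apply_move are hypothetical names, and the coordinate convention (origin at the top-left, so "up" decreases y) and the rule that off-grid moves leave the agent in place are assumptions, not confirmed by the project.

```python
from enum import IntEnum


class Action(IntEnum):
    UP = 0
    DOWN = 1
    LEFT = 2
    RIGHT = 3
    PICKUP = 4
    DROPOFF = 5


# Assumed convention: (0, 0) is the top-left cell, so "up" decreases y.
MOVES = {
    Action.UP: (0, -1),
    Action.DOWN: (0, 1),
    Action.LEFT: (-1, 0),
    Action.RIGHT: (1, 0),
}


def apply_move(x, y, action, grid_size):
    """Return the new (x, y); pickup, dropoff, and off-grid moves leave it unchanged."""
    dx, dy = MOVES.get(Action(action), (0, 0))
    new_x, new_y = x + dx, y + dy
    if 0 <= new_x < grid_size and 0 <= new_y < grid_size:
        return new_x, new_y
    return x, y
```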

Reward Structure

+10 for delivering a package to delivery zone
+1 for picking up a package
+0.1 bonus for steps reducing distance to nearest package
-0.01 × distance per step away from nearest package
-0.05 per step (step cost, adjustable via the Step Cost parameter)
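One way to read the terms above is as a single per-step reward function. The sketch below is an interpretation, not the project's code: the function name step_reward is hypothetical, distance is assumed to be Manhattan distance to the nearest package, and the -0.01 × distance term is assumed to apply only on steps that increase that distance.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) cells (assumed distance metric)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def step_reward(prev_dist, new_dist, picked_up=False, delivered=False, step_cost=0.05):
    """Combine the listed reward terms for one step (assumed interpretation)."""
    reward = -step_cost                # per-step cost
    if delivered:
        reward += 10.0                 # package delivered to the delivery zone
    if picked_up:
        reward += 1.0                  # package picked up
    if new_dist < prev_dist:
        reward += 0.1                  # moved closer to the nearest package
    elif new_dist > prev_dist:
        reward -= 0.01 * new_dist      # moved away: penalty scales with distance
    return reward
```

Under this reading, a step that closes the distance nets +0.05 at the default step cost, so progress toward a package stays slightly positive while idle steps are penalized.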

Curriculum Learning Tutorial

Work through a 4-stage progression from simple to complex tasks, increasing the grid size and package count at each stage.

Stage 1

Master the Basics

Configuration:

Grid: 8×8
Packages: 1
Timesteps: 100k

Agent learns core mechanics: navigation, pickup, delivery

Expected Reward: +8 to +10

Training Time: 5-10 minutes

Stage 2

Add Complexity

Configuration:

Grid: 10×10
Packages: 2
Timesteps: 150k

Agent learns multi-objective planning and route optimization

Expected Reward: +2 to +6

Training Time: 10-15 minutes

Stage 3

Scale Up

Configuration:

Grid: 12×12
Packages: 3
Timesteps: 200k

Agent learns to handle larger environments and more objectives

Expected Reward: -2 to +2

Training Time: 15-20 minutes

Stage 4

Challenge

Configuration:

Grid: 15×15
Packages: 3-5
Timesteps: 300k+

Agent tackles the largest grid with many objectives

Expected Reward: -12 to -6

Training Time: 30-60 minutes
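The four stages can also be written down as data, which makes it easy to step through them in order. This is a sketch only: the Stage class and ready_to_advance are hypothetical, and the rule for moving on (moving average at or above the low end of the expected reward range) is an assumption, since the tutorial does not state an explicit criterion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:  # hypothetical container, not from the repository
    name: str
    grid_size: int
    packages: tuple      # (min, max) package count for the stage
    timesteps: int       # Stage 4 lists "300k+", stored here as its lower bound
    reward_range: tuple  # (low, high) expected reward from the tutorial


CURRICULUM = [
    Stage("Master the Basics", 8, (1, 1), 100_000, (8, 10)),
    Stage("Add Complexity", 10, (2, 2), 150_000, (2, 6)),
    Stage("Scale Up", 12, (3, 3), 200_000, (-2, 2)),
    Stage("Challenge", 15, (3, 5), 300_000, (-12, -6)),
]


def ready_to_advance(stage, moving_avg):
    """Assumed rule: advance once the moving average enters the expected range."""
    return moving_avg >= stage.reward_range[0]
```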

Best Practices

Parameter Tuning Guidelines

Learning Rate

1e-4: Stable but slow learning

1e-3: Balanced (recommended)

1e-2: Fast but unstable

Timesteps

50k: Quick experiments

100k: Standard baseline

200k+: For convergence

Common Mistakes to Avoid

Mistake 1: Jumping to Hard Tasks Too Early

Start with 8×8, 1 package. Master it. Then scale up gradually.

Mistake 2: Changing Parameters Mid-Training

Set all parameters first, then click START. Don't adjust during training.

Mistake 3: Expecting Immediate Results

Train for at least 10 minutes before evaluating. Learning takes time.

Mistake 4: Ignoring the Moving Average

Watch the orange trend line, not the noisy blue episode rewards.
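To reproduce a trend line like this from logged episode rewards, a trailing moving average is enough. The sketch below is a generic implementation, not the lab's plotting code; the function name and the default window of 50 episodes are assumptions, since the window used by the UI is not stated.

```python
from collections import deque


def moving_average(rewards, window=50):
    """Return the trailing mean after each episode, over at most `window` episodes."""
    recent = deque(maxlen=window)
    running_sum = 0.0
    averages = []
    for reward in rewards:
        if len(recent) == window:
            running_sum -= recent[0]   # oldest value is about to be evicted
        recent.append(reward)
        running_sum += reward
        averages.append(running_sum / len(recent))
    return averages
```

A larger window gives a smoother but slower-reacting line; a smaller one follows recent episodes more closely.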

Model Management

When to Save

When moving average plateaus
Before trying new configurations
After significant improvements

Transfer Learning

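Transfer learning reuses a model trained on an easier setup as the starting point for a harder one, which can speed up learning.

Important: a saved model expects the observation shape it was trained on, and that shape depends on the package count (3 + 3N values for N packages). Load a model only into a matching environment configuration; for a new configuration, train fresh.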

Configuration Examples

Beginner Setup

Grid: 8×8
Packages: 1
Timesteps: 100k
Learning Rate: 1e-3

Expected reward: +8 to +10

Perfect for learning the basics

Intermediate Setup

Grid: 10×10
Packages: 2
Timesteps: 150k
Learning Rate: 1e-3

Expected reward: +2 to +6

Good for multi-objective learning

Advanced Setup

Grid: 15×15
Packages: 5
Timesteps: 300k
Learning Rate: 1e-3

Expected reward: -12 to -8

Challenge yourself with complex scenarios

Technical Reference

Observation Space

For N packages, the observation is (3 + 3N)-dimensional:

Agent X position
Agent Y position
Carrying status (0 or 1)
For each package: X, Y, delivery_status
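The layout above can be sketched as a flat list builder. The function names are hypothetical, and the field order within the vector and the use of raw, unnormalized grid coordinates are assumptions; only the 3 + 3N size and the listed fields come from this reference.

```python
def observation_size(num_packages):
    """Length of the observation vector for N packages: 3 + 3N."""
    return 3 + 3 * num_packages


def build_observation(agent_xy, carrying, packages):
    """Flatten agent state and package states into one list of floats.

    packages is a list of (x, y, delivered) tuples, one per package.
    """
    obs = [float(agent_xy[0]), float(agent_xy[1]), 1.0 if carrying else 0.0]
    for x, y, delivered in packages:
        obs.extend([float(x), float(y), 1.0 if delivered else 0.0])
    return obs


obs = build_observation((2, 5), False, [(1, 2, False), (3, 4, True)])
```

With two packages the vector has 9 entries, while a single-package environment produces 6, so the two shapes are not interchangeable.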

DQN Parameters

Learning rate: 1e-3 (adjustable)
Replay buffer size: 10k (adjustable)
Target update interval: 500 steps
Exploration fraction: 20% of training
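An exploration fraction of 20% means epsilon moves from its initial to its final value over the first fifth of training and then stays there. The sketch below assumes a linear decay, which is a common choice but is not confirmed by this reference; the function name epsilon_at is hypothetical, and the defaults mirror the documented values (1.0, 0.1, 100k timesteps).

```python
def epsilon_at(step, total_timesteps=100_000, init_eps=1.0,
               final_eps=0.1, exploration_fraction=0.2):
    """Exploration rate at a given step, assuming linear decay then a constant floor."""
    decay_steps = exploration_fraction * total_timesteps
    progress = min(step / decay_steps, 1.0)   # fraction of the decay completed
    return init_eps + progress * (final_eps - init_eps)
```

With the defaults, the decay finishes at step 20,000; every later step uses the final epsilon.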

Keyboard Shortcuts

Key	Action
SPACE	Pause/Resume training
S	Save model
R	Reset stats
Q / ESC	Exit