Deep Q-Learning Simulation
An interactive Gymnasium environment for reinforcement learning experiments
Core Capabilities
Interactive Training Lab
PyGame UI with live visualization and full parameter control
DQN Agent
Deep Q-Network with replay buffer and epsilon decay
Real-time Metrics
Live reward graphs, moving averages, and performance tracking
Model Management
Save, load, and transfer learning between configurations
Curriculum Learning
4-stage progression from simple to complex tasks
Fully Customizable
Grid size, packages, learning rate, and more
Quick Start
Clone Repository
git clone https://github.com/ordocaelum/warehouse-rl-demo.git
Install Dependencies
pip install -r requirements.txt
Launch Training Lab
python ui_dashboard.py
Documentation
Interactive Training Lab
Launch the full-featured interactive UI with all parameter controls.
python ui_dashboard.py
Left Panel: Parameter Controls
| Parameter | Range | Default | Description |
|---|---|---|---|
| Grid Size | 6-15 | 8 | Warehouse grid dimensions |
| Packages | 1-5 | 1 | Number of packages to deliver |
| Max Steps | 100-500 | 300 | Episode step limit |
| Timesteps (k) | 10k-500k | 100k | Total training timesteps |
| Learning Rate | 1e-4 to 1e-2 | 1e-3 | DQN learning rate |
| Init Epsilon | 0.5-1.0 | 1.0 | Starting exploration rate |
| Final Epsilon | 0.01-0.3 | 0.1 | Ending exploration rate |
| Buffer (k) | 1k-50k | 10k | Replay buffer size |
| Step Cost | 0.01-0.5 | 0.05 | Per-step reward penalty |
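The table above can be mirrored in code as a small settings object with range checks. This is a minimal sketch only: `TrainingConfig`, `PARAM_RANGES`, and the field names are hypothetical and are not taken from `ui_dashboard.py`; the ranges and defaults are copied from the table.

```python
from dataclasses import dataclass

# Allowed (min, max) ranges copied from the parameter table.
PARAM_RANGES = {
    "grid_size": (6, 15),
    "packages": (1, 5),
    "max_steps": (100, 500),
    "timesteps": (10_000, 500_000),
    "learning_rate": (1e-4, 1e-2),
    "init_epsilon": (0.5, 1.0),
    "final_epsilon": (0.01, 0.3),
    "buffer_size": (1_000, 50_000),
    "step_cost": (0.01, 0.5),
}


@dataclass
class TrainingConfig:
    """Hypothetical container; defaults match the table's Default column."""
    grid_size: int = 8
    packages: int = 1
    max_steps: int = 300
    timesteps: int = 100_000
    learning_rate: float = 1e-3
    init_epsilon: float = 1.0
    final_epsilon: float = 0.1
    buffer_size: int = 10_000
    step_cost: float = 0.05

    def validate(self) -> None:
        """Raise ValueError if any field is outside its documented range."""
        for name, (low, high) in PARAM_RANGES.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")
```

Calling `TrainingConfig().validate()` passes with the defaults, while something like `TrainingConfig(grid_size=20).validate()` raises a `ValueError`.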
Environment Details
Action Space
| Action | Description |
|---|---|
| 0 | Move up |
| 1 | Move down |
| 2 | Move left |
| 3 | Move right |
| 4 | Pickup package |
| 5 | Dropoff package |
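The action table above maps naturally onto an integer enum. The sketch below is illustrative: the `Action` enum, `MOVE_DELTAS`, and `apply_move` are hypothetical names, and two details are assumptions rather than documented behavior: that "up" decreases the row index (screen coordinates) and that moves into a wall leave the agent in place.

```python
from enum import IntEnum


class Action(IntEnum):
    """Action ids as listed in the action table."""
    UP = 0
    DOWN = 1
    LEFT = 2
    RIGHT = 3
    PICKUP = 4
    DROPOFF = 5


# Assumption: (dx, dy) offsets with y growing downward, as on screen.
MOVE_DELTAS = {
    Action.UP: (0, -1),
    Action.DOWN: (0, 1),
    Action.LEFT: (-1, 0),
    Action.RIGHT: (1, 0),
}


def apply_move(x, y, action, grid_size):
    """Return the new (x, y), clamped to the grid.

    Pickup and dropoff are not movement actions, so they leave the
    position unchanged.
    """
    dx, dy = MOVE_DELTAS.get(Action(action), (0, 0))
    new_x = min(max(x + dx, 0), grid_size - 1)
    new_y = min(max(y + dy, 0), grid_size - 1)
    return new_x, new_y
```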
Reward Structure
- +10 for delivering a package to the delivery zone
- +1 for picking up a package
- +0.1 bonus for each step that reduces the distance to the nearest package
- -0.01 × distance from the nearest package, per step
- -0.05 per step (step cost, adjustable via the Step Cost parameter)
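The reward terms listed above can be combined into a single per-step function. The sketch below is one possible reading, not the environment's actual code: `step_reward` and `manhattan` are hypothetical helpers, Manhattan distance is an assumption, and the distance penalty is assumed to apply on every step in proportion to the current distance.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) grid cells (assumed metric)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def step_reward(delivered, picked_up, prev_distance, new_distance,
                step_cost=0.05):
    """Sum the documented reward terms for one environment step."""
    reward = -step_cost                # per-step cost
    if delivered:
        reward += 10.0                 # package delivered
    if picked_up:
        reward += 1.0                  # package picked up
    if new_distance < prev_distance:
        reward += 0.1                  # moved closer to the nearest package
    reward -= 0.01 * new_distance      # assumed: penalty scales with distance
    return reward
```

For example, a plain move that brings the agent from distance 5 to distance 4 would score -0.05 + 0.1 - 0.04 under this reading.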
Curriculum Learning Tutorial
Train the agent through a 4-stage progression from simple to complex tasks.
Stage 1: Master the Basics
Configuration:
- Grid: 8×8
- Packages: 1
- Timesteps: 100k
Agent learns core mechanics: navigation, pickup, delivery
Expected Reward: +8 to +10
Training Time: 5-10 minutes
Stage 2: Add Complexity
Configuration:
- Grid: 10×10
- Packages: 2
- Timesteps: 150k
Agent learns multi-objective planning and route optimization
Expected Reward: +2 to +6
Training Time: 10-15 minutes
Stage 3: Scale Up
Configuration:
- Grid: 12×12
- Packages: 3
- Timesteps: 200k
Agent learns to handle larger environments and more objectives
Expected Reward: -2 to +2
Training Time: 15-20 minutes
Stage 4: Challenge
Configuration:
- Grid: 15×15
- Packages: 3-5
- Timesteps: 300k+
Agent tackles complex real-world scenarios with many objectives
Expected Reward: -12 to -6
Training Time: 30-60 minutes
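The four stages above can be written down as data, which makes it easy to step through them in order. This is a sketch under assumptions: `Stage`, `CURRICULUM`, and `ready_to_advance` are hypothetical names, and advancing once the moving average reaches the low end of the expected reward range is an illustrative rule, not something the project prescribes.

```python
from typing import NamedTuple, Tuple


class Stage(NamedTuple):
    name: str
    grid_size: int
    packages: Tuple[int, int]           # (min, max) package count
    timesteps: int                      # minimum training timesteps
    expected_reward: Tuple[float, float]  # (low, high)


# Values copied from the four stage descriptions.
CURRICULUM = [
    Stage("Master the Basics", 8, (1, 1), 100_000, (8.0, 10.0)),
    Stage("Add Complexity", 10, (2, 2), 150_000, (2.0, 6.0)),
    Stage("Scale Up", 12, (3, 3), 200_000, (-2.0, 2.0)),
    Stage("Challenge", 15, (3, 5), 300_000, (-12.0, -6.0)),
]


def ready_to_advance(stage, moving_average):
    """Assumed rule: advance once the moving average reaches the
    low end of the stage's expected reward range."""
    return moving_average >= stage.expected_reward[0]
```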
Best Practices
Parameter Tuning Guidelines
Learning Rate
1e-4: Stable but slow learning
1e-3: Balanced (recommended)
1e-2: Fast but unstable
Timesteps
50k: Quick experiments
100k: Standard baseline
200k+: For convergence
Common Mistakes to Avoid
Mistake 1: Jumping to Hard Tasks Too Early
Start with 8×8, 1 package. Master it. Then scale up gradually.
Mistake 2: Changing Parameters Mid-Training
Set all parameters first, then click START. Don't adjust during training.
Mistake 3: Expecting Immediate Results
Train for at least 10 minutes before evaluating. Learning takes time.
Mistake 4: Ignoring the Moving Average
Watch the orange trend line, not the noisy blue episode rewards.
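To see why the trend line is easier to read than raw episode rewards, here is a minimal sketch of a windowed moving average. The `MovingAverage` class and the window size of 100 episodes are assumptions; the dashboard's actual window is not documented here.

```python
from collections import deque


class MovingAverage:
    """Average of the most recent `window` episode rewards."""

    def __init__(self, window=100):
        # deque with maxlen drops the oldest reward automatically.
        self._values = deque(maxlen=window)

    def add(self, reward):
        """Record one episode reward and return the updated average."""
        self._values.append(reward)
        return sum(self._values) / len(self._values)
```

A single unlucky episode shifts the average by at most its share of the window, so the line moves slowly even when individual rewards jump around.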
Model Management
When to Save
- When moving average plateaus
- Before trying new configurations
- After significant improvements
Transfer Learning
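Load a saved model to continue training instead of starting from scratch.
Important: Only load a model into the SAME environment config it was trained on. For a new config (for example, moving from 8×8, 1pkg to 10×10, 2pkg), train fresh: the observation size depends on the package count (3 + 3N values for N packages), so a model trained with one package count would not match the input shape of another.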
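One way to follow the same-configuration rule above is to store the environment settings next to the saved model and compare them before loading. The sketch below is a minimal illustration; the sidecar `.json` file, the function names, and the choice of keys are all assumptions and not part of the project.

```python
import json
from pathlib import Path

# Assumption: these are the settings that must match before loading.
CONFIG_KEYS = ("grid_size", "packages")


def save_config(model_path, config):
    """Write the environment settings to a JSON file next to the model."""
    sidecar = Path(model_path).with_suffix(".json")
    sidecar.write_text(json.dumps({k: config[k] for k in CONFIG_KEYS}))
    return sidecar


def config_mismatches(model_path, current_config):
    """Return the keys whose saved value differs from the current config."""
    sidecar = Path(model_path).with_suffix(".json")
    saved = json.loads(sidecar.read_text())
    return [k for k in CONFIG_KEYS if saved.get(k) != current_config.get(k)]
```

An empty list from `config_mismatches` means the saved model and the current environment use the same grid size and package count.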
Configuration Examples
Beginner Setup
- Grid: 8×8
- Packages: 1
- Timesteps: 100k
- Learning Rate: 1e-3
Expected reward: +8 to +10
Perfect for learning the basics
Intermediate Setup
- Grid: 10×10
- Packages: 2
- Timesteps: 150k
- Learning Rate: 1e-3
Expected reward: +2 to +6
Good for multi-objective learning
Advanced Setup
- Grid: 15×15
- Packages: 5
- Timesteps: 300k
- Learning Rate: 1e-3
Expected reward: -12 to -8
Challenge yourself with complex scenarios
Technical Reference
Observation Space
For N packages, the observation has 3 + 3N dimensions:
- Agent X position
- Agent Y position
- Carrying status (0 or 1)
- For each package: X, Y, delivery_status
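The layout above can be illustrated by assembling the vector by hand. This is a sketch only: `build_observation` is a hypothetical helper, the field order follows the list above, and encoding the flags as 0.0 or 1.0 floats is an assumption.

```python
def build_observation(agent_xy, carrying, packages):
    """Flatten the state into a list of 3 + 3N floats.

    agent_xy: (x, y) position of the agent
    carrying: True if the agent holds a package
    packages: iterable of (x, y, delivered) tuples, one per package
    """
    obs = [float(agent_xy[0]), float(agent_xy[1]), 1.0 if carrying else 0.0]
    for x, y, delivered in packages:
        obs.extend([float(x), float(y), 1.0 if delivered else 0.0])
    return obs
```

With two packages the result has 3 + 3 × 2 = 9 entries, and with one package it has 6.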
DQN Parameters
- Learning rate: 1e-3 (adjustable)
- Replay buffer size: 10k (adjustable)
- Target update interval: 500 steps
- Exploration fraction: 20% of training
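The exploration fraction and target update interval above can be sketched as two small helpers. The function names are hypothetical, and the linear decay shape is an assumption: epsilon is taken to fall in a straight line from its initial to its final value over the first 20% of training, then stay constant.

```python
def epsilon_at(step, total_timesteps, init_eps=1.0, final_eps=0.1,
               exploration_fraction=0.2):
    """Exploration rate at a given training step (assumed linear decay)."""
    decay_steps = exploration_fraction * total_timesteps
    if decay_steps <= 0:
        return final_eps
    # Fraction of the decay period completed, capped at 1.0.
    progress = min(step / decay_steps, 1.0)
    return init_eps + progress * (final_eps - init_eps)


def should_update_target(step, interval=500):
    """True on steps where the target network would be synchronized."""
    return step > 0 and step % interval == 0
```

With 100k timesteps and the defaults, epsilon would reach its final value at step 20,000 under this assumption.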
Keyboard Shortcuts
| Key | Action |
|---|---|
| SPACE | Pause/Resume training |
| S | Save model |
| R | Reset stats |
| Q / ESC | Exit |