Proper credit assignment over long timespans is a fundamental problem in reinforcement learning. Even methods designed to combat this problem, such as TD-learning, quickly reach their limits when rewards are sparse or noisy. This paper reframes offline reinforcement learning as a pure sequence modeling problem, in which actions are sampled conditioned on the observed history and the desired future reward. This lets the authors apply recent advances in Transformer-based sequence modeling and achieve competitive results on offline RL benchmarks.
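To make the "desired future rewards" conditioning concrete, here is a minimal sketch of how a recorded episode could be turned into a reward-to-go sequence. It assumes undiscounted returns, and the helper names `returns_to_go` and `interleave` are made up for illustration; they are not the authors' code.

```python
from typing import Any, List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """For each timestep t, sum the rewards from t to the end of the episode.

    Assumption: no discounting is applied.
    """
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each entry accumulates all rewards that follow it.
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


def interleave(
    rtg: Sequence[float], states: Sequence[Any], actions: Sequence[Any]
) -> List[Tuple[str, Any]]:
    """Lay out one episode as (return-to-go, state, action) triples in time order.

    A sequence model trained on this layout predicts each action from the
    tokens that precede it, including the return it is asked to achieve.
    """
    tokens: List[Tuple[str, Any]] = []
    for r, s, a in zip(rtg, states, actions):
        tokens.extend([("return_to_go", r), ("state", s), ("action", a)])
    return tokens


# Hypothetical three-step episode with a single sparse reward at the end.
rewards = [0.0, 0.0, 1.0]
sequence = interleave(returns_to_go(rewards), ["s0", "s1", "s2"], ["a0", "a1", "a2"])
```

With a sparse final reward, every timestep carries the same return-to-go, so the reward signal is visible at each action without having to be propagated backwards through a value function.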
0:00 - Intro & Overview
4:15 - Offline Reinforcement Learning
10:10 - Transformers in RL
14:25 - Value Functions and Temporal Difference Learning
20:25 - Sequence Modeling and Reward-to-go
27:20 - Why this is ideal for offline RL
31:30 - The context length problem
34:35 - Toy example: Shortest path from random walks
41:00 - Discount factors
45:50 - Experimental Results
49:25 - Do you need to know the best possible reward?
52:15 - Key-to-door toy experiment
56:00 - Comments & Conclusion