Benchmark

GameTraversalBenchmark: Evaluating Planning Abilities Of Large Language Models Through Traversing 2D Game Maps

Large language models (LLMs) have recently demonstrated great success in generating and understanding natural language. While they have …

Muhammad Umair Nasir, Steven James, Julian Togelius

MinePlanner: A Benchmark for Long-Horizon Planning in Large Minecraft Worlds

We propose a new benchmark for planning tasks based on the Minecraft game. Our benchmark contains 45 tasks overall, but also provides …

William Hill, Ireton Liu, Anita De Mello Koch, Damion Harvey, Nishanth Kumar, George Konidaris, Steven James

MiDaS: A Large-Scale Minecraft Dataset for Non-Natural Image Benchmarking

Reinforcement learning (RL) has recently made several significant advances using video games as a testbed. While many of these games …

David Torpey, Max Parkin, Jonah Alter, Richard Klein, Steven James

Constructing a Visual Dataset to Study the Effects of Spatial Apartheid in South Africa

Aerial images of neighborhoods in South Africa show the clear legacy of Apartheid, a former policy of political and economic …

Raesetje Sefala, Timnit Gebru, Luzango Mfupe, Nyalleng Moorosi, Richard Klein