Learning to execute instructions in a Minecraft dialogue

Prashant Jayannavar, Anjali Narayan-Chen, Julia Hockenmaier
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), July 2020

PDF
ACL talk video
ACL talk slides
Code
The Minecraft Collaborative Building Task is a two-player game in which an Architect (A) instructs a Builder (B) to construct a target structure out of 3D blocks. We consider the task of predicting B's action sequences (block placements and removals) in a given game context, and show that capturing B's past actions as well as B's perspective leads to a significant improvement in performance on this challenging language understanding problem.

The Builder Action Prediction (BAP) Task

We define the Builder Action Prediction (BAP) Task as the task of predicting the sequence of actions (block placements and/or removals) that a human Builder performed at a particular point in a human-human game.

Example

The following shows a sample sequence of human-human game states. The game starts with an empty grid and an initial A instruction (a), which B executes in the first action sequence (b) by placing a single block. In (c), B begins to execute the next A instruction given in (b). However, A interrupts B in (c), leading to two distinct B action sequences: (b)‐(c) (single block placement), and (c)‐(h) (multiple placements and removals).
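
To make the segmentation in this example concrete, here is a minimal sketch that splits a chronological game log into Builder action sequences wherever an utterance intervenes. The event format, the `A_UTTERANCE` and `B_ACTION` tags, and the function name `split_action_sequences` are assumptions made for illustration, not the format used in our data or code.

```python
from itertools import groupby

def split_action_sequences(events):
    """Group maximal runs of consecutive Builder actions.

    `events` is a chronological list of (tag, payload) pairs. The tags
    "A_UTTERANCE" and "B_ACTION" are hypothetical names for this sketch.
    """
    sequences = []
    for is_action, run in groupby(events, key=lambda e: e[0] == "B_ACTION"):
        if is_action:
            sequences.append([payload for _, payload in run])
    return sequences

# Toy log mirroring the example: one placement, one placement cut short
# by an interruption, then several placements and removals.
events = [
    ("A_UTTERANCE", "first instruction"),
    ("B_ACTION", ("place", (0, 0, 0), "red")),
    ("A_UTTERANCE", "next instruction"),
    ("B_ACTION", ("place", (1, 0, 0), "red")),
    ("A_UTTERANCE", "interruption"),
    ("B_ACTION", ("remove", (1, 0, 0), "red")),
    ("B_ACTION", ("place", (0, 1, 0), "blue")),
    ("B_ACTION", ("place", (0, 2, 0), "blue")),
]
sequences = split_action_sequences(events)
print([len(s) for s in sequences])
```

Each resulting list would correspond to one BAP item under this assumed segmentation rule.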

Evaluation

To evaluate models for the BAP task, we compare each model's predicted action sequence against the corresponding action sequence that the human builder performed at that point in the game. Specifically, we compute a micro‐averaged F1 between net actions in the ground truth (human) sequence and in the model's predicted sequence.
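
The metric can be sketched roughly as follows. This is an illustrative sketch, not our evaluation script: the action representation `(kind, position, color)`, the rule that a placement and a later removal of the same block cancel out, and the requirement of an exact match on kind, position, and color are all assumptions made for the example.

```python
from collections import Counter

def net_actions(actions):
    """Reduce a sequence to its net effect (assumed definition).

    A placement and a removal of the same (position, color) block cancel.
    """
    balance = Counter()
    for kind, pos, color in actions:
        balance[(pos, color)] += 1 if kind == "place" else -1
    return {
        ("place" if n > 0 else "remove", pos, color)
        for (pos, color), n in balance.items()
        if n != 0
    }

def micro_f1(pairs):
    """Micro-averaged F1: pool TP/FP/FN over all (predicted, reference) pairs."""
    tp = fp = fn = 0
    for predicted, reference in pairs:
        pred, gold = net_actions(predicted), net_actions(reference)
        tp += len(pred & gold)
        fp += len(pred - gold)
        fn += len(gold - pred)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

pairs = [
    # The predicted place/remove of the same block cancels out.
    ([("place", (0, 0, 0), "red"),
      ("place", (1, 0, 0), "red"),
      ("remove", (1, 0, 0), "red")],
     [("place", (0, 0, 0), "red")]),
    # The prediction misses one of two reference placements.
    ([("place", (0, 1, 0), "blue")],
     [("place", (0, 1, 0), "blue"), ("place", (0, 2, 0), "blue")]),
]
score = micro_f1(pairs)
print(round(score, 3))
```

Because counts are pooled before computing precision and recall, longer action sequences contribute more to the score than short ones.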

Data

We use the Minecraft Dialogue Corpus. Our training, test, and development splits contain 3709, 1616, and 1331 Builder action sequences, respectively. In this work we also propose data augmentation techniques to generate additional synthetic training data. We increase the training data to 7418 (2x), 14836 (4x), and 22254 (6x) items by sampling items from the synthetic data.
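
The augmented training set sizes are multiples of the original 3709 items. The sketch below shows one way such sets could be assembled. It assumes that all original items are kept and that synthetic items are sampled without replacement; the function name `build_training_set`, the fixed seed, and the size of the synthetic pool are hypothetical.

```python
import random

def build_training_set(original, synthetic, factor, seed=0):
    """Return the original items plus enough sampled synthetic items
    to reach `factor` times the original size (assumed procedure)."""
    extra = (factor - 1) * len(original)
    if extra > len(synthetic):
        raise ValueError("not enough synthetic items to sample from")
    rng = random.Random(seed)
    return list(original) + rng.sample(list(synthetic), extra)

# Placeholder items standing in for Builder action sequences.
original = list(range(3709))
synthetic = [f"syn-{i}" for i in range(20000)]  # pool size is made up

sizes = {f: len(build_training_set(original, synthetic, f)) for f in (2, 4, 6)}
print(sizes)
```

With 3709 original items this gives 7418, 14836, and 22254 items for the 2x, 4x, and 6x settings, matching the figures above.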

BAP Models

We developed end‐to‐end neural models for the BAP task. Our best model achieves an F1 of 21.2%. Follow the instructions below to use our data and models.

Installation Instructions

Clone our GitHub repo. It hosts our data items, augmented data items and models. Follow the instructions in the README for setup.

Citing our work

If you use this work, please cite:

@inproceedings{jayannavar-etal-2020-learning,
    title = "Learning to execute instructions in a {M}inecraft dialogue",
    author = "Jayannavar, Prashant  and
      Narayan-Chen, Anjali  and
      Hockenmaier, Julia",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.232",
    pages = "2589--2602",
    abstract = "The Minecraft Collaborative Building Task is a two-player game in which an Architect (A) instructs a Builder (B) to construct a target structure in a simulated Blocks World Environment. We define the subtask of predicting correct action sequences (block placements and removals) in a given game context, and show that capturing B{'}s past actions as well as B{'}s perspective leads to a significant improvement in performance on this challenging language understanding problem.",
}

Acknowledgements

This work was supported by Contract W911NF-15-1-0461 with the US Defense Advanced Research Projects Agency (DARPA) Communicating with Computers Program and the Army Research Office (ARO). Approved for Public Release, Distribution Unlimited. The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government.

Last modified: Thu Jul 25 15:46:51 CDT 2019