Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models
Ashwin K Vijayakumar, Michael Cogswell, Ramprasath R. Selvaraju, Qing Sun, Stefan Lee, David Crandall, Dhruv Batra
arXiv e-Print archive, 2016
Keywords: cs.AI, cs.CL, cs.CV
First published: 2016/10/07
Abstract: Neural sequence models are widely used to model time-series data in many
fields. Equally ubiquitous is the usage of beam search (BS) as an approximate
inference algorithm to decode output sequences from these models. BS explores
the search space in a greedy left-right fashion retaining only the top-$B$
candidates -- resulting in sequences that differ only slightly from each other.
Producing lists of nearly identical sequences is not only computationally
wasteful but also typically fails to capture the inherent ambiguity of complex
AI tasks. To overcome this problem, we propose \emph{Diverse Beam Search}
(DBS), an alternative to BS that decodes a list of diverse outputs by
optimizing for a diversity-augmented objective. We observe that our method
finds better top-1 solutions by controlling for the exploration and
exploitation of the search space -- implying that DBS is a \emph{better search
algorithm}. Moreover, these gains are achieved with minimal computational or
memory overhead as compared to beam search. To demonstrate the broad
applicability of our method, we present results on image captioning, machine
translation and visual question generation using both standard quantitative
metrics and qualitative human studies. Our method consistently outperforms BS
and previously proposed techniques for diverse decoding from neural sequence
models.
[Code](https://github.com/ashwinkalyan/dbs), [Live Demo](http://dbs.cloudcv.org/) ([code for demo site](https://github.com/Cloud-CV/diverse-beam-search))
Diverse Beam Search (DBS) is an alternative to Beam Search (BS). It decodes a diverse list of outputs by dividing the beam budget $B$ (e.g. 6) into $G$ groups (e.g. 3) and enforcing diversity between the groups of beams.
At every time step $t$ the groups are processed in turn. The 1st group finds its $B'=B/G$ (e.g. 2) partial beams $Y^1_{[t]} = \{y^1_{b,t} : b \in [B']\}$ with standard BS on the NLL. The 2nd group finds its partial beams $y^2_{b,t}$ with BS on an augmented score: the NLL plus a similarity penalty between the candidate beam and the partial beams of the 1st group, weighted by a factor $\lambda_t$. In general, for group $g$ the penalty is computed against the partial beams of all groups already processed at the current time step, denoted $\Delta(Y^1_{[t]},\ldots,Y^{g-1}_{[t]})[y^g_{b,t}]$.
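A minimal sketch of this group-wise loop (my own simplification, not the authors' code): `next_token_nll` is a hypothetical hook into the model that returns per-token NLLs for a prefix, `diversity_penalty` is one of the similarity penalties listed further below, and EOS handling is omitted.

```python
def diverse_beam_search(next_token_nll, diversity_penalty, B=6, G=3, T=10, lam=0.5):
    """Simplified DBS sketch: B beams split into G groups of B' = B // G beams each.

    next_token_nll(prefix) -> dict mapping each candidate next token to its NLL
    diversity_penalty(candidate, prev_groups, t) -> similarity penalty added to the score
    Both callables are placeholders (assumptions), not the authors' API.
    """
    Bp = B // G
    # each group holds a list of (sequence, cumulative NLL) pairs, all starting empty
    groups = [[((), 0.0)] for _ in range(G)]

    for t in range(T):
        done = []  # partial beams of groups already extended at this time step
        for g in range(G):
            candidates = []
            for seq, nll in groups[g]:
                for tok, cost in next_token_nll(seq).items():
                    new_seq, new_nll = seq + (tok,), nll + cost
                    # diversity-augmented score: NLL plus lambda-weighted similarity penalty
                    penalty = lam * diversity_penalty(new_seq, done, t) if done else 0.0
                    candidates.append((new_seq, new_nll, new_nll + penalty))
            # standard BS step within the group: keep the B' lowest augmented scores
            candidates.sort(key=lambda c: c[2])
            groups[g] = [(s, n) for s, n, _ in candidates[:Bp]]
            done.append([s for s, _ in groups[g]])
    # return all B final beams as (sequence, NLL) pairs
    return [beam for group in groups for beam in group]
```

With $G=1$ the penalty is never applied and this reduces to plain BS.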
Evaluation Metrics:
* Oracle Accuracy: the maximum value of the task metric (e.g. BLEU) over the list of final beams
* Diversity Statistics: the number of distinct n-grams across all final beams (both sketched in code after this list)
* Human preference
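The two automatic metrics are easy to state in code. A rough sketch, with an illustrative `sentence_metric` (e.g. sentence-level BLEU) standing in for the task metric; the function names are mine, not the paper's:

```python
def oracle_accuracy(beams, reference, sentence_metric):
    # score achievable if an oracle picked the best beam from the final list
    return max(sentence_metric(beam, reference) for beam in beams)

def distinct_ngrams(beams, n=2):
    # number of distinct n-grams appearing across all final beams
    ngrams = set()
    for beam in beams:                       # beam: sequence of tokens
        for i in range(len(beam) - n + 1):
            ngrams.add(tuple(beam[i:i + n]))
    return len(ngrams)
```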
Parameters:
* $G=B$ allows for the maximum exploration and is found to improve oracle accuracy.
* $\lambda \in [0.2, 0.8]$ works well
The penalty between a partial beam and all previously processed groups decomposes into a sum of per-group terms:
$$\Delta(Y^1_{[t]},\ldots,Y^{g-1}_{[t]}) = \sum^{g-1}_{h=1}\Delta(Y^h_{[t]})$$
The individual per-group term $\Delta(Y^h_{[t]})[y^g_{b,t}]$ is taken to be one of the following (all four are sketched in code after this list):
* Hamming (gives the best oracle performance): proportional to the number of times the latest token of $y^g_{b,t}$ was selected as the latest token by the beams in $Y^h_{[t]}$.
* Cumulative: a saturating version of Hamming, where the mismatches accumulated over all time steps are passed through a decaying exponential, so that once two sequences have diverged further mismatches barely change the penalty: $\exp\{-(\sum_{\tau \le t} \sum_{b \in [B']} \mathbb{1}_{[y^h_{b,\tau} \neq y^g_{b,\tau}]})/\Gamma\}$
* n-gram: counts how often the candidate's n-grams occurred in the previous groups
* Neural-embedding: any of the above with the exact token match replaced by the cosine similarity between the word2vec embeddings of the tokens (or the sum of the embeddings of an n-gram's tokens)
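Rough sketches of the four penalties, written (as above) as similarity penalties to be added to the NLL. `embed`, `cosine`, and the default values of $\Gamma$ and $n$ are placeholders of mine, not taken from the paper:

```python
import math

def hamming(candidate, prev_groups, t):
    # times the candidate's latest token was also the latest token of a beam
    # in an already-processed group
    return sum(seq[t] == candidate[t] for group in prev_groups for seq in group)

def cumulative(candidate, prev_groups, t, gamma=2.0):
    # per-group exp(-mismatches / Gamma): maximal when the sequences are identical,
    # decaying once they have already diverged
    total = 0.0
    for group in prev_groups:
        mismatches = sum(seq[tau] != candidate[tau]
                         for seq in group for tau in range(t + 1))
        total += math.exp(-mismatches / gamma)
    return total

def ngram(candidate, prev_groups, t, n=2):
    # how often the candidate's n-grams occurred in already-processed groups
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    prev = [tuple(seq[i:i + n]) for group in prev_groups for seq in group
            for i in range(len(seq) - n + 1)]
    return sum(prev.count(g) for g in cand)

def soft_hamming(candidate, prev_groups, t, embed, cosine):
    # neural-embedding variant: exact token match replaced by embedding similarity
    return sum(cosine(embed(seq[t]), embed(candidate[t]))
               for group in prev_groups for seq in group)
```

`hamming`, `cumulative`, and `ngram` are drop-in replacements for `diversity_penalty` in the DBS sketch above; `soft_hamming` needs `embed` and `cosine` bound first (e.g. with `functools.partial`).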
My 2 cents:
* Once a beam reaches EOS you need to stop including it in the diversity comparison with other groups
* Using DBS causes results to be longer, perhaps too long. You can rein in the length by adding a length penalty to the score (a minimal sketch follows this list)
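A generic way to implement such a length penalty (my suggestion, not something from the paper) is to rank the final beams by the NLL plus a per-token cost:

```python
def length_penalised(nll, length, beta=0.1):
    # add a per-token cost so that longer final beams rank worse; beta is a tuning knob
    return nll + beta * length

# e.g. with the (sequence, NLL) pairs returned by the DBS sketch above:
# best = min(final_beams, key=lambda b: length_penalised(b[1], len(b[0])))
```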
TLDR; The authors propose a new Diverse Beam Search (DBS) decoding procedure that produces more diverse outputs than standard Beam Search (BS). The beam of size B is divided into G groups of size B/G. At each step, beam search is performed for each group with an added similarity penalty (with scaling factor lambda) that encourages the groups to pick different outputs. This is done greedily, i.e. group 1 does regular BS, group 2 is conditioned on group 1, group 3 is conditioned on groups 1 and 2, and so on. Similarity functions include Hamming distance, cumulative diversity, n-gram diversity, and neural-embedding diversity; Hamming distance tends to perform best. The authors evaluate their method on image captioning (COCO, PASCAL-50S), machine translation (Europarl), and visual question generation. For image captioning they run a human evaluation (1000 examples on Mechanical Turk) and find that DBS is preferred over BS 60% of the time.