Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models
Ashwin K Vijayakumar, Michael Cogswell, Ramprasath R. Selvaraju, Qing Sun, Stefan Lee, David Crandall, Dhruv Batra
arXiv e-Print archive, 2016
Keywords:
cs.AI, cs.CL, cs.CV
First published: 2016/10/07. Abstract: Neural sequence models are widely used to model time-series data in many
fields. Equally ubiquitous is the usage of beam search (BS) as an approximate
inference algorithm to decode output sequences from these models. BS explores
the search space in a greedy left-right fashion retaining only the top-$B$
candidates -- resulting in sequences that differ only slightly from each other.
Producing lists of nearly identical sequences is not only computationally
wasteful but also typically fails to capture the inherent ambiguity of complex
AI tasks. To overcome this problem, we propose \emph{Diverse Beam Search}
(DBS), an alternative to BS that decodes a list of diverse outputs by
optimizing for a diversity-augmented objective. We observe that our method
finds better top-1 solutions by controlling for the exploration and
exploitation of the search space -- implying that DBS is a \emph{better search
algorithm}. Moreover, these gains are achieved with minimal computational or
memory overhead as compared to beam search. To demonstrate the broad
applicability of our method, we present results on image captioning, machine
translation and visual question generation using both standard quantitative
metrics and qualitative human studies. Our method consistently outperforms BS
and previously proposed techniques for diverse decoding from neural sequence
models.
[Code](https://github.com/ashwinkalyan/dbs), [Live Demo](http://dbs.cloudcv.org/) ([code for demo site](https://github.com/Cloud-CV/diverse-beam-search))
Diverse Beam Search (DBS) is an alternative to Beam Search (BS). It decodes a diverse list by dividing the beam budget $B$ (e.g. 6) into $G$ groups (e.g. 3) and enforcing diversity between the groups of beams.
At every time step $t$, iterate over the groups in order. In the 1st group, find $B'=B/G$ (e.g. 2) partial beams $Y^1_{[t]} = \{y^1_{b,t} : b \in [B']\}$ using standard BS with the NLL score. In the 2nd group, find the partial beams $y^2_{b,t}$ using BS with a score that is the sum of the NLL and a diversity term, weighted by a factor $\lambda_t$, comparing the partial beam to the partial beams already chosen for the 1st group. In general, for group $g$ the diversity term compares the partial beam $y^g_{b,t}$ to all partial beams of the groups already expanded at the current time step: $\Delta(Y^1_{[t]},\ldots,Y^{g-1}_{[t]})[y^g_{b,t}]$
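A minimal sketch of one DBS time step, assuming a hypothetical `log_probs(prefix)` that returns next-token log-probabilities and a pluggable `diversity(token, prev_groups)` penalty (these names and signatures are illustrative, not from the paper's code):

```python
import heapq


def dbs_step(groups, log_probs, diversity, lam, beams_per_group):
    """One time step of Diverse Beam Search (sketch).

    groups: list of G lists; each inner list holds (score, prefix) pairs, where
            score is the running NLL plus diversity penalties (lower is better).
    log_probs: callable prefix -> dict {token: log p(token | prefix)}.
    diversity: callable (token, prev_groups) -> similarity penalty (>= 0).
    lam: diversity weight lambda_t.
    beams_per_group: B' = B / G.
    """
    new_groups = []
    for g, beams in enumerate(groups):
        candidates = []
        for score, prefix in beams:
            for token, logp in log_probs(prefix).items():
                # NLL of the extended beam, plus a penalty for resembling the
                # groups that were already expanded at this time step.
                penalty = lam * diversity(token, new_groups) if g > 0 else 0.0
                candidates.append((score - logp + penalty, prefix + [token]))
        # Keep the B' lowest-scoring candidates for this group.
        new_groups.append(
            heapq.nsmallest(beams_per_group, candidates, key=lambda c: c[0])
        )
    return new_groups
```

The 1st group ignores the penalty, so it reduces to standard beam search; each later group is pushed away from the tokens already chosen by the earlier groups.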
Evaluation Metrics:
* Oracle Accuracy: maximum value of the task metric (e.g. BLEU) over the list of final beams
* Diversity Statistics: number of distinct n-grams across all final beams (a sketch of both follows this list)
* Human preference
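A hedged sketch of how the first two metrics can be computed over the final beams, assuming tokenized sequences and using NLTK's `sentence_bleu` as one possible choice of task metric:

```python
from itertools import chain

from nltk.translate.bleu_score import sentence_bleu  # one possible task metric


def oracle_score(references, beams):
    """Oracle accuracy: best metric value over the list of final beams."""
    return max(sentence_bleu(references, beam) for beam in beams)


def distinct_ngrams(beams, n):
    """Diversity statistic: number of distinct n-grams across all final beams."""
    grams = chain.from_iterable(
        zip(*(beam[i:] for i in range(n))) for beam in beams
    )
    return len(set(grams))
```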
Parameters:
* $G=B$ allows for maximum exploration and is found to improve oracle accuracy.
* $\lambda \in [0.2, 0.8]$
The diversity term between a partial beam and all previously expanded groups decomposes into a sum of per-group terms:
$$\Delta(Y^1_{[t]},\ldots,Y^{g-1}_{[t]}) = \sum^{g-1}_{h=1}\Delta(Y^h_{[t]})$$
The individual term $\Delta(Y^h_{[t]})[y^g_{b,t}]$ is taken to be one of the following (a sketch of the Hamming variant follows this list):
* Hamming (gives the best oracle performance): proportional to the number of times the latest token of $y^g_{b,t}$ was selected as the latest token of a beam in $Y^h_{[t]}$.
* Cumulative: accumulates mismatches over time steps and backs off the penalty once the beams have already diverged: $\exp\{-(\sum_{\tau \le t} \sum_{b \in [B']} \mathbb{1}[y^h_{b,\tau} \neq y^g_{b,\tau}])/\Gamma\}$
* n-gram: the number of times each n-gram of the candidate occurred in previous groups.
* Neural-embedding: any of the above, with the exact token match replaced by the cosine similarity of the tokens' word2vec embeddings (or, for n-grams, the sum of the word2vec embeddings of the n-gram's tokens).
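A sketch of the Hamming variant, matching the `diversity(token, prev_groups)` signature assumed in the decoding sketch above (again, the names are illustrative):

```python
def hamming_penalty(token, prev_groups):
    """Number of beams in already-expanded groups whose latest token equals `token`."""
    return sum(
        prefix[-1] == token
        for beams in prev_groups
        for _, prefix in beams
    )
```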
My 2 cents:
* Once a beam reaches EOS, you should stop comparing it with the other groups.
* Using DBS causes results to be longer, perhaps too long; you can reduce length by adding a length penalty to the score (one option is sketched below).
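One common choice for such a penalty (the GNMT length normalization of Wu et al., not something from the DBS paper) is to divide each final beam's NLL by a function of its length before ranking:

```python
def length_normalized(nll, length, alpha=0.6):
    """GNMT-style length penalty; alpha = 0 recovers the unnormalized score."""
    return nll / (((5 + length) / 6) ** alpha)
```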