Summaries from AAAI Conference on Artificial Intelligence on ShortScience.org

www.aaai.org
sci-hub
scholar.google.com

Learning the Preferences of Ignorant, Inconsistent Agents
Evans, Owain and Stuhlmüller, Andreas and Goodman, Noah D.
AAAI Conference on Artificial Intelligence - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Karol Kubicki 7 years ago

### Main ideas

**Key problem:** To infer our preferences, even though our behavior may systematically diverge from them. Examples: a person who smokes event though they prefer not to (but are unable to quit) or somebody who would like to eat healthily, but regularly succumbs to temptation of donuts (which they consider unhealthy).

**Proposed solution:** Modeling human biases directly into reasoning about given agent's behaviour. In the proposed solution [hyperbolic discounting](https://en.wikipedia.org/wiki/Hyperbolic_discounting) is used to account for our time inconsistency.

### Details

Imagine a grid-world in which an agent moves around the grid to find a place to eat. https://i.imgur.com/dxL8fA1.png

An agent is a tuple: $(p(s), U, Y, k, \alpha)$, where:
* $s\in S$ is a state of the world, it is not described in detail in the paper, but, among others, it consists of things like: noodle place is open, vegetarian place is closed.
* $p(s)$ is agent's belief about which state of the world it is in - it is modeled as a probability distribution over states.
* $U$ is agent's (deterministic) utility function - this is the thing we would like to learn the most by observing agent's actions - $U: S \times A \rightarrow \mathbb{R}$, we assign utilities to actions, $a\in A$, in world states $s$.
* Agent chooses actions stochastically where probability ($C(a;s)$) is proportional to their exponentiated expected utility: $C(a;s) \propto \exp^{\alpha EU_{s}[a]}$ or for discounting agents: $C(a;s) \propto \exp^{\alpha EU_{s,d}[a]}$, where $\alpha$ is noise parameter (the lower it is the more randomly agent behaves). Expected utility is described below.
* $Y$ is a variable that denotes the kind of agent:
    * not discounting agent - as its name suggests it won't discount the utility of future actions regardless of the delay, so its expected utility is: $EU_s[a] = U(s,a) + \mathbb{E}_{s',a'}[EU_s'[a']]$, where $s'$ is a state in which agent ends after choosing action $a$ from state $s$ and $a'$ is action choosen in $s'$,
    * discounting naive agent - it discounts utility of future actions based on the delay $d$: $EU_{s,d}[a] =\frac{1}{1+kd} U(s,a) + \mathbb{E}_{s',a'}[EU_{s',d+1}[a']]$, where $k$ is a discount rate (part of agent's description (see tuple above)). 
    Because of the discounting the utility of actions changes with time. It is possible then that an agent who decided to go to vegetarian cafe will change its decision once it is next to donut store. This is shown on the left in the image above. If the agent wanted to go to donut place it could've gone to the closer one. That is why it is called the naive agent - it doesn't take into account that the utility of its action changes and ends up doing things that it didn't plan.    
    * discounting sophisticated agent - its expected utility is also discounted like in naive case, but it chooses future actions $a'$ as if the delay $d$ was $0$. In a sense, it knows that its future self will look at the immediate utility of actions rather then at the utilities they have now. Thanks to that it can for example choose a different path to vegetarian restaurant. It knows that it future self would end up in donut place if it went next to it.
* $k$ is the discount rate (see dicounting naive agent description).
* $\alpha$ is the noise parameter described above together with actions.

Given a sequence of actions done by some agent, we want to infer its preferences. In the paper this is translated into: given a sequence of actions done by some agent update your probability distribution over agent tuples. We start with uniform distribution (zero knowledge) and do bayesian updates with consecutive actions. The model described above will be considered good if probability mass after the updates concentrates on the kinds of agents that a human would infer after seeing the same.

### Results

According to experiments the model performs well, which means that it assigns high probabilities to the kind of agents that humans describe after seeing the same actions. 

For example, after seeing actions from the image above the model as well as human subjects would rate highly explanations of giving in to temptation (in case of naive planner) or avoiding temptation (in case of sophisticated one). The result holds for more complex scenarios: 
* inference with uncertainty - agent might have inaccurate beliefs, for example it might 'think' that the noodle place is open when in fact it's closed, 
* inference from multiple episodes - even though in two out of three cases agent chooses donuts both human subjects and the model assign high probability to the case where the vegetarian place is preferred (among others, they generally agree over variety of explanations).

**Conclusion:** If we want to be able to delegate some of our decisions to ai systems then it is necessary that they are able to learn our preferences despite inconsistencies in our behaviour. The result presented in the paper shows that modeling our biases directly is a feasible direction of research.

arxiv.org
scholar.google.com

Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
Szegedy, Christian and Ioffe, Sergey and Vanhoucke, Vincent
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 8 years ago

This paper describes the CNN architecture Inception-v4.

They basically update Inception-v3 to use residual connections (see [He et al](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeZRS15)). They also simplified the architecture as they moved from DistBelief to [TensorFlow](https://www.tensorflow.org/).

## Previous papers

* Inception-v1: [Going deeper with Convolutions](http://www.shortscience.org/paper?bibtexKey=journals/corr/SzegedyLJSRAEVR14)

www.aaai.org
sci-hub
scholar.google.com

Character-Aware Neural Language Models
Kim, Yoon and Jernite, Yacine and Sontag, David and Rush, Alexander M.
AAAI Conference on Artificial Intelligence - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 9 years ago

TLDR; The authors build an LSTM Neural Language model, but instead of using word embeddings as inputs, they use the per-word outputs of a character-level CNN, plus a highway layer. This architecture results in state of the art performance and significantly fewer parameters. It also seems to work well on languages with rich morphology.

#### Key Points

- Small Model: 15-dimensional char embeddings, filter sizes 1-6, tanh, 1-layer highway with ReLU, 2-layer LSTM with 300-dimensional cells. 5M Parameters. Hiearchical Softmax.
- Large Model: 15-dimensional char embeddings, filter sizes 1-7, tanh, 2-layer highway with ReLU, 2-layer LSTM with 670-dimensional cells. 19M Parameters. Hiearchical Softmax.
- Can generalize to out of vocabulary words due to character-level representations. Some datasets already had OOV words replaced with a special token, so the results don't reflect this.
- Highway Layers are key to performance. Susbtituting HW with MLP does not work well. Intuition is that HW layer adaptively combines different local features for higher-level representation.
- Nearest neighbors after Highway layer are more smenatic than before highway layer. Suggests compositional nature.
- Surprisingly combinbing word and char embeddings as LSTM input results in worse performance - Characters alone are sufficient?
- Can apply same architecture to NML or Classification tasks. Highway Layers at the output may also help these tasks.

#### Notes / Questions

- Essentially this is a new way to learn word embeddings comprised of lower-level character embeddings. Given this, what about stacking this architecture and learn sentence representations based on these embeddings?
- It is not 100% clear to me why the MLP at the output layer does so much worse. I understand that the highway layer can adaptively combine feature, but what if you combined MLP and plain representations and add dropout? Shouldn't that result in similar perfomance?
- I wonder if the authors experimented with higher-dimensional character embeddings. What is the intuition behind the very low-dimensional (15) embeddings?

www.aaai.org
sci-hub
scholar.google.com

Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models
Serban, Iulian Vlad and Sordoni, Alessandro and Bengio, Yoshua and Courville, Aaron C. and Pineau, Joelle
AAAI Conference on Artificial Intelligence - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 9 years ago

TLDR; The authors train a Hierarchical Recurrent Encoder-Decoder (HRED) network for dialog generation. The "lower" level encodes a sequence of words into a though vector, and the higher-level encoder uses these thought vectors to build a representation of the context. The authors evaluate their model on the *MoviesTriples* dataset using perplexity measures and achieve results better than plain RNNs and the DCGM model. Pre-training with a large Question-Answer corpus significantly reduces perplexity.

#### Key Points

- Three RNNs: Utterance encoder, context encoder, and decoder. GRU hidden units, ~300d hidden state spaces.
- 10k vocabulary. Preprocessing: Remove entities and numbers using NLTK
- The context in the experiments is only a single utterance
- MovieTriples is a small dataset, about 200k training triples. Pretraining corpus has 5M Q-A pairs, 90M tokens.
- Perplexity is used as an evaluation metric. Not perfect, but reasonable.
- Pre-training has a much more significant impact than the choice of the model architecture. It reduces perplexity ~10 points, while model architecture makes a tiny difference (~1 point).
- Authors suggest exploring architectures that separate semantic from syntactic structure
- Realization: Most good predictions are generic. Evaluation metrics like BLEU will favor pronouns and punctuation marks that dominate during training and are therefore bad metrics.

#### Notes/Questions

- Does using a larger dataset eliminate the need for pre-training?
- What about the more challenging task for longer contexts?