A neural architecture for detecting written responses that are off-topic with respect to a visual prompt. The text is encoded with an LSTM, and the resulting text representation is used to condition the image representation. The two representations are then compared to produce a confidence score that the text was written in response to the prompt image.
https://i.imgur.com/FIltq46.png
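
To make the data flow concrete, here is a minimal sketch of the three steps (encode the text, condition the image, compare) in plain Python. Several details are assumptions that the summary does not specify: the conditioning is modelled as a sigmoid gate over the image features, the comparison is a dot product in a shared space squashed to (0, 1), bias terms are omitted, and all names (`TinyLSTM`, `relevance_score`, `W_gate`, `W_text`, `W_image`) and dimensions are hypothetical. The weights are random, so the resulting score only illustrates shapes and flow, not a trained model.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector (list of floats).
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyLSTM:
    """Single-layer LSTM without biases; the final hidden state is the text vector."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        # One weight matrix per gate (input, forget, output, candidate),
        # each acting on the concatenation [x; h].
        self.W = {g: rand_matrix(hidden_dim, input_dim + hidden_dim, rng)
                  for g in "ifoc"}

    def encode(self, word_vectors):
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for x in word_vectors:
            xh = list(x) + h
            i = [sigmoid(a) for a in matvec(self.W["i"], xh)]
            f = [sigmoid(a) for a in matvec(self.W["f"], xh)]
            o = [sigmoid(a) for a in matvec(self.W["o"], xh)]
            g = [math.tanh(a) for a in matvec(self.W["c"], xh)]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h


def relevance_score(text_vec, image_vec, W_gate, W_text, W_image):
    # Conditioning (assumed form): the text vector produces a gate in (0, 1)
    # for every image feature, and the gate rescales the image vector.
    gate = [sigmoid(a) for a in matvec(W_gate, text_vec)]
    conditioned = [g * x for g, x in zip(gate, image_vec)]
    # Comparison (assumed form): project both into a shared space,
    # take the dot product, and squash it to a confidence in (0, 1).
    t = [math.tanh(a) for a in matvec(W_text, text_vec)]
    v = [math.tanh(a) for a in matvec(W_image, conditioned)]
    return sigmoid(sum(a * b for a, b in zip(t, v)))


# Toy usage with random stand-ins for word embeddings and image features.
rng = random.Random(0)
emb_dim, hid_dim, img_dim, shared_dim = 4, 5, 6, 3
lstm = TinyLSTM(emb_dim, hid_dim, rng)
W_gate = rand_matrix(img_dim, hid_dim, rng)
W_text = rand_matrix(shared_dim, hid_dim, rng)
W_image = rand_matrix(shared_dim, img_dim, rng)

words = [[rng.uniform(-1, 1) for _ in range(emb_dim)] for _ in range(7)]
image_vec = [rng.uniform(0, 1) for _ in range(img_dim)]

text_vec = lstm.encode(words)
score = relevance_score(text_vec, image_vec, W_gate, W_text, W_image)
print(f"confidence that the text answers the prompt: {score:.3f}")
```

In a real system the image vector would come from a pretrained image encoder and the weights would be trained on matching and mismatched text-image pairs; both are outside the scope of this sketch.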