The paper attacks the problem of describing a sequence of images from blog posts with a sequence of consistent sentences. To this end, the paper proposes to first retrieve, for each query image, the K=5 most similar images and their associated sentences from the training set. The main contribution of the paper lies in defining a way to select the most relevant of these sentences for the query image sequence so as to form a coherent description. For this, each sentence is first embedded as a vector, and the sequence of sentences is then modeled with a bidirectional LSTM. The output of the bidirectional LSTM is fed through a ReLU \cite{conf/icml/NairH10} and a fully connected layer, and is then scored for compatibility between image and sentence. Additionally, a local coherence model \cite{journals/coling/BarzilayL08} is included to enforce coherence between the selected sentences.
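To make the scoring step concrete, below is a minimal PyTorch sketch of the pipeline as I read it (sentence embeddings -> bidirectional LSTM -> ReLU -> fully connected layer -> image-sentence compatibility score). The class and parameter names (SentenceScorer, embed_dim, hidden_dim, img_dim), the dimensions, and the dot-product form of the compatibility score are my assumptions, not taken from the paper:

```python
# Reviewer's reconstruction of the sentence-scoring pipeline, not the
# authors' code: sentence embeddings -> bi-LSTM -> ReLU -> FC -> score.
import torch
import torch.nn as nn

class SentenceScorer(nn.Module):
    def __init__(self, embed_dim=300, hidden_dim=512, img_dim=4096):
        super().__init__()
        # Model the sequence of candidate sentences in both directions.
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              bidirectional=True, batch_first=True)
        # ReLU + fully connected layer, projecting the LSTM states
        # into the image feature space (dimension is an assumption).
        self.fc = nn.Linear(2 * hidden_dim, img_dim)

    def forward(self, sent_embs, img_feats):
        # sent_embs: (batch, seq_len, embed_dim) pre-embedded sentences
        # img_feats: (batch, seq_len, img_dim) features of the query images
        h, _ = self.bilstm(sent_embs)       # (batch, seq_len, 2*hidden_dim)
        s = self.fc(torch.relu(h))          # (batch, seq_len, img_dim)
        # Compatibility score: a dot product between projected sentence
        # states and image features (the exact scoring function used in
        # the paper may differ, e.g. a bilinear or learned score).
        return (s * img_feats).sum(dim=-1)  # (batch, seq_len)

# Usage with random stand-in features, e.g. a batch of 2 sequences of
# 5 images each:
scorer = SentenceScorer()
scores = scorer(torch.randn(2, 5, 300), torch.randn(2, 5, 4096))  # (2, 5)
```

Note that this sketch covers only the image-sentence compatibility term; the local coherence model \cite{journals/coling/BarzilayL08} would contribute an additional sentence-sentence term to the overall selection objective.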