The task is to predict feature norms – object properties such as is_yellow and is_edible for the word banana. The authors experiment with adding image recognition features on top of distributional word vectors.
An input word is used to retrieve 10 images from Google; these are passed through an ImageNet-trained classifier to obtain feature vectors, which are then averaged into a single vector representation for that word. A supervised model (partial least-squares regression) is then trained to predict feature-norm vectors from the input vectors (image-based, distributional, or a combination of the two). Including the image information helps considerably, especially for detecting properties like colour and shape.
https://i.imgur.com/4TYKhvm.png
Examples of predicted feature norms using the visual features.
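A minimal sketch of the pipeline described above, assuming the per-image CNN features have already been extracted. The array shapes, variable names, and the use of scikit-learn's PLSRegression for the partial least-squares step are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression


def word_image_vector(image_features):
    """Average the per-image CNN feature vectors retrieved for one word."""
    return np.mean(image_features, axis=0)


# Hypothetical training data (shapes are assumptions):
#   image_feats[word] -> (10, 4096) array, one row per retrieved image
#   dist_vecs[word]   -> (300,) distributional word vector
#   norm_vecs[word]   -> (k,) feature-norm target vector
def build_inputs(words, image_feats, dist_vecs):
    rows = []
    for w in words:
        img = word_image_vector(image_feats[w])           # visual representation
        rows.append(np.concatenate([img, dist_vecs[w]]))  # combined input
    return np.vstack(rows)


def train_feature_norm_model(words, image_feats, dist_vecs, norm_vecs,
                             n_components=50):
    X = build_inputs(words, image_feats, dist_vecs)
    Y = np.vstack([norm_vecs[w] for w in words])
    model = PLSRegression(n_components=n_components)
    model.fit(X, Y)
    return model
```

At test time, `model.predict(build_inputs(...))` gives predicted feature-norm vectors; the highest-scoring dimensions are the predicted properties (e.g. is_yellow, is_edible). Dropping either the image or the distributional part of the input corresponds to the image-only or text-only conditions.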