Deep Visual-Semantic Alignments for Generating Image Descriptions



Abstract—We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks (RNNs) over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state-of-the-art results in retrieval experiments on the Flickr8K, Flickr30K, and MSCOCO datasets. We then show that the generated descriptions outperform retrieval baselines on both full images and on a new dataset of region-level annotations. Finally, we conduct a large-scale analysis of our RNN language model on the Visual Genome dataset of 4.1 million captions and highlight the differences between image- and region-level caption statistics.
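As a rough illustration of the alignment idea in the abstract, the sketch below scores an image–sentence pair in a shared multimodal embedding space: each word vector is matched to its most compatible image-region vector, and the per-word maxima are summed. This is a minimal sketch only; the function name, the use of raw dot products as the similarity, and the toy vectors are illustrative assumptions, not the paper's exact formulation (which also involves a structured ranking objective over many image–sentence pairs).

```python
import numpy as np

def alignment_score(region_vecs, word_vecs):
    """Max-over-regions image-sentence score (illustrative sketch).

    region_vecs: (num_regions, d) image-region embeddings
    word_vecs:   (num_words, d)   word embeddings in the same space
    Each word is matched to its best-scoring region; the per-word
    maxima are summed into a single image-sentence compatibility score.
    """
    # sim[i, t] = dot(region_i, word_t) in the shared embedding space
    sim = region_vecs @ word_vecs.T          # (num_regions, num_words)
    # best-matching region per word, summed over the sentence
    return float(sim.max(axis=0).sum())

# Toy example: two region vectors, three word vectors in a 2-D space.
regions = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
words = np.array([[1.0, 0.0],
                  [0.0, 2.0],
                  [1.0, 1.0]])
print(alignment_score(regions, words))  # → 4.0
```

A higher score means more words find a strongly matching region, which is what the structured objective pushes up for true image–sentence pairs relative to mismatched ones.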
Index Terms—Image captioning, deep neural networks, visual-semantic embeddings, recurrent neural network, language model