
paper review: Multimodal Transformer for Unaligned Multimodal Language Sequences

2021-02-10




Multimodal Transformer for Unaligned Multimodal Language Sequences

A paper from a CCF class-A conference: ACL.

Summary

The authors investigate the association between voices and faces. The work builds on the VGGFace and VoxCeleb databases. Its main contributions can be summarized as follows:

  1. They introduce a CNN for binary and multi-way matching of faces to audio.
  2. They use different audio clips to identify the dynamic (currently speaking) person.
  3. They find that the CNN matches human performance on easy examples (faces of different genders) and exceeds human judgment on harder examples (faces of the same gender, age, and nationality).

Abstract (translated)

In this paper, we study the associations between faces and voices. Audiovisual integration, especially the integration of facial and vocal information, is an important area of neuroscience research. Results have shown that the overlapping information between the two modalities plays an important role in perceptual tasks such as speaker recognition. Through an online study on a new dataset we created, we confirm previous findings that people can associate unseen faces with the corresponding voices, and vice versa, with accuracy greater than chance. We computationally model the overlapping information between faces and voices and show that the learned cross-modal representation contains enough information to identify matching faces and voices, with performance similar to that of humans. We correlate its performance with specific demographic attributes and with features obtained from the visual or auditory modality alone. We release the dataset used in our study: audiovisual recordings of people reading short texts, with demographic annotations.

Research Objective

We examine whether faces and voices encode redundant identity information, and measure to what extent.

Background and Problems

  • Background

    • We humans often deduce various, albeit perhaps crude, pieces of information from the voices of others, such as gender, approximate age, and even personality.
  • Brief introduction to previous methods

    • Neuroscientists have observed that the multimodal associations of faces and voices play a role in perceptual tasks such as speaker recognition [19,14,44].
  • Problem Statement

    • Not explicitly stated in the introduction.

Main work

  1. We provide an extensive human-subject study, with both a larger participant pool and a larger dataset than prior studies.
  2. We learn the co-embedding of modal representations of human faces and voices, and evaluate the learned representations extensively, revealing unsupervised correlations to demographic, prosodic, and facial features.
  3. We present a new dataset of audiovisual recordings of speeches by 181 individuals with diverse demographic backgrounds, totaling over 3 hours of recordings, with demographic annotations.

Work limitations: the authors' own dataset is not large enough.

Related work

  • Human capability for face-voice association:

    • The study of Campanella and Belin [5] reveals that humans leverage the interface between facial and vocal information for both person recognition and identity processing.
  • Audiovisual cross-modal learning by machinery:

    • Nagrani et al. [25] recently presented a computational model for the face-voice matching task. While they treat it as a binary decision problem, this paper focuses more on the shared information between the two modalities and extracts it as a representation vector residing in a shared latent space, in which the task is modeled as a nearest-neighbor search (see the sketch below).
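To make the retrieval formulation concrete, here is a minimal sketch of nearest-neighbor matching in a shared latent space, written in Python with NumPy. The embedding dimensionality and the random vectors are hypothetical placeholders, not the paper's actual representations.

```python
import numpy as np

def nearest_neighbor_match(query_emb, gallery_embs):
    """Return the index of the gallery embedding closest to the query.

    Both sides are L2-normalized, so ranking by Euclidean distance is
    equivalent to ranking by cosine similarity.
    """
    query = query_emb / np.linalg.norm(query_emb)
    gallery = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    return int(np.argmin(np.linalg.norm(gallery - query, axis=1)))

# Hypothetical example: one voice embedding queried against five face
# embeddings, all assumed to live in the same 128-dim shared latent space.
rng = np.random.default_rng(0)
voice_emb = rng.normal(size=128)
face_embs = rng.normal(size=(5, 128))
print(nearest_neighbor_match(voice_emb, face_embs))
```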

Method(s)

  • Method 1: Study on Human Performance

    • 1. Participants were presented with photographs of two different models and a 10-second voice recording of one of the models. They were asked to choose one and only one of the two faces they thought would have a voice similar to the recorded one (V → F).
      2. Participant pool: Amazon Mechanical Turk (the participants fill out a survey about their gender, age, and so on).
      3. The results show that participants were able to match the voice of an unfamiliar person to a static facial image of the same person at better than chance levels, as the significance-test sketch below illustrates. (Results figure omitted.)
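A quick note on what "better than chance" means here: in a two-alternative forced-choice test the chance level is 50%, and a one-sided binomial test can check whether an observed success rate significantly exceeds it. A minimal sketch with SciPy, using made-up counts rather than the paper's actual numbers:

```python
from scipy.stats import binomtest

# Hypothetical counts (NOT the paper's numbers): 620 correct answers
# out of 1000 V -> F trials; chance level for a two-alternative
# forced choice is p = 0.5.
result = binomtest(k=620, n=1000, p=0.5, alternative="greater")
print(f"accuracy = {620 / 1000:.2f}, one-sided p-value = {result.pvalue:.2e}")
```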
  • Method 2: Cross-modal Metric Learning on Faces and Voices

      1. The attempt to learn cross-modal representations between faces and voices is inspired by the significance of the overlapping information in certain cognitive tasks like identity recognition, as discussed earlier.
      2. Dataset: the VoxCeleb dataset [26] is used to train the network. From each clip, the first frame and the first 10 seconds of audio are used, as the beginning of a clip is usually well aligned with the beginning of an utterance.
      3. Network: VGG16 [33] for faces and SoundNet [2] for audio, which have shown sufficient model capacity while allowing for stable training in a variety of applications.
      4. The results show that the learned representations match faces and voices at a level comparable to human performance (see the Conclusion); a training sketch follows this list. (Results figure omitted.)
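To make the training setup concrete, here is a minimal PyTorch sketch of learning such a co-embedding with a triplet loss. The review does not spell out the exact objective, so the loss choice, feature dimensions, and hyperparameters below are assumptions for illustration, with small MLPs standing in for the VGG16 and SoundNet backbones.

```python
import torch
import torch.nn as nn

EMB_DIM = 128  # assumed dimensionality of the shared latent space

# Small MLPs stand in for the VGG16 (face) and SoundNet (voice) backbones;
# each maps precomputed modality features into the shared embedding space.
face_encoder = nn.Sequential(nn.Linear(4096, 512), nn.ReLU(), nn.Linear(512, EMB_DIM))
voice_encoder = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, EMB_DIM))

def embed(encoder, x):
    # L2-normalize so distances in the shared space are comparable.
    return nn.functional.normalize(encoder(x), dim=1)

triplet_loss = nn.TripletMarginLoss(margin=0.2)
params = list(face_encoder.parameters()) + list(voice_encoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

# One training step on a dummy batch (random features for illustration).
faces = torch.randn(32, 4096)   # stand-in face features
voices = torch.randn(32, 1024)  # stand-in audio features

anchor = embed(voice_encoder, voices)     # voice of person i
positive = embed(face_encoder, faces)     # matching face of person i
negative = positive[torch.randperm(32)]   # face of a (likely) different person

loss = triplet_loss(anchor, positive, negative)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"triplet loss: {loss.item():.4f}")
```

Pulling matched face-voice pairs together while pushing mismatched pairs apart is what makes the nearest-neighbor retrieval described above meaningful at test time.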

Conclusion

  • Main contributions
  1. First, a human-subject study establishing a baseline for how well people perform such matching tasks.
  2. Second, a machine study using deep neural networks, demonstrating that machines perform on a par with humans.
  • Weak points
  1. However, we emphasize that, similar to lie detectors, such associations should not be used for screening purposes or as hard evidence. Our work suggests the possibility of learning the associations by referring to a part of the human cognitive process, but not their definitive nature, which we believe would be far more complicated than it is modeled as in this work.
  • Further work
  1. Not discussed in the paper.


Takeaways for me

  • The related-work and introduction sections of this paper are worth studying, because they are readable and well reasoned. However, I found the method and experiment sections hard to follow.
  • The paper introduces its own dataset, but the published version is not available for download, so I could not retrieve it.
  • Chinese-language commentary: https://blog.csdn.net/weixin_44390691/article/details/105182181?utm_medium=distribute.pc_relevant.none-task-blog-title-2&spm=1001.2101.3001.4242

Source: https://blog.csdn.net/liupeng19970119/article/details/113784089
