OceanNet: Multimodal Emotion Analysis on Social Media with Deep Learning

Research dissertation submitted for the completion of the B.Sc. (Hons) in Computer Science at The University of Manchester.

Abstract

"Massive amounts of multimodal information are available on social media platforms as people generally post images accompanied by textual descriptions and hashtags. The automatic extraction and analysis of affective knowledge from so much data can provide businesses, governments and individuals with insights into the market perception and audience feeling of their actions, products and services. Hence, we propose a Deep Learning approach to processing the visual and textual components of posts extracted from Tumblr in order to predict the emotions attached as hashtags by users. We perform transfer learning on the state-of-the-art InceptionV3 CNN for image classification to capture the highly abstract representations of emotions from pictures, while an LSTM network grasps complex temporal patterns of affects from the text sequences of these posts. Accuracy scores of up to 62.03% and 79.22% have been achieved on the problems of visual and textual emotion recognition, respectively, over the basic model of six emotions. Furthermore, we explore the fusion of these modalities in a multi-input network that combines the single-modality architectures and achieves 75.13% accuracy using this method. An in-depth discussion of the per-class performance of these models is provided as part of the evaluation. Finally, the research on these AI tools is integrated into the OceanNet framework. This incorporates a visualisation tool for market analytics platforms that produces real-time visualisations of affective statistics from social media information."

Motivation
  • Although the state of the art in sentiment analysis exceeds human-level performance, standing at about 90% accuracy, its binary classification mechanism only captures how an entity is perceived, not why it is perceived in a certain manner.
    This motivates the need to identify the nuances of feeling conveyed by human emotions, which would provide explicit, actionable insights.
  • In human interactions, only part of the message is transmitted verbally, as the vocal and visual elements often hold more weight.
    We account for the textual and visual modalities using a multi-input network.
  • The proliferation of social media has fuelled recent progress in sentiment analysis, as its massive streams of subjective textual and visual information have encouraged the automatic collection of affective corpora.
    We compile a multimodal dataset of Tumblr posts comprising text and pictures, and aim to predict the emotion hashtags attached by users.
Summary
  • We collect a multimodal dataset using the Tumblr API, fetching text posts accompanied by images that carry an emotion hashtag (e.g. #amazed); a collection sketch follows this list.
  • We explore the relevance of the textual and visual modalities in expressing emotions by training a set of DL models to perform single- and multi-modal emotion classification:
    • For textual emotion analysis, we compose a module comprising a single LSTM layer of 1024 units, followed by a softmax layer that produces a probability distribution over the set of emotions (see the text-model sketch after this list). This topology is the result of a hyperparameter grid search over the number of layers and neurons per layer.
    • For visual emotion analysis, we perform transfer learning on InceptionV3, motivated by its ability to discover intricate structure in training data, as our problem requires learning representations at a high level of abstraction (see the transfer-learning sketch after this list).
    • For multimodal emotion analysis, we concatenate the outputs of the single-modality networks, on top of which a softmax layer produces a probability distribution over the emotion spectrum (see the fusion sketch after this list).
  • The accuracy scores achieved by these networks are 68%, 41% and 62% for textual, visual and multimodal emotion recognition, respectively. The multimodal result contradicts the findings of DeepSentiment, the state of the art in Tumblr emotion recognition, which reported an improvement when using images alongside text.
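
The collection step can be reproduced through the Tumblr API v2. Below is a minimal sketch using the pytumblr client: the OAuth credentials are placeholders, the label set is an assumption built around the one documented hashtag (#amazed) and the six-emotion model, and the field access follows the Tumblr post format, so the exact hashtags and filtering rules may differ from the dissertation's pipeline.

    import pytumblr

    # Assumed label set: six basic emotions, including the documented #amazed tag.
    # The dissertation's exact hashtags may differ.
    EMOTIONS = ["amazed", "angry", "happy", "sad", "scared", "surprised"]

    # Placeholder OAuth credentials for the Tumblr API v2.
    client = pytumblr.TumblrRestClient(
        "CONSUMER_KEY", "CONSUMER_SECRET", "OAUTH_TOKEN", "OAUTH_SECRET"
    )

    dataset = []
    for emotion in EMOTIONS:
        before = None  # timestamp cursor used by /tagged for pagination
        for _ in range(50):  # pages to fetch per tag; tune as needed
            posts = client.tagged(emotion, before=before) if before else client.tagged(emotion)
            if not posts:
                break
            for post in posts:
                # Keep only posts that carry both an image and some text.
                photos = post.get("photos")
                text = post.get("caption") or post.get("summary", "")
                if photos and text:
                    dataset.append({
                        "label": emotion,
                        "text": text,
                        "image_url": photos[0]["original_size"]["url"],
                    })
            before = posts[-1]["timestamp"]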
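For the textual module, a minimal Keras sketch consistent with the topology described above (a single 1024-unit LSTM feeding a softmax) might look as follows; the vocabulary size, sequence length and embedding dimension are illustrative assumptions, not values from the dissertation.

    from tensorflow.keras import layers, models

    NUM_EMOTIONS = 6      # basic model of six emotions
    VOCAB_SIZE = 20_000   # assumed tokenizer vocabulary size
    SEQ_LEN = 50          # assumed maximum post length in tokens

    text_input = layers.Input(shape=(SEQ_LEN,), name="text_tokens")
    x = layers.Embedding(VOCAB_SIZE, 128)(text_input)  # assumed embedding size
    x = layers.LSTM(1024)(x)                           # single LSTM layer of 1024 units
    text_output = layers.Dense(NUM_EMOTIONS, activation="softmax")(x)

    text_model = models.Model(text_input, text_output, name="text_emotion")
    text_model.compile(optimizer="adam", loss="categorical_crossentropy",
                       metrics=["accuracy"])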
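The transfer-learning setup on InceptionV3 can be sketched in Keras as below: the ImageNet weights are reused, the convolutional base is frozen, and a fresh softmax head is trained on the emotion labels. Whether (and how far) the dissertation later unfreezes the top Inception blocks for fine-tuning is not stated here, so the fully frozen base is an assumption.

    from tensorflow.keras import layers, models
    from tensorflow.keras.applications import InceptionV3

    NUM_EMOTIONS = 6

    # Pretrained ImageNet base with global average pooling in place of the
    # original classification head.
    base = InceptionV3(weights="imagenet", include_top=False,
                       pooling="avg", input_shape=(299, 299, 3))
    base.trainable = False  # assumption: freeze the base, train only the new head

    image_output = layers.Dense(NUM_EMOTIONS, activation="softmax")(base.output)

    image_model = models.Model(base.input, image_output, name="visual_emotion")
    image_model.compile(optimizer="adam", loss="categorical_crossentropy",
                        metrics=["accuracy"])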
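Finally, the multimodal fusion can be sketched by merging the two networks above (reusing text_model, base and NUM_EMOTIONS from the previous sketches). The summary leaves open whether the concatenated "outputs" are the class probabilities or the penultimate feature vectors; this sketch assumes the latter, the more common late-fusion reading.

    from tensorflow.keras import layers, models

    # Penultimate activations of the single-modality networks.
    text_features = text_model.layers[-2].output   # 1024-d LSTM state
    image_features = base.output                   # 2048-d pooled InceptionV3 features

    merged = layers.concatenate([text_features, image_features])
    fused = layers.Dense(NUM_EMOTIONS, activation="softmax")(merged)

    multimodal_model = models.Model(inputs=[text_model.input, base.input],
                                    outputs=fused, name="multimodal_emotion")
    multimodal_model.compile(optimizer="adam", loss="categorical_crossentropy",
                             metrics=["accuracy"])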