Joint Image-Text Classification Using a Transformer-Based Architecture

Patrick Wu and Walter R. Mebane Jr. (University of Michigan)

Abstract: The use of social media data in political science is now commonplace. Social media posts such as Tweets are usually multimodal, comprising, for example, both text and images. For instance, recent work in election forensics uses Twitter data to capture people's reports of their personal experiences ("incidents") during the 2016 U.S. presidential election (Mebane et al. 2018). That work uses automated text-based classification, but the classifiers do not use all available content---in particular, they ignore images. Yet images can provide important context for the text. For instance, some Tweets feature pictures of long lines or of smiling voters wearing "I Voted" stickers. Human coders use the text and images jointly when determining whether something is an election incident, but the computer does not use the images. Two-stage ensemble classifiers have been developed to classify text and images together: probabilities are generated separately for the text and for the image being an observation of interest, and an overall probability is then calculated using some predefined function (see, e.g., Zhang & Pan 2019). However, many posts that contain both images and text are recognized as observations of interest only when the text and images are considered simultaneously and synergistically. We propose a joint image-text classifier using a transformer-based architecture. Its design is loosely inspired by the encoder-decoder architecture used in image captioning (Xu et al. 2016; Krasser 2020) and by the transformer (Vaswani et al. 2017). It dispenses with recurrence entirely, using a series of attention modules instead of an LSTM or RNN (see Chang & Masterson 2019 for background on recurrence-based neural networks). As far as we know, this is the first political methodology project to use transformers; it is also one of the first uses of transformers in a multimodal context. The baseline architecture is as follows. We first initialize word embeddings of the Tweets using word2vec and extract image features through a pretrained backbone network (a process known as transfer learning), such as ResNet-50 (He et al. 2015) or MobileNet v2 (Sandler et al. 2016). The word embeddings serve as the values, keys, and queries of the first multiheaded self-attention block. The next multiheaded attention block attends to the self-attended text features from the previous block, using the image features as the values and keys. After processing the output through a feedforward layer, we obtain the transformer block output. This output can then be fed into an identical transformer block, and the process is repeated M times. After M repetitions, the output is fed through a fully connected layer that gives us classification scores. Variations on this architecture include processing the image features through their own multiheaded self-attention module, using varied numbers of heads, and experimenting with the depth, M, of the transformer. We apply this joint image-text transformer classifier to a set of Tweets hand-labeled for observations of incidents in the 2016 U.S. general election: approximately 10,000 Tweets have both text and an image, while about another 12,000 have only text. The data have both a binary classification objective (Tweet of interest or not) and a multiclass classification objective (specific incident types). We also aim to modify the above architecture so that it can classify, within one pipeline, both Tweets that have only text and Tweets that have both text and an image.
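
The sketch below illustrates, in PyTorch, the kind of transformer block described in the abstract: text self-attention, followed by cross-attention that uses the image features as keys and values, a feedforward layer, M stacked repetitions, and a fully connected classification head. The embedding dimension, number of heads, residual connections with layer normalization, and mean pooling are illustrative assumptions for a minimal, runnable example, not the authors' exact implementation.

```python
# Minimal sketch of the joint image-text transformer classifier described above.
# Dimensions and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class JointImageTextBlock(nn.Module):
    """One block: text self-attention, cross-attention over image features, feedforward."""

    def __init__(self, d_model=256, n_heads=4, d_ff=512):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        # Residual connections and layer norm are standard transformer choices,
        # assumed here rather than taken from the abstract.
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, text, image):
        # Self-attention: queries, keys, and values are all the text features.
        attended, _ = self.self_attn(text, text, text)
        text = self.norm1(text + attended)
        # Cross-attention: queries are the self-attended text features,
        # keys and values are the image features from the backbone.
        attended, _ = self.cross_attn(text, image, image)
        text = self.norm2(text + attended)
        # Position-wise feedforward layer.
        text = self.norm3(text + self.ff(text))
        return text


class JointImageTextClassifier(nn.Module):
    """Stack M identical blocks, pool over tokens, and output classification scores."""

    def __init__(self, d_model=256, n_heads=4, M=2, n_classes=2):
        super().__init__()
        self.blocks = nn.ModuleList(
            [JointImageTextBlock(d_model, n_heads) for _ in range(M)]
        )
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, text, image):
        # text:  (batch, n_tokens, d_model)  word2vec embeddings projected to d_model
        # image: (batch, n_regions, d_model) backbone features (e.g., ResNet-50),
        #        flattened over spatial locations and projected to d_model
        for block in self.blocks:
            text = block(text, image)
        pooled = text.mean(dim=1)        # simple mean pooling over tokens (an assumption)
        return self.classifier(pooled)   # classification scores (logits)


# Example with random tensors standing in for embedded Tweet text and image features.
model = JointImageTextClassifier()
scores = model(torch.randn(8, 30, 256), torch.randn(8, 49, 256))
print(scores.shape)  # torch.Size([8, 2])
```

The variations mentioned in the abstract (a self-attention module for the image features, different numbers of heads, and different depths M) correspond to adding an attention block over `image` before the loop and changing the `n_heads` and `M` arguments.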

