Complaint and Severity Identification From Online Financial Content

Apoorva Singh, Rohan Bhatia, and Sriparna Saha, Senior Member, IEEE

The automatic detection of financial complaints (FINCORP) can benefit businesses and online merchants: compared with relying on manually tagged complaints, they can use this information to monitor and address issues and effectively route them to appropriate teams. This can also promote greater transparency and accountability when dealing with consumer financial products and services, strengthening the firm’s brand value. In linguistic studies, complaints have been classified into severity categories based on the level of risk the complainant is prepared to accept. Furthermore, since emotions influence every speech act, an individual’s emotional state considerably impacts the complaint expression. In this article, we introduce the FINCORP resource, a collection of annotated complaints arising between financial institutions and consumers, expressed in English on Twitter. The dataset has been enriched with the associated emotion, sentiment, and complaint severity classes. It comprises 3149 complaint and 3133 noncomplaint instances spanning ten domains (e.g., credit cards, mortgages). For a comprehensive evaluation of our dataset, we develop a multitask framework for complaint detection and severity classification, with emotion recognition (ER) and sentiment classification as additional tasks, and compare it with several existing baselines. The corpus and code are available here: https://github.com/RohanBh23/FINCORP.


Index Terms: Complaint corpus, complaint detection, complaint severity detection, deep learning, finance complaints, multitask learning.

I. INTRODUCTION
ARTIFICIAL intelligence (AI) is transforming how people and organizations access and manage their finances. While the shift from traditional financial services to digital finance was already underway prepandemic, the pandemic accelerated the process as stay-at-home requests became the new normal for financial institutions, and customers decided to seek more self-service options. AI and machine learning in finance cover everything from fraud detection to chatbot assistants and task automation. Beyond these mainstream applications, addressing digital service complaints could be a significant commercial application for AI in finance. Retail banks, government institutions, investment managers, and insurance providers are some of the financial bodies that deal with a large number of online consumers and are more likely to be involved in financial grievances [1], [2]. Compared with manually tagged complaints, automated detection of financial complaints (FINCORP) about fraudulent transactions, delayed fund transfers, lousy customer service, etc., could help these organizations direct them to appropriate teams. This can further help promote fairness and transparency while dealing with loans, credit cards, and other consumer financial products and services, thus improving customer experience and increasing the organization's brand value.
In linguistics, complaining is defined as an individual's statement of dissatisfaction with an enterprise, commodity, or event [3]. Complaints have been grouped into various degrees of severity based on the emotional intensity of the complaint, the amount of face-threat that the complainer is willing to undertake, and the complaint's motive [4], [5]. The objective of complaining could be to voice dissatisfaction, seek explanations, or both. Identifying complaints and associated severity levels in natural language is crucial for downstream application developers such as customer service chatbots [6] and commercial organizations to improve their customer support capabilities by identifying and resolving complaints [7].
Social media sites such as Twitter and Facebook carry millions of messages exchanged daily, which are multifaceted and invaluable resources for investigating and understanding people's perceptions and evaluating the data generated by participants. The social aspect of Twitter, combined with its huge volume, has transformed it into a cost-effective data repository for exploring and evaluating activities and user opinions. Moreover, due to the micro-blogging format of Twitter, users find it more convenient to post their complaints on Twitter and "tag" the Twitter account they want to address, rather than emailing them with their grievances. Organizations respond to complaints promptly, proactively, and in real time to improve their brand image and social media visibility. According to a study conducted in 2020, the number of consumers who chose to vent their complaints via digital platforms rather than by phone or in person has tripled in the last three years; good complaint handling remains associated with high levels of brand loyalty, while mediocre complaint handling costs businesses dearly. As a result, the crux of this study is the analysis of Twitter-based complaints, focusing on the time-sensitive domain of financial grievances.
Previous studies [10] on emotion recognition (ER) have found that emotion provides a deeper understanding of the consumer's mind when combined with sentiment. This relationship between emotion and sentiment urges us to examine the sentiment and emotion of customers while studying complaints. Besides, previous studies on complaint detection are based on skewed datasets, are restricted to generic domains, and do not consider the correlated tasks (severity, emotion, and sentiment) while detecting complaints. Our current work aims to bridge this gap and focuses explicitly on the financial grievances originating between consumers and financial organizations.
The following are the major contributions of our proposed work.
1) We curate a new corpus, FINCORP, for aiding complaint identification (CI) and complaint severity classification research.
2) We perform good quality annotations on the FINCORP dataset for solving four tasks: complaint detection, severity-level detection, ER, and sentiment analysis (SA).
3) We present a framework for joint learning of: a) binary complaint classification; b) complaint severity-level classification (CS); c) ER; and d) SA. CI and CS are the primary tasks in our multitask framework, whereas ER and SA are considered additional tasks. According to the experimental results, the proposed approach surpasses the baselines and state-of-the-art approaches.

The rest of this article is structured as follows. We present a brief overview of the existing literature in Section II. In Section III, we go over the specifics of the proposed dataset. In Section IV, we describe our proposed technique for the multitask experiments. The experimental setting and results analysis are covered in Section V. Finally, we conclude in Section VI with a discussion of future work prospects.

II. RELATED WORK
This section discusses some prior works on the CI task and the publicly available annotated complaint corpora.

A. Complaints in Linguistic Studies
Previous work in linguistics has categorized complaints based on their severity and directness. Complaints were classified into five categories by Olshtain and Weinbach [3]: 1) below reproach; 2) statement of disapproval or dissatisfaction; 3) explicit complaint; 4) allegation; and 5) warning, immediate threat. According to Trosborg [4], complaints were categorized into four fine-grained severity classes: 1) no explicit reproach; 2) disapproval; 3) accusation; and 4) blame. Quite recently, Kakolaki and Shahrokhi [5] divided complaints into very direct, moderately direct, and indirect. Clear violations of expectations are referred to as direct complaints (i.e., very direct and moderately direct). On the other hand, indirect complaints do not directly specify or imply a breach of expectations. Furthermore, the contrast between very direct and moderately direct is that the former underlines the obligation of the complaint recipient, while the latter does not.

B. Multitask Learning
Previous findings [10], [12], [13] have confirmed the efficiency of multitask systems in simultaneously learning multiple associated tasks. An individual's emotional state and sentiment have a decisive impact on the intended content [14]. Along with sentiment, emotion offers a deeper understanding of the consumer's mindset [10], [15]. The correlation between emotion and sentiment motivates us to consider complainants' sentiment and emotion while analyzing complaints. Moreover, sentiment and emotion play significant roles in human interactions and thereby contribute toward building efficient and versatile AI-based systems.

C. Complaint Identification
CI on social media is a time-consuming and challenging process that requires identifying complaints from disparate and noisy text samples with character constraints, unpredictable abbreviations, and sarcastic expressions. Earlier works on complaint detection majorly focused on distinguishing complaints from noncomplaints using hand-crafted feature-based machine learning models [8], a semi-supervised learning approach [16], deep learning models based on transformer networks [11], and a few multitask models that leveraged polarity and emotion information for the complaint mining task [12], [17], [18]. Besides complaint classification, studies have focused on product hazards and risks [19] and the propensity of escalation [20]. Recently, Jin and Aletras [21] evaluated the severity level of complaints by training several transformer-based networks paired with linguistic information to predict severity levels in complaints.

D. Complaint Datasets
The Complaints dataset was published in [8]. The authors selected 93 customer service handles from Twitter and sampled original tweets addressed to these accounts. Only tweets that received a response from the customer support accounts were considered. A total of 1971 tweets were collected using this method. In addition, 1478 tweets were sampled to guarantee a diverse and representative range of tweets. Overall, the dataset consists of 2214 noncomplaints and 1235 complaints in English, grouped into nine domains (i.e., Apparel, Cars, Electronics, Food, Retail, Service, Software, Transport, and Other). Recently, Jin and Aletras [21] extended the Complaints dataset with complaint severity classes and analyzed the effect of severity levels on CI. The Product Review corpus [9] is a Hindi-language product review corpus. This corpus is a collection of product reviews posted on the retail website Amazon and in the comment section of the video-sharing platform YouTube. The corpus comprises 3711 instances, with 3145 being noncomplaints and 566 being complaints, further grouped into five domains (i.e., Book, Headphones, Phone, Watch, and Miscellaneous). Many financial institutions and services have a presence on Twitter, with specific customer care accounts. Due to the presence of such accounts and the micro-blogging format of Twitter, users find it more convenient to post their complaints on Twitter and seek redressal there rather than through the time-consuming options of meeting in person or emailing the organizations. In this work, we introduce the Twitter-based FINCORP dataset; unlike the other two datasets, which contain generic complaints from various domains, our dataset focuses on complaints arising between financial institutions and consumers. In contrast to the existing Complaints and Product Review datasets, FINCORP is a balanced dataset with nearly equal percentages of complaint (50.13%) and noncomplaint (49.87%) samples. Another unique feature of our dataset is that, in addition to the CI axis, we have expanded the dataset along three other closely related axes (severity-level classification, sentiment classification, and ER) to better understand the nuances of the CI task. To the best of our knowledge, this is the first gold standard complaint dataset specifically dealing with financial complaints. Table I shows a detailed description of various existing and proposed CI-related datasets. Table II reports a comparative analysis of the existing works in the CI domain and the proposed work.

III. CORPUS DESCRIPTION
The existing complaint datasets [8], [9] contain only binary complaint labels. To encourage CI and severity prediction research in the financial domain, we curate the FINCORP dataset labeled with complaint labels, severity levels, emotion, and sentiment classes. We chose Twitter as the data source as it is a forum that allows for a lot of self-expression, and users can communicate directly with each other or with corporate support handles. Tweets are freely available and are a popular choice for data analysis in various other similar tasks such as sentiment and emotion analysis (see [10]). Here, we discuss the details of the FINCORP dataset.

A. Data Collection
Many companies use Twitter to offer customer support and deal with customer complaints. Retail banks, government institutions, payment processors, investment managers, and insurance providers are the financial bodies that mainly deal with a large number of consumers and are more likely to be involved in financial grievances. An analysis of the different services these financial organizations deliver provided the relevant domains to which FINCORP tweets can belong. Based on this analysis, strings containing the names of these institutions combined with the domains relevant to them were used as keywords for tweet collection. For instance, active Twitter handles of retail banks such as Citibank, State Bank of India, and HDFC Bank were combined with retail banking-related keywords. Hence, the keyword creation for collecting tweets involved two tasks, as listed below.
1) Analyzing common domains of financial complaints (for example, credit card, transaction failure).
2) Analyzing which financial institutions and services those complaints are directed to (for example, retail banks, regulatory bodies, netbanking services).

Retail banking services include the provision of credit cards, debit cards, and loans to consumers; thus, we include all of these as keywords. Payment processors are responsible for handling transaction failures, which are usually a matter of great concern and urgency for consumers, thus inducing a high number of complaints and demands for refunds. Government regulatory bodies are approached with complaints regarding taxes, loans, and interest rates, and in situations of unsatisfactory service by other financial bodies such as retail banks. Investment managers and insurance providers receive complaints regarding their financial policies and investments. Many consumers complain about the customer service provided by financial bodies of all types. Many financial services the organizations provide are digitized and offered as mobile applications, whose performance again attracts complaints.
In addition, generic keywords related to financial domains were used to balance the dataset with noncomplaints. Marcellis-Warin et al. [1] discuss the increasing importance and abundance of Twitter discussions related to stocks, the economy, salary, and other political discussions regarding the financial market in India. A complete list of all the keywords is provided in our GitHub repository.
We use the Tweepy library to fetch tweets containing specific keywords and phrases using its Cursor class. A Twitter Developer account is required to use this library, through which Twitter provides the API tokens necessary to create the Tweepy API object. For preprocessing, we anonymize all the usernames and URLs in the tweets and substitute them with placeholder tokens. Using langid.py (see [22]), tweets were filtered for English, and retweets were discarded. The Unicode emojis in the tweets were converted into emoji short text with the help of a Python module called Emoji. The Python library TextBlob was used to correct the spellings of many improperly spelled words.
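For illustration, the following is a minimal sketch of keyword-based collection and preprocessing along these lines, assuming Tweepy v4 credentials; the credentials and keyword strings are placeholders, and this is not the exact script used to build FINCORP.

```python
import re
import tweepy          # Twitter API client
import langid          # language identification (langid.py)
import emoji           # Unicode emoji -> ":short_text:" conversion
from textblob import TextBlob  # simple spelling correction

# Placeholder credentials and keywords (not the actual ones used for FINCORP).
auth = tweepy.OAuth1UserHandler("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)
KEYWORDS = ['"credit card" "transaction failed"', '"debit card" refund']

def preprocess(text: str) -> str:
    text = re.sub(r"@\w+", "@USER", text)            # anonymize user mentions
    text = re.sub(r"https?://\S+", "HTTPURL", text)  # anonymize URLs
    text = emoji.demojize(text)                      # emojis -> emoji short text
    return str(TextBlob(text).correct())             # crude spelling correction

tweets = []
for query in KEYWORDS:
    for status in tweepy.Cursor(api.search_tweets, q=query + " -filter:retweets",
                                tweet_mode="extended").items(200):
        text = status.full_text
        if langid.classify(text)[0] == "en":         # keep English tweets only
            tweets.append(preprocess(text))
```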
We collected 6945 tweets to be considered for annotation using this technique. We manually sorted the tweets into ten domains (e.g., customer service, financial policies, salary, and transaction failure) based on the financial product type and the expressed concerns. Moreover, we took motivation from financial fraud detection work [2], which is based on Federal Bureau of Investigation (FBI) reports on financial crimes, to sort tweets into high-level domains (credit card, debit card, investments, loans, retail banking). We have done this to enable complaint analysis at a more granular level. In related studies on complaint analysis, tweets and product reviews were similarly sorted based on domains (see [8], [9]).

B. Data Annotation
We selected three graduate students (recruited from the authors' institution) to annotate the tweets with suitable complaint/noncomplaint classes, severity levels, and emotion and sentiment tags. Before the annotation process began, the guidelines for annotation were provided. Some sample cases outside the corpus were provided to train the annotators for the annotation work.

Annotators were selected using the following criteria.
1) They are graduate students fluent in English and have adequate domain knowledge and expertise in developing supervised corpora.
2) They are prepared to read, interpret, and annotate Twitter instances according to the provided guidelines.

A few examples were provided before annotation for the severity-level and emotion annotation tasks, as shown in Tables III and IV, respectively. Initially, we construct a binary annotation task to determine whether or not a tweet involves a complaint. Tweets are most often brief and convey a single idea. As a result, if a tweet includes a single complaint speech act, we consider the overall tweet to be a complaint. We use the complaint definition from earlier linguistic research [23] for complaint annotation: a complaint presents a state of affairs that breaches the writer's clear expectation.
We use the complaint severity levels (no explicit reproach, disapproval, accusation, and blame) defined in [4], as it is considered a benchmark in linguistic studies. According to [4], the classes are mutually exclusive. In the "No explicit reproach" class, the reason for the complaint is not specified, and the complaint is not harsh. The "Disapproval" class conveys clear negative emotions, including discontent, anger, contempt, or disapproval. The primary distinction between "Accusation" and "Blame" is that in the latter the complainant assumes the complainee is responsible for the violation. For the emotion annotation of the FINCORP dataset, we consider Ekman et al.'s [24] six basic emotions (anger, disgust, fear, happiness, sadness, and surprise). Other than these six basic emotions, we add a category "Other" to represent tweets that do not fall under the scope of the six emotion classes. In cases where the emotion expressed in the tweet could fall into more than one emotion class, the annotators label it according to the strongest emotion associated with the tweet. For the sentiment annotation, we consider three sentiment classes (negative, neutral, and positive).
The final complaint, severity-level, emotion, and sentiment labels are selected using the majority voting technique. Tweets with no common severity level, emotion, or sentiment label were excluded from the final annotated dataset.
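As a concrete illustration, a minimal majority-voting rule over three annotators could look like the following sketch; the label strings are examples, and ties (no common label) lead to the tweet being dropped, as described above.

```python
from collections import Counter
from typing import Optional, Sequence

def majority_label(annotations: Sequence[str]) -> Optional[str]:
    """Return the label chosen by at least two of the three annotators, else None."""
    label, count = Counter(annotations).most_common(1)[0]
    return label if count >= 2 else None

# Example: severity labels given by three annotators for one tweet.
print(majority_label(["disapproval", "disapproval", "accusation"]))  # -> "disapproval"
print(majority_label(["blame", "accusation", "disapproval"]))        # -> None (tweet excluded)
```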
Finding agreement among the different annotators is vital in any annotation task requiring two or more annotators to obtain a credible annotated dataset. One such metric is the Fleiss' kappa [25] score. We computed the Fleiss' kappa score on the annotated dataset to assess interrater agreement among the three annotators. We attained agreement scores of 0.83, 0.69, 0.68, and 0.82 on the complaint, severity-level, emotion, and sentiment tasks, respectively, indicating that the annotations are of good quality.
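For reference, Fleiss' kappa for one task can be computed with statsmodels from a (tweets x annotators) label matrix; the array below is a made-up toy example, not FINCORP annotation data.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy matrix: rows are tweets, columns are the three annotators' labels
# (0 = noncomplaint, 1 = complaint). Replace with the real annotation table.
ratings = np.array([
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
])
table, _ = aggregate_raters(ratings)   # counts of each category per tweet
print(fleiss_kappa(table))             # agreement score in [-1, 1]
```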

C. Statistics
The FINCORP dataset comprises 6282 tweets with the corresponding complaint, severity, emotion, and sentiment labels. It consists of 3133 tweets in the noncomplaint category (49.87%) and 3149 tweets in the complaint category (50.13%). Distributions of sentiment and emotion labels across the dataset are shown in Figs. 1 and 2, respectively. Table V presents the domainwise data distribution statistics. Table VI shows sample instances along with the corresponding complaint, severity, emotion, and sentiment labels. As shown in Fig. 2, most complaint-type tweets have a negative polarity, with a few being neutral.

IV. METHODOLOGY
In this section, we outline our problem and discuss the details of the baselines used for the evaluation of the dataset.

A. Problem Definition
We aim to detect complaints and associated severity levels from tweets in a multitask scenario. Each tweet instance is classified with respect to four tasks, i.e., two main tasks, CI and CS, and two auxiliary tasks, ER and SA. The training data consisting of $N$ tweet instances are represented as $\{[t, c, s, e, p]_i\}_{i=1}^{N}$, where $t$ denotes the tweet, $c$ denotes the complaint label, $s$ is the severity label, $e$ is the emotion class label, and $p$ is the sentiment class label.
Our multitask learning framework's objective is to maximize the function $f$:
$$ f = \operatorname*{argmax}_{\theta} \sum_{i=1}^{N} P(c_i, s_i, e_i, p_i \mid t_i; \theta) $$
where $t_i$ is the input tweet whose complaint label ($c_i$), severity label ($s_i$), emotion label ($e_i$), and sentiment label ($p_i$) are to be predicted, and $\theta$ denotes the model's parameters that we aim to optimize.

B. Multitask Learning With RoBERTa (MTL-RoBERTa)
Based on the work in [26], we use multitask learning (MTL) with hard parameter sharing to boost binary complaint classification using the severity-level, emotion, and sentiment tasks. MTL with hard parameter sharing comprises a single shared encoder that is modified by all the tasks and is followed by task-specific branches. MTL allows two or more tasks to be learned simultaneously by sharing model knowledge and parameters, which lets the binary complaint classification benefit from severity-level, emotion, and sentiment task features. The overall MTL-RoBERTa framework is shown in Fig. 3.
1) Shared Transformer Encoder: Bidirectional encoder representations from transformers (BERT) is a multilayer bidirectional transformer encoder [27] based on the original work in [28]. It is trained with masked language modeling, which involves arbitrarily masking certain tokens from the input and predicting them based only on context. RoBERTa [29] is an extension of BERT that has exhibited enhanced performance on social media analysis tasks [30], [31] by training on more data with varying hyperparameters.
The proposed MTL-RoBERTa model uses a single encoder layer that is shared and modified by all four tasks, with independent feed-forward networks for each task. We use the pretrained RoBERTa model to embed the words of each tweet instance. The embedding representations are then fed to the shared transformer encoder (the "roberta-base" model supports an inbuilt encoder as an attribute) to generate the instance's latent representation.
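For illustration, the shared encoding step can be sketched with the Hugging Face transformers library roughly as follows; this is a simplified sketch of how a tweet is embedded with the pretrained "roberta-base" model, not the exact implementation, and the example tweet is invented.

```python
from transformers import RobertaTokenizer, TFRobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
encoder = TFRobertaModel.from_pretrained("roberta-base")   # shared across all four tasks

tweet = "Why is my refund for the failed transaction still pending after two weeks?"
inputs = tokenizer(tweet, padding="max_length", truncation=True,
                   max_length=50, return_tensors="tf")
outputs = encoder(inputs["input_ids"], attention_mask=inputs["attention_mask"])
shared_repr = outputs.last_hidden_state    # (1, 50, 768) latent representation of the tweet
```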
2) Task-Specific Layers: The RoBERTa encoder output corresponding to each task-specific input sentence is fed to four task-specific BiGRU [32] layers that aid in extracting the contextual information from the tweet instances. The bidirectional GRUs (with 128 hidden units) capture contextual information from both the forward and backward time steps, producing a hidden representation $h_x^i$ of each word $w_x^i$ in the sentence, where $x$ is the $x$th word in the sentence and $i$ indicates the sentence number. $\overrightarrow{h}_x^i$ and $\overleftarrow{h}_x^i$ are the forward and backward hidden representations of $w_x^i$, respectively, which are summed to produce $h_x^i$.
This representation is then passed in parallel to four fully connected modules (one for each of the four tasks), each of which has two hidden layers, to predict the output for each task separately. Each task-specific loss function is assigned a weight in the final loss function to be optimized. Finally, we add four separate output layers with softmax activation functions for the binary and multiclass predictions.
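Putting the shared encoder and the task-specific heads together, a minimal Keras sketch of this architecture could look like the following; the hidden-layer sizes, dropout placement, and the seven-class emotion head (six basic emotions plus "Other") are illustrative assumptions rather than the exact released configuration.

```python
import tensorflow as tf
from transformers import TFRobertaModel

# Output sizes per task: complaint (binary), severity (4), emotion (6 basic + Other), sentiment (3).
TASKS = {"complaint": 2, "severity": 4, "emotion": 7, "sentiment": 3}

def build_mtl_roberta(max_len: int = 50) -> tf.keras.Model:
    input_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")

    # Shared RoBERTa encoder (hard parameter sharing across the four tasks).
    encoder = TFRobertaModel.from_pretrained("roberta-base")
    hidden_states = encoder(input_ids, attention_mask=attention_mask).last_hidden_state

    outputs = {}
    for task, n_classes in TASKS.items():
        # Task-specific BiGRU (128 units per direction; forward/backward states summed).
        x = tf.keras.layers.Bidirectional(
            tf.keras.layers.GRU(128), merge_mode="sum")(hidden_states)
        # Two fully connected hidden layers per task (sizes are illustrative assumptions).
        x = tf.keras.layers.Dense(128, activation="relu")(x)
        x = tf.keras.layers.Dropout(0.3)(x)
        x = tf.keras.layers.Dense(64, activation="relu")(x)
        outputs[task] = tf.keras.layers.Dense(n_classes, activation="softmax", name=task)(x)

    return tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)
```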
3) Calculation of Loss: Categorical cross entropy (CE) losses are calculated for the complaint (C), severity (S), emotion (E), and sentiment (P) tasks. Our proposed model's (MTL-RoBERTa) integrated loss function $J(\theta)$ is realized as the weighted sum of the four task-specific CE losses
$$ J(\theta) = \lambda_C J_{CE}^{C} + \lambda_S J_{CE}^{S} + \lambda_E J_{CE}^{E} + \lambda_P J_{CE}^{P} $$
where all the model parameters to be optimized are denoted by $\theta$ and $\lambda_C$, $\lambda_S$, $\lambda_E$, and $\lambda_P$ are the task-specific loss weights. $J_{CE}$ is the categorical CE loss used for the complaint classification task, defined as
$$ J_{CE} = -\sum_{t} \sum_{i=1}^{N_c} y_t^i \log \hat{y}_t^i $$
where $y_t$ is the actual (one-hot) label of the $t$th instance, $N_c$ is the number of classes, and $\hat{y}_t^i$ is the predicted probability of the $i$th class.
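In code, the weighted multi-task CE objective can be expressed as a simple sum over per-task losses; the lambda weights below are placeholder values rather than tuned ones (in Keras, the same effect is obtained by passing per-head losses and loss_weights to model.compile).

```python
import tensorflow as tf

cce = tf.keras.losses.CategoricalCrossentropy()

# Illustrative lambda weights for the four task heads (placeholders, not tuned values).
LOSS_WEIGHTS = {"complaint": 1.0, "severity": 1.0, "emotion": 0.5, "sentiment": 0.5}

def integrated_loss(y_true: dict, y_pred: dict) -> tf.Tensor:
    """J(theta) = sum_k lambda_k * J_CE^k over the complaint, severity, emotion, and sentiment heads."""
    return tf.add_n([LOSS_WEIGHTS[k] * cce(y_true[k], y_pred[k]) for k in LOSS_WEIGHTS])
```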

V. EXPERIMENTS AND RESULTS
This section discusses the results and analysis of various baseline models and our proposed paradigm, as tested on the FINCORP corpus. The dataset and code will be released publicly via the corresponding repository.

A. Baselines
Since FINCORP is a new dataset, we evaluate it comprehensively against multiple baselines, discussed as follows.

1) Logistic Regression (LR-BOW): To measure the performance of the deep learning models against a simpler reference, we also use logistic regression, a standard machine-learning classifier that has repeatedly served as a strong baseline in many classification tasks. We use logistic regression with standard bag-of-words (BOW) features for the CI and CS tasks (a minimal sketch is given after this list).

2) Single-Task Systems: For single-task CI, we compare the proposed model with the existing CI model (Baseline 2) proposed in [11]. Baseline 2 uses a variety of neural language models aided by transformer networks. Furthermore, for the CI and CS tasks, we develop single-task versions of the proposed model (STL-CI, STL-CS).

3) Multitask System: Based on the existing works in CI, we develop the Baseline 1 model [12] as the multitask baseline. Baseline 1 is a shared-private multitask model for the simultaneous classification of the complaint, emotion, and sentiment tasks. We also build an MTL-GloVe model wherein the textual embeddings are generated using pretrained GloVe vectors [33], which were trained on three different sources of data. The word sequence encoder receives the embedding layer's output and analyzes it to attain contextual knowledge from the text. The rest of the framework is the same.

4) Ablation Models: To evaluate the influence of complaint severity, emotion, and sentiment prediction on the complaint task independently and in varying combinations, we developed dual-task (MTL-CI+ER, MTL-CI+SA, MTL-CI+CS) and tri-task (MTL-CI+ER+SA, MTL-CI+ER+CS, MTL-CI+SA+CS) variants of our proposed model.
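As a rough illustration of the LR-BOW baseline mentioned above, a scikit-learn pipeline of a bag-of-words vectorizer and logistic regression might look like the sketch below; the tweets and labels are toy examples, not FINCORP instances.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Toy training and test sets: tweet texts with binary complaint labels (1 = complaint).
train_texts = ["my card was charged twice and no one is responding",
               "thanks for resolving my loan query quickly"]
train_labels = [1, 0]
test_texts = ["still waiting on my refund after two weeks", "great customer service today"]
test_labels = [1, 0]

lr_bow = make_pipeline(
    CountVectorizer(),                   # standard bag-of-words features
    LogisticRegression(max_iter=1000),   # linear classifier baseline
)
lr_bow.fit(train_texts, train_labels)
preds = lr_bow.predict(test_texts)
print(f1_score(test_labels, preds, average="macro"))
```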

B. Experimental Setup
We use Python-based libraries to implement our proposed framework and the baselines, namely, TensorFlow and Scikit-learn [34]. The pretrained model (RoBERTa) is from the open-source Hugging Face Transformers library. We ran all the experiments on a single NVIDIA GeForce RTX 2080 Ti GPU. We evaluated the predictive performance of the proposed model and the baselines using accuracy and the macro-F1 score. We used 60% of the FINCORP dataset as training data, 10% for validation, and 30% as testing data for all the experiments. A seed value of 40 was used for a fair assessment of the models, allowing them to encounter the same training and testing data. To effectively use the GPU, we kept the batch size at 32. We use the Adam optimizer [35] with a learning rate of 1e-3, selected from {1e-1, 1e-2, 1e-3, 1e-4}. We use the base model for RoBERTa ("roberta-base") and fine-tune it with a learning rate of 5e-5, selected from {1e-4, 5e-5, 1e-5, 1e-6}. The maximum sentence length is set at 50, which spans 93% of the tweets in the training set. We use ReLU [36] activation and a dropout [37] rate of 30%, selected from {20%, 30%, 50%}. The output layers for the complaint, severity, emotion, and sentiment classification tasks use softmax activation with two, four, six, and three neurons, respectively. Categorical CE is used as the loss function to train across all the channels. All these parameter values were chosen following a thorough sensitivity study.
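A sketch of the data split and optimizer setup under these settings might look as follows; the two-stage stratified split is an assumption about how the 60/10/30 ratio with seed 40 could be realized, and the toy text/label lists stand in for the real data loader.

```python
import tensorflow as tf
from sklearn.model_selection import train_test_split

SEED = 40

# Toy stand-ins for the preprocessed tweets and their binary complaint labels.
texts = ["example complaint tweet"] * 10 + ["example noncomplaint tweet"] * 10
labels = [1] * 10 + [0] * 10

# 60% train, then split the remaining 40% into 10% validation and 30% test.
train_x, rest_x, train_y, rest_y = train_test_split(
    texts, labels, test_size=0.4, random_state=SEED, stratify=labels)
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.75, random_state=SEED, stratify=rest_y)

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)  # fine-tuning rate for RoBERTa
BATCH_SIZE = 32
```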
1) Evaluation Metrics: We report the accuracy and macro-F1 (F1) scores for the two primary tasks, CI and CS.
1) Accuracy: The percentage of correctly predicted observations out of the total number of observations
$$ \text{Accuracy} = \frac{\text{Correctly Predicted Observations}}{\text{Total Observations}}. $$
2) F1 Score: The F1-score metric aggregates precision and recall into a single metric by calculating the harmonic mean of both. Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. Recall is the ratio of correctly predicted positive observations to the total number of observations in the actual class. The macro-F1 score is computed by first calculating the F1 score of each class and then taking the mean of the results
$$ \text{macro-F1} = \frac{1}{N_c} \sum_{i=1}^{N_c} \text{F1}_i. $$
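These metrics correspond directly to scikit-learn's accuracy_score and macro-averaged f1_score; the label arrays below are toy values used only to illustrate the calls.

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 1, 0, 1]   # toy gold complaint labels
y_pred = [0, 1, 0, 0, 1]   # toy predictions

print(accuracy_score(y_true, y_pred))             # fraction of correct predictions
print(f1_score(y_true, y_pred, average="macro"))  # per-class F1 averaged over classes
```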

1) Results Discussion: Note that the current article's objective is to boost the efficacy of the CI and CS tasks using the two secondary tasks (ER and SA). As a result, we present the results and analysis with CI and CS as the primary tasks in all task pairings.
The classification results from the various experiments are shown in Table VIII. The proposed model outperforms the multitask baseline (Baseline 1) with improvements of 8.94% and 5.32% in the macro-F1 score for the CI and CS tasks, respectively. It can be observed that combining all the associated tasks, such as severity, emotion, and sentiment, improves the performance compared with the single-task, dual-task, and tri-task variants. This improvement indicates that the proposed model effectively uses the interaction between the four tasks. As seen in Table VIII, the MTL-RoBERTa model, which includes all four tasks (CI, CS, ER, and SA), beats the single-task complaint and severity variants (STL-CI, STL-CS). MTL-CI+CS outperforms MTL-CI+ER and MTL-CI+SA among the dual-task variants. This supports our assumption that the complaint severity-level information can boost the CI task while simultaneously learning these two tasks. This could also be influenced by the notion that sentiment by itself is usually unable to convey essential details about the complainant's predicament. For instance, negative feelings toward a company or a product might be influenced by numerous emotions such as anger and sorrow. Consequently, distinct or minor variations in emotion cannot be accurately determined and conveyed solely through sentiment. Specifically, in the case of the tri-task variants, MTL-CI+ER+CS outperforms the other tri-task baselines with a macro-F1 score of 90.04%. One of the reasons for the low results of the MTL-CI+ER+SA and MTL-CI+SA+CS variants could be that the SA task does not contribute much to the multitask architecture when compared with the severity and emotion tasks.
In Table IX, we report the results of the MTL-RoBERTa model for the complaint severity task. Overall, the MTL-RoBERTa model yields the best result with a macro-F1 score of 75.43%. Moreover, among all the baselines, MTL-CS+ER+CI outperforms the others with a macro-F1 score of 72.89%. This is in line with [4], where the author claims that the way people convey their grievances is linked to their emotional states. The findings reported here are all statistically significant [38]; for the significance test, we used Student's t-test, and when testing the null hypothesis, the results are statistically significant (p-value < 0.05).
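As an illustration of such a check, a Student's t-test over per-run macro-F1 scores could be performed as follows; the score lists are toy numbers, and the choice of an unpaired test over repeated runs is an assumption for this sketch.

```python
from scipy import stats

# Toy macro-F1 scores from repeated runs of the proposed model and a baseline.
proposed_f1 = [0.901, 0.897, 0.905, 0.899, 0.903]
baseline_f1 = [0.882, 0.879, 0.885, 0.880, 0.884]

t_stat, p_value = stats.ttest_ind(proposed_f1, baseline_f1)
print(p_value < 0.05)   # True -> the improvement is significant at the 5% level
```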
2) Comparison With State-of-the-Art (SOTA): For both primary tasks, CI and CS, we compare the proposed model MTL-RoBERTa with the existing state-of-the-art technique [21], a transformer-based model that uses several linguistic features. In contrast, our proposed MTL-RoBERTa model is a multitask framework that efficiently uses the additional information from the sentiment and emotion tasks while jointly learning the main tasks of CI and CS. We reimplement the SOTA model on the FINCORP dataset and report the results in Tables VIII and IX. Even though both models leverage a transformer model as the underlying mechanism, the proposed model's simultaneous learning of correlated tasks further boosts the results.
3) Qualitative Analysis: Table X presents a few examples of MTL-RoBERTa and single-task (complaint task) model predictions. Even though the first and second cases in Table X communicate dissatisfaction in a less explicit tone, the MTL-RoBERTa model accurately predicts the examples as complaints. Even though the third and fourth examples are indirect and vague in nature, the MTL-All model identifies that there is no violation of trust. These examples highlight the importance of detecting complaints based on severity, emotion, and sentiment knowledge.

4) Error Analysis:
We examine some of the plausible reasons for the MTL-RoBERTa model's complaint and severity prediction inaccuracies.
1) Elusive Complaints: Noncomplaints are incorrectly predicted when complaints are communicated implicitly or cryptically. One justification for this could be the complainant's lack of explicit accusation or blame. For example, "Why am I waiting 60 min just so I can ok a debit card transaction!?" The predicted class is noncomplaint, whereas the instance's actual class is complaint. Unless the tweet mentions the particular company or uses a critical phrase, the model misinterprets it as noncomplaint.

2) Neighboring Severity-Level Errors: Instances that belong to neighboring severity levels tend to be semantically and compositionally similar, sometimes leading to misclassification. For example, "You are looting consumer's money. It's very disappointing behavior that they didn't refund back failed transaction money even after weeks and their customer support show us the way to our bank, who doesn't recognize their transaction number"; predicted severity level: "disapproval." The correct severity level for the preceding example is "accusation," but due to the presence of negative terms such as "disappointing," which usually appear in the "disapproval" severity class, the proposed model misclassifies it.

3) Satirical Instances: In instances where the underlying tone is positive or neutral but the instance is of complaint type, the MTL-RoBERTa model inaccurately predicts such instances as noncomplaint. For example, "Yes, I need to pay a transaction that cannot be done online/over the phone because it exceeds the daily limit." For this instance, the correct class is complaint, but the model predicts it as noncomplaint. One reason could be the neutral undertone and the usage of less explicit words to signify a complaint.

4) Probing Complaints: We also identified that the model incorrectly labels cases of weak discontent with an inquiring tone as noncomplaints. For example, "What is this restaurant packaging charges and where is this money going when McDonald's on their website is not charging it?" The correct class for the given example is complaint. The model classifies it as noncomplaint because there is no clear objection, and the sentence is interrogative. To effectively classify such occurrences, the system requires background information and knowledge about the key aspects of the sentence.

5) Limitations of the Study: Even though our model is able to outperform the baselines in various single-task and multitask settings, there are some possible limitations to our approach, as discussed below.
1) The model was trained and evaluated on only English-language tweets. To accommodate other languages, further training in different-language and code-mixed settings would be necessary.
2) People frequently use sarcasm to critique the different issues they face online. However, a specific sarcasm class could not be considered, as only a few curated samples were sarcastic. Hence, complaints with sarcastic remarks could be a possible area of future analysis.
3) The proposed dataset has been curated with complaints specifically originating on Twitter, and the majority of tweets are from users residing in the Indian subcontinent.

VI. CONCLUSION
In this work, we have introduced FINCORP, an annotated dataset of complaints originating between financial institutions and consumers on Twitter. We have addressed the tasks of CI and severity-level prediction (primary tasks) jointly using a multitask framework assisted by emotion and sentiment detection as auxiliary tasks. This article provides a detailed description of the dataset and the complete process of creating it. Every instance in the FINCORP dataset has been annotated along four axes: complaint, severity level, emotion, and sentiment classes, resulting in a highly diverse dataset. We evaluated the dataset and reported the findings using several single-task, multitask, and other existing baselines. Given the scarcity of task-specific (CI) data in English and even some low-resource languages, we believe this dataset will add value to social media analytics research and practice.

We plan to expand our work on CI at the sentence level in English and code-mixed languages in the future. Furthermore, the annotated corpus can aid research in other emotion and sentiment classification tasks.

TABLE I: STATISTICS OF THE EXISTING AND PROPOSED DATASETS FOR CI

TABLE II: COMPARATIVE STUDY OF THE EXISTING LITERATURE BASED ON CONTRIBUTIONS AND LIMITATIONS

TABLE III: ANNOTATION EXAMPLES FOR EACH COMPLAINT SEVERITY LEVEL

TABLE VI: SAMPLE INSTANCES FROM THE FINCORP DATASET. COMP: COMPLAINT, NON-COMP: NONCOMPLAINT

TABLE VIII: RESULTS OF THE PROPOSED MODEL AND OTHER BASELINES WITH RESPECT TO MACRO-F1 SCORE AND ACCURACY FOR THE CI TASK. BOLD-FACED VALUES INDICATE THE MAXIMUM SCORES ACHIEVED. THE * INDICATES THAT THE RESULTS ARE STATISTICALLY SIGNIFICANT

TABLE IX: RESULTS OF THE PROPOSED MODEL AND OTHER BASELINES WITH RESPECT TO MACRO-F1 SCORE AND ACCURACY FOR THE CS TASK. BOLD-FACED VALUES INDICATE THE MAXIMUM SCORES ACHIEVED. THE * INDICATES THAT THE RESULTS ARE STATISTICALLY SIGNIFICANT

TABLE X: EXAMPLE INSTANCES ALONG WITH ASSOCIATED PREDICTED LABELS FOR THE BEST PERFORMING MULTITASK (MTL-ROBERTA) MODEL AND ITS SINGLE-TASK CI EQUIVALENT (STL-CI)