Sentiment Analysis Of Shared Tweets On B.1.1.7 Strain Of SARS-CoV-2 On Twitter With Data Mining Methods

Yaprak Kurtlutepe
Mar 20, 2021

The novel coronavirus disease COVID-19 has become the most serious health epidemic of the 21st century. According to the World Health Organization (WHO), it has spread to more than 150 countries and territories worldwide and caused thousands of deaths. In this research, we propose TF-IDF weighting to explore how Twitter users perceived the mutated coronavirus strain during the pandemic. The related tweets were retrieved from Twitter over one specific time interval and stored as a dataset. Data collection began very soon after the mutated strain was reported, at a time when it was among the most discussed topics on the internet and other media platforms, so the data obtained were as up to date as possible. After the data were cleaned and pre-processed, sentiment analysis was carried out using natural language processing and social network analysis techniques. In this way, we analyzed how society reacted to this phenomenon.

The unexpected rise of the pandemic created a challenging period for 21st-century society, in which vanishing borders have become increasingly commonplace. During this period, each state followed different policies for managing the epidemic, and certain enforced practices became a subject of discussion on social media. Twitter is one of the most active and heated platforms on which these discussions take place, and following hashtags on Twitter is one of the most effective ways to keep up with the world agenda in real time. In this study, we wanted to analyze what the public thought about the mutated coronavirus. According to the TF-IDF-weighted content of the Twitter data, people’s hopes and expectations changed between the first days of the pandemic and the most recent ones. The period leading up to the mutation was one in which the effectiveness of vaccines was increasing. We analyzed how a negative development, the mutation, was reflected in people’s reactions at a time when such an improvement was on the agenda. By analyzing the Tweets that mention the mutated virus, we observed whether people felt negatively or positively about this occurrence. This study is not only an analysis of social behavior; it analyzes that behavior with natural language processing algorithms. Consequently, an algorithm that can correctly learn up-to-date social media jargon was needed, so this dimension of the study also contributes to natural language processing techniques. In this context, machine-learning methods inform us about the emotional states of social media users regarding the coronavirus.

The main question of the study is: “What do people think about the mutated virus?” To give a neutral and objective answer to this question, several decisions must be made first. It is necessary to determine which form of machine learning will be used, how the data will be collected, and which algorithms will be applied. How should the dataset, one of the most essential parts of the study, be prepared? Possible problems in data collection should also be anticipated; for example, will the dataset include any data other than Tweet content? Answering these questions allows the study to contribute to both the social sciences and computer science. Its findings may be multidisciplinary: for the social sciences, how the coronavirus affects human psychology; for computer science, which data mining methods were used in the study.

Attention Was Paid to The Anonymity of The Authors

A dataset is needed to analyze what society thinks of the mutated coronavirus variant. We chose Twitter as the source for this dataset and, while collecting the data, paid attention to the privacy of the users. We included only the Tweet content in the dataset, regardless of who posted it.
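A minimal sketch of this anonymization step, assuming each collected Tweet arrives as a dictionary: only the text is kept and every other field is discarded. The field names (`id`, `author`, `text`) and the sample records are illustrative assumptions, not the study’s actual payloads.

```python
from typing import Iterable, List


def keep_text_only(raw_tweets: Iterable[dict]) -> List[str]:
    """Return only the Tweet text, dropping author and all other metadata."""
    return [tweet["text"] for tweet in raw_tweets if tweet.get("text")]


# Hypothetical payloads; the field names are assumptions for illustration.
raw = [
    {"id": 1, "author": "user_a", "text": "The mutated strain worries me."},
    {"id": 2, "author": "user_b", "text": "Vaccines still give me hope."},
]

dataset = keep_text_only(raw)
print(dataset)
```

Dropping the author at collection time, rather than later, means identifying fields never reach the stored dataset.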

TF-IDF, One of The Data Mining Techniques, Was Used in This Analysis

TF-IDF weighting was applied to the dataset to see which words users preferred in general and to classify them as positive or negative. In this way, the most frequently repeated words were identified.

Provided A Guide For Future Work

Thanks to the fast and reliable results offered by machine learning, the findings of the study can guide future work. In this respect, the study has an interdisciplinary character spanning the social sciences and computer science.

As shown in Figure 1, the architecture of the system consists of three phases: data collection and extraction, data cleaning and pre-processing, and sentiment analysis.

Figure 1. Architecture of the System

Twitter API Key

An application running over the internet exchanges data with the server it is connected to. The server’s main function is to interpret the data it receives, perform the required action, and deliver the resulting package to the client. The application then interprets this package and presents it to the user in a readable format. All of these operations take place through an interface called an API, which stands for Application Programming Interface; an API key identifies the application that is making the requests.

Accordingly, we requested an API key from Twitter to access users’ Tweets directly. Twitter asks applicants to provide some information before granting a key; this step exists to prevent abuse. Twitter assesses the information provided and grants access if it finds no problem.

Data Collection and Extraction

Once API key access is granted, Tweets containing a chosen word can be retrieved with a suitable query. To analyze what society thinks of the mutated coronavirus, we needed an environment in which these thoughts are put into writing; Twitter let us treat everyone who expressed an opinion uniformly. For the query, we chose a word that summarizes the current situation in general terms: “mutated”.

The B.1.1.7 strain of coronavirus began to be discussed frequently after December 15, 2020. Following this date, news about the strain increased on the internet and in the press, and many Tweets containing the word “mutated” started to be posted on Twitter. Preparation of the dataset used in this study began on December 22, 2020.

Several factors should be kept in mind about the dataset, which covers 10 days:

  • Changes in the number of cases,
  • Positive and negative developments in vaccine and drug studies,
  • Comments on and interpretations of political leaders’ administrative measures,
  • People being influenced by each other’s thoughts.

All of these factors affect society’s perspective on the pandemic. It should therefore be kept in mind that this study covers only the sentiment analysis of the period when the mutated virus was most prominent on the agenda.

Tweets were collected in the English language and did not go through any translation process.
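The article does not say which Twitter endpoint or client library was used, so the following is only a sketch of how such a request could be shaped. It assumes the Twitter API v2 recent-search endpoint and its `query`, `start_time`, `end_time` and `max_results` parameters; the function name and the 10-day window starting on December 22, 2020 are assumptions derived from the description above. The code only builds the parameters and sends nothing.

```python
from datetime import datetime, timedelta, timezone

# Assumed endpoint (Twitter API v2 recent search); the study does not name one.
SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"


def build_search_params(keyword, start, days, max_results=100):
    """Build query parameters for English Tweets containing `keyword`."""
    end = start + timedelta(days=days)
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # ISO 8601 timestamps in UTC
    return {
        "query": f"{keyword} lang:en",  # restrict results to English
        "start_time": start.astimezone(timezone.utc).strftime(fmt),
        "end_time": end.astimezone(timezone.utc).strftime(fmt),
        "max_results": max_results,
    }


# Assumed window: 10 days from the day dataset preparation began.
params = build_search_params(
    "mutated", datetime(2020, 12, 22, tzinfo=timezone.utc), days=10
)
print(SEARCH_URL, params)
```

An actual request would also carry the credentials obtained with the API key. The recent-search endpoint only reaches back about a week, so this shows the shape of the query, not a way to re-collect the historical data.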

Data Cleaning and Pre-Processing

After the data are collected, they must be pre-processed so that they can be analyzed properly. The pre-processing step removes duplicated values, URLs, usernames, images and stop-words from the dataset.
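The cleaning steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the study’s pipeline: the stop-word list is a tiny hypothetical sample, images are assumed to appear as `pic.twitter.com` links, and duplicates are detected after cleaning.

```python
import re

# Tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "it"}

# URLs, including image links (assumed to look like pic.twitter.com/...).
URL_RE = re.compile(r"https?://\S+|www\.\S+|pic\.twitter\.com/\S+")
MENTION_RE = re.compile(r"@\w+")  # usernames
TOKEN_RE = re.compile(r"[a-z]+")  # lowercase alphabetic tokens


def clean_tweet(text):
    """Strip URLs and usernames, lowercase, tokenize, drop stop-words."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def preprocess(tweets):
    """Clean every Tweet and keep only the first copy of each duplicate."""
    seen = set()
    cleaned = []
    for text in tweets:
        tokens = clean_tweet(text)
        key = " ".join(tokens)
        if tokens and key not in seen:
            seen.add(key)
            cleaned.append(tokens)
    return cleaned


# Hypothetical sample Tweets for illustration only.
sample = [
    "The mutated virus is spreading https://t.co/abc123 @who",
    "The mutated virus is spreading https://t.co/xyz789",
    "@user_1 Vaccines are a reason to hope!",
]
cleaned = preprocess(sample)
print(cleaned)
```

Comparing Tweets after cleaning means that two posts differing only in their shortened link count as duplicates.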

Sentiment Analysis

Sentiment analysis, the main purpose of this project, is approached here as a machine learning technique, in line with modern approaches to the task. In this approach, a label is assigned to each sentence used to train the machine learning algorithm so that it can learn polarity or sentiment. (Devika & Ganesh, 2017) The term has become common in natural language processing as this type of analysis has grown popular in computer science. Sentiment analysis is a method for analyzing text data and determining its main ideas, which is why the process is also called opinion mining. (Pedipina, et.al., 2020)

Sentiment analysis is categorized into two levels depending on the text data whose sentiment is to be determined. (Pedipina, et.al., 2020) The method applied in this study extracts two main opinion polarities, positive and negative, from Tweets. This process is used to understand the main point of each Tweet. (Schouten, Weijde, et al., 2017)

In the applied NLP algorithm, a TF-IDF weighting scheme is used along with a recurrent Long Short-Term Memory (LSTM) layer to classify Tweets as having positive or negative polarity.

TF-IDF Weighting

TF-IDF stands for Term Frequency-Inverse Document Frequency. When TF is calculated, every word is treated as equally important. A word that appears frequently has a high term frequency (TF) value, but that alone does not make it important to the dataset as a whole: a word that appears throughout the dataset carries less information than words that are less frequent.

For this reason, frequent words need to be weighted down and rare words scaled up; the resulting weight determines the importance of each term.

The weighting is defined mathematically: a term’s TF-IDF weight is the product of its term frequency and its inverse document frequency.

TF-IDF is a statistical weighting used in information retrieval and text mining. It measures how important a word is to a document within a corpus, which helps us understand the document or dataset in question. (For example, if a word appears many times in a dataset, it has a high term frequency compared with other words in the dataset.)
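To make the weighting concrete, here is a small from-scratch sketch. It assumes one common formulation, tf = count / document length and idf = ln(N / df), since the article does not state which variant was used; the token lists are hypothetical examples.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical cleaned Tweets for illustration only.
docs = [
    ["mutated", "virus", "spread"],
    ["mutated", "virus", "danger"],
    ["mutated", "vaccine", "hope"],
]
weights = tf_idf(docs)
print(weights[0])
```

In this toy corpus “mutated” occurs in every document, so its inverse document frequency is ln(1) = 0 and its weight vanishes, while the rarer “spread” receives the highest weight in the first document. This is the down-weighting of ubiquitous words described above.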

Monkey Learn Online Sentiment Analysis Tool

Monkey Learn is an online machine learning platform for text analysis. It allows users to extract actionable data from raw text.

It is used as a supplementary part of this work: when the data are analyzed on this platform, the result is a negative or positive polarity.

Results

The output of the TF-IDF weighting shows that social media users’ Tweets lean toward negative polarity. The most common words are:

  • Spread
  • Dead
  • Danger
  • Health
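One simple way to read such a word list as a polarity signal is a lexicon lookup. The sketch below is only an illustration of that idea, not the LSTM classifier used in the study: the lexicon and its scores are hypothetical, and words missing from it count as neutral.

```python
# Hypothetical polarity lexicon; entries and scores are assumptions.
LEXICON = {
    "spread": -1,
    "dead": -1,
    "danger": -1,
    "hope": 1,
    "recovery": 1,
}


def polarity(terms):
    """Sum lexicon scores; unknown words contribute 0."""
    score = sum(LEXICON.get(term.lower(), 0) for term in terms)
    if score < 0:
        return "negative"
    if score > 0:
        return "positive"
    return "neutral"


top_terms = ["Spread", "Dead", "Danger", "Health"]
print(polarity(top_terms))  # negative
```

With this lexicon three of the four words score negatively and “Health” is unscored, so the list as a whole comes out negative.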

This study also used the online text analysis tool Monkey Learn, whose analysis reports negative polarity with 63.2% confidence.

The results of this research indicate that the mutated coronavirus, the B.1.1.7 strain of SARS-CoV-2, induced a negative perception among social media users. It should be borne in mind that this finding may not stem solely from individual reasons; as discussed above, several factors affect the results of this analysis.

Based on these findings, the mutated virus appears to create negative feelings in society.

Two further questions arise from this study, which does not cover people who do not use social media:

  • How do people who do not use social media perceive the pandemic?
  • Can people using social media easily influence each other’s opinions about the pandemic? (To answer this question, the methodology would also need to take the identity of Tweet authors into account during data collection, because an author’s influence on the public depends on how many followers they have and how prominent they are in their region.)

Thus, a preliminary result has been obtained to inform future studies.

References

Devika & Ganesh (2017). Devika, M. D., Sunitha, C., & Ganesh, A. (2016). Sentiment analysis: a comparative study on different approaches. Procedia Computer Science, 87, 44–49.

Pedipina, et.al. (2020). Pedipina, S., Sankar, S., & Dhanalakshmi, R. Sentimental Analysis On Twitter Data Of Political Domain.

Schouten, Weijde, et al. (2017). Schouten, K., Van Der Weijde, O., Frasincar, F., & Dekker, R. (2017). Supervised and unsupervised aspect category detection for sentiment analysis with co-occurrence data. IEEE transactions on cybernetics, 48(4), 1263–1275.
