Information Extraction From Text Messages Using Data Mining Techniques


Abstract

We are living in an era of high pressure and mental disorders. High levels of stress and pressure contribute to a growing number of people attempting, and dying by, suicide. Stress can be caused by family disputes, job dissatisfaction, health issues, etc. In the world of modern computing, people feel free to share their views and feelings with peers and family members over social media, for example through messaging services. Because of people’s reserved nature and busy schedules, it is difficult to interact with neighbours and family members in person, so social media platforms have become the most used channel for personal conversations. The aim of this paper is to estimate the suicidal tendencies of a person by applying data mining techniques to the text messages that person sends to others. By analysing the components of the text messages (keywords and emoticons), we can estimate these tendencies so that the necessary steps can be taken to save the person’s life.

Keywords

Text mining, knowledge discovery, sentiment analysis, opinion mining.


1. Introduction

Data analysis techniques need to be applied to text messages because of increasing suicide rates in different parts of the world. Saving human lives is a priority for the growth of a nation. In order to save people’s lives, their sentiments must be known so that the required steps can be taken in time. A practical way to learn about the sentiment of a person is to apply data mining techniques to the text messages that person sends. If a person shows signs of severe stress, then informing the people close to that person can help in saving that person’s life.

Text pre-processing is applied to the text obtained from the user. It involves tokenization, stop-word removal, stemming and some other techniques. Tokenization splits the text into words called tokens and is used to identify keywords in the stream of text. Stop-word removal removes words that do not convey a special meaning in the document, such as the, and, this, etc. Stemming obtains the root word of each token by removing suffixes like -ing, -ion, etc.

This paper focuses on sentiment analysis for predicting the stress level of a person. The prediction model comprises the SVM and K-NN algorithms and is trained by feeding the system a training data set. The framework can also be used in other domains. Applied at a larger scale and to multiple subjects, it could be used to predict election results or to gauge people’s opinions on different matters. It could also be used to obtain early warning of terror attacks or unorganized violent protests.

Emoticons are a very important part of textual messages on the internet. They help convey messages expressively and capture the real essence of a conversation between two parties. Hence, it is important to analyse the emoticons used in a text message so that the real sentiment of the text can be recovered.

2. Proposed Methodology

The proposed methodology aims to help save the lives of people who may be suffering from excessive stress or any other factor that may harm them.

The aim is to extract information from the text messages of the user and use it for purposes such as sentiment analysis. The model also analyses emoticons in order to parse the statements completely.

The information is obtained by fetching all the text messages sent by a particular person. These can be gathered from multiple sources such as Facebook, WhatsApp, etc. All the messages sent through WhatsApp or Facebook are stored in a database, where we can apply our model and analyse the sentiments. The information consists of text and emoticons. No other form of data, such as images, is analysed by this model.

3. Model Components

Sentiment analysis

In this component, each message is assigned a sentiment, such as positive or negative, together with its extent. This is done by pre-processing the data and then classifying it with the SVM algorithm.

Text Pre-processing

The processes involved in text pre-processing are as follows.

Tokenization: Every new message is split into meaningful words called tokens. Example: “Morning walk is a bliss” is converted to “Morning” “walk” “is” “a” “bliss”.

Data standardization: All words in the message are converted to a standard form by changing them to lower case.

Example. “The market is near Puneet’s house” is converted to “the market is near puneet’s house”.

Emoji conversion: The emoticons present in the text messages are assigned a keyword based on the expression they convey.

The emoticons are classified into the following two categories.

Positive emoticons: these emoticons convey positive sentiment and are replaced by positive words based on the symbol.

Figure 1. Positive Emoticons

Negative emoticons: these emoticons reflect the sad or disturbed sentiments of a person and are thus replaced by negative words.

Figure 2. Negative Emoticons

Stop-word removal: All the words in the message that do not convey a special meaning, such as a, the, then, etc., are removed.

Stemming: The root word corresponding to every word is obtained by dropping suffixes like -ing, -ion, -ed, etc.

Abbreviation analysis: The abbreviations present in the message are replaced by their full forms. Example: FB by Facebook, GN by good night, etc.

Figure 3. Steps in Text Message Processing
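The pre-processing steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lookup tables (`EMOTICON_WORDS`, `ABBREVIATIONS`, `STOP_WORDS`) and the suffix list are hypothetical placeholders, since the paper does not list its actual mappings, and the suffix stripper is a crude stand-in for a real stemmer.

```python
import re

# Hypothetical lookup tables for illustration; the paper does not give its own.
EMOTICON_WORDS = {":)": "happy", ":-)": "happy", ":(": "sad", ":'(": "crying"}
ABBREVIATIONS = {"fb": "facebook", "gn": "good night"}
STOP_WORDS = {"a", "an", "the", "then", "and", "this", "is"}
SUFFIXES = ("ing", "ion", "ed")


def stem(word):
    """Crude suffix stripping; a real system would use a proper stemmer."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(message):
    """Return the list of cleaned tokens for one text message."""
    # Emoji conversion: replace each known emoticon with its keyword,
    # longest emoticon first so that no emoticon is partially replaced.
    for emoticon in sorted(EMOTICON_WORDS, key=len, reverse=True):
        message = message.replace(emoticon, f" {EMOTICON_WORDS[emoticon]} ")
    # Data standardization: convert everything to lower case.
    message = message.lower()
    # Tokenization: split into words, keeping apostrophes inside words.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", message)
    # Abbreviation analysis: expand abbreviations to their full forms.
    expanded = []
    for token in tokens:
        expanded.extend(ABBREVIATIONS.get(token, token).split())
    # Stop-word removal followed by stemming.
    return [stem(token) for token in expanded if token not in STOP_WORDS]


print(preprocess("Morning walk is a bliss :)"))
print(preprocess("GN, feeling low :("))
```

The emoticons are converted before lower-casing and tokenization so that their punctuation is not discarded by the tokenizer. Note that the crude stemmer also shortens "morning" to "morn", which is one reason an established stemming algorithm would be preferable in practice.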

N-gram

The next step after data pre-processing is N-gram feature extraction. An N-gram is a sequence of n consecutive tokens, and N-gram models are widely used in NLP tasks. The model creates N-grams from the messages in the data set to extract keyword features.

For n = 3, every sequence of three consecutive words in each message is generated. Because each feature captures a combination of three tokens, and hence some word context, N-grams can improve the accuracy of the classification step.

Example. “What is your name” is analysed as “what is your” “is your name”.
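The example above can be reproduced with a short sliding-window function. This is a minimal sketch; the function name `ngrams` is our own choice and not taken from the paper.

```python
def ngrams(tokens, n=3):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "what is your name".split()
trigrams = ngrams(tokens, n=3)
print(trigrams)  # ['what is your', 'is your name']
```

A message with fewer than n tokens yields no N-grams at all, so very short messages may need a smaller n or a fallback to single tokens.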

Term Frequency

The number of times a token occurs in a data sample is called its term frequency. Words with a high frequency are more strongly associated with the sample.
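As a small illustration, the raw term frequency of each token in one message can be counted as follows. The sample tokens are made up for the example.

```python
from collections import Counter


def term_frequency(tokens):
    """Count how many times each token occurs in one data sample."""
    return dict(Counter(tokens))


tf = term_frequency(["sad", "tired", "sad"])
print(tf)  # {'sad': 2, 'tired': 1}
```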

Inverse Document Frequency

The inverse document frequency (IDF) factor is used to diminish the weight of words that occur very often across the data set and to increase the weight of words that occur rarely.
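One common way to compute this factor is idf(t) = log(N / df(t)), where N is the number of messages and df(t) is the number of messages containing the token t. The paper does not state which variant it uses, so the formula and the sample messages below are assumptions made for illustration.

```python
import math


def inverse_document_frequency(documents):
    """Map each token to log(N / df), where df counts the documents containing it."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        for term in set(tokens):  # count each token once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Hypothetical tokenized messages.
messages = [["i", "feel", "sad"], ["i", "feel", "fine"], ["i", "am", "tired"]]
idf = inverse_document_frequency(messages)
print(idf["i"], idf["feel"], idf["sad"])
```

A token that appears in every message, such as "i" here, receives a weight of zero, while a token that appears in only one message receives the largest weight. Multiplying a token's term frequency by this factor gives its TF-IDF weight.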

Support Vector Machines

The stream of words that results from the text pre-processing step is processed by the SVM algorithm in order to classify each message as having “normal” or “critical” sentiment. The process is applied to every message in the data set, so that every chat is assigned one of the two sentiments. We thus obtain a sentiment for each of the messages associated with the user. SVMs are supervised learning models used for classification and regression analysis. An SVM model represents examples as points in space, with the different classes of examples divided by a gap that is as wide as possible. New examples are mapped into the same space and predicted to belong to a class based on the side of the gap on which they fall.
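The idea of a separating gap can be illustrated with a small linear SVM trained by sub-gradient descent on the regularised hinge loss. This is a sketch under assumptions and not the authors' implementation: the two features (counts of negative and positive keywords per message), the toy data and the hyperparameters are all hypothetical, and a real system would more likely rely on an established SVM library with TF-IDF features.

```python
def train_linear_svm(samples, labels, lr=0.1, lam=0.01, epochs=200):
    """Fit weights w and bias b by sub-gradient descent on the hinge loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regulariser always shrinks w, which favours a wide gap.
            w = [wi * (1 - lr * lam) for wi in w]
            if margin < 1:
                # The sample lies inside the gap or on the wrong side: push it out.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Classify by the side of the separating line on which x falls."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "critical" if score >= 0 else "normal"


# Hypothetical features: (negative-keyword count, positive-keyword count).
samples = [(3, 0), (0, 3), (2, 1), (1, 2), (4, 1), (1, 4)]
labels = [1, -1, 1, -1, 1, -1]  # +1 = "critical", -1 = "normal"

w, b = train_linear_svm(samples, labels)
print(predict(w, b, (5, 0)), predict(w, b, (0, 5)))
```

On this toy data the learned weight for negative keywords is positive and the weight for positive keywords is negative, so messages dominated by negative keywords fall on the “critical” side of the gap.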

KNN Algorithm

The collected feature set is divided into training and testing sets, and the KNN algorithm is used to predict the sentiment. KNN classifies a data point based on the nearest training samples in the feature space: the point is assigned the class that is most common among its K nearest instances in the training set. KNN is a lazy learner, meaning that it builds no explicit model and defers computation until a prediction is requested. It is considered a flexible and simple classification technique.
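A minimal KNN classifier with majority voting might look as follows. The feature vectors and labels are made-up toy values, and Euclidean distance is an assumption, since the paper does not name a distance measure.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Assign the query the most common label among its k nearest neighbours."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training set: (negative-keyword count, positive-keyword count).
points = [(3, 0), (2, 1), (4, 1), (0, 3), (1, 2), (1, 4)]
labels = ["critical", "critical", "critical", "normal", "normal", "normal"]

print(knn_predict(points, labels, (3, 1)))  # critical
print(knn_predict(points, labels, (0, 4)))  # normal
```

Because KNN stores the training set and does all of its work at prediction time, there is no training step in the code, which is what the lazy-learner description refers to.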

4. Result Analysis

The proposed model outputs an estimated sentiment for the subject based on the text messages the subject has sent. This output can be used in many situations. The subject’s stress level is estimated, and in the case of “critical” sentiment, the peers and family members of the subject can take action to encourage, motivate and uplift the subject, helping to restore the subject’s peace of mind. Such sentiment analysis models can therefore contribute to a healthier society.

5. Future scope

The proposed model can be used wherever sentiment analysis is required, for purposes such as analysing reviews of hotels, movies, videos, etc. Until now, sentiment analysis methods have mainly been used to detect the polarity of the thoughts and opinions of social media users. Businesses are very interested in understanding what people think and how they respond to the products and services around them. Companies use sentiment analysis to evaluate their advertising campaigns and to improve their products. They aim to use such sentiment analysis tools in the areas of customer feedback, marketing, CRM, and e-commerce.

Figure 4. Different Steps in Data Processing and Analysis

The output obtained from the Support Vector Machines algorithm [9] is two clusters of sentiments with the class labels “normal” and “critical”. Based on this output, the KNN algorithm is applied to deduce the overall sentiment of the subject. The input to the KNN algorithm is the set of sentiments associated with all the chats in which the subject is involved. The last step is to predict the sentiment of the person based on the collected feature set.

6. Conclusion

The proposed model takes as input a data set created by accumulating all the text messages sent by the subject. The messages may come from different social media platforms such as Facebook, WhatsApp, etc. The messages are then pre-processed to obtain the keywords from the data set. After pre-processing, we extract n-gram features. Weighting the features using TF-IDF helps the classification algorithms. The next step is to classify the conversations as “normal” or “critical”. First, a supervised algorithm, SVM, is used to label the messages. Then the KNN algorithm, an instance-based (lazy) classifier, is applied to the resulting labels to deduce the overall sentiment of the subject. We thus propose a method for finding the sentiment of a person by analysing text messages, including the emoticons they contain. Emoticons are very common tokens in text messages today, so we must also focus on efficient ways to analyse them; in our model, emoticons are converted to textual form before further processing. Such a model could help save lives in the modern world.
