Text Sentiment Analysis Project with LSTM, CNN

This is a supervised learning project for an SJSU course. The text sentiment analysis consists of three parts: text data preprocessing, text data representation, and three models (CNN, KNN, LSTM). I used Python to implement the models and generate the results. Here are the details.

  1. The data preprocessing

Here is the sample data.

The number on the left side is the target sentiment label, and the text on the right side is the input text. Judging from the samples, the data was most likely collected by the course professor from Twitter-like sources.

The raw text contains many tokens that are unnecessary for sentiment analysis. Preprocessing the data helps reduce the computational effort by decreasing the data dimension (i.e., fewer words per sentence) and also helps reduce noise. The process includes the following steps:

  1. )Delete the stop words.

The sentiment of a sentence does not depend on the stop words. For example, “wow, I got a birthday gift” and “Wow, got birthday gift” carry the same sentiment from a human perspective, although they differ in the machine’s binary data view.

  2. )Stem the words.

We can reduce the dimension by mapping different inflections of a word onto the same stem. For example, “swimming is good for health” and “swim is good for health” have the same sentimental meaning.

  3. )Proofread the sentences.

Spelling errors in the sentences are inevitable, and they create unnecessary extra vocabulary entries in our data.

  4. )Drop some punctuation.

This may be a point of argument. Some punctuation has special meaning; for example, the exclamation mark (!) may contribute to the sentiment. However, the period, the comma, and the semicolon are useless here, and we should drop all of them.

I handled all of this preprocessing using the TextBlob package. The Python statements below implement the steps described above.
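As a minimal stdlib-only sketch of the pipeline (the actual project used TextBlob, whose `TextBlob(text).correct()` handles the proofreading step; the tiny stop-word list and suffix-stripping "stemmer" below are simplified stand-ins for real stop-word lists and stemmers):

```python
import string

# A small hand-picked stop-word list; a real run would use a full list
# such as the one shipped with NLTK or TextBlob.
STOP_WORDS = {"i", "a", "an", "the", "is", "are", "was", "were", "to", "of"}

# Punctuation to drop; we keep "!" because it may carry sentiment.
DROP_PUNCT = set(string.punctuation) - {"!"}

def naive_stem(word):
    """Very rough stand-in for a real stemmer: strip common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # 1) lowercase and drop unwanted punctuation characters
    cleaned = "".join(c for c in text.lower() if c not in DROP_PUNCT)
    # 2) remove stop words, then 3) stem what remains
    return [naive_stem(w) for w in cleaned.split() if w not in STOP_WORDS]

print(preprocess("Wow, I got a birthday gift!"))
# → ['wow', 'got', 'birthday', 'gift!']
```

The comma disappears, the stop words "I" and "a" are removed, and the "!" survives as part of the last token.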

  2. The text data representation

This article shows two different types of text data representation: the word2vec model and the word frequency model.

  1. )The word2vec model

As we know, words in our language are just symbols. They have meaning because the human brain maps each symbol to some sentimental meaning. The human brain provides the abstract relationships between words and sentences, on up to paragraphs and novels.

However, word symbols in the computer world are just binary codes (such as UTF-8 or Latin encodings), which are meaningless to the machine unless it knows the relationships between the words. That is what the word2vec model was invented for. Word2vec translates each word into a vector, and relationships between words can then be derived from vector calculations such as distance. Here is an example of getting the most similar word to “Shouldn’t”.

The example shows that the word2vec model does capture relationships between words.
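The most-similar lookup (in gensim it is `KeyedVectors.most_similar`) boils down to ranking every other word by cosine similarity. A stdlib sketch with made-up 3-d toy vectors in place of the real 200-d pretrained vectors:

```python
import math

# Toy 3-d vectors standing in for real 200-d word2vec/GloVe vectors;
# the numbers are invented purely for illustration.
vectors = {
    "shouldn't": [0.90, 0.10, 0.00],
    "wouldn't":  [0.80, 0.20, 0.10],
    "couldn't":  [0.85, 0.15, 0.05],
    "banana":    [0.00, 0.10, 0.90],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def most_similar(word, topn=1):
    """Rank every other word by cosine similarity to `word`."""
    target = vectors[word]
    scores = [(other, cosine(target, v))
              for other, v in vectors.items() if other != word]
    return sorted(scores, key=lambda p: p[1], reverse=True)[:topn]

print(most_similar("shouldn't"))  # "couldn't" ranks first, "banana" last
```

With real pretrained vectors the same ranking computation surfaces semantically related words, exactly as in the article's example.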

  2. )The word frequency data model

We count each word’s frequency to form the vector of a sentence. In this article, we use the Keras Tokenizer function to get the result.
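Conceptually this is what Keras's `Tokenizer` plus `texts_to_matrix(mode="count")` produce: a fixed vocabulary learned from the corpus, then one count vector per sentence. A stdlib sketch of that idea (not the Keras API itself):

```python
from collections import Counter

def fit_vocabulary(texts):
    """Learn a vocabulary from the corpus, most frequent words first."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return [w for w, _ in counts.most_common()]

def texts_to_count_matrix(texts, vocab):
    """One row per text; each column is that word's count in the text."""
    matrix = []
    for t in texts:
        bag = Counter(t.lower().split())
        matrix.append([bag[w] for w in vocab])
    return matrix

texts = ["good good movie", "bad movie"]
vocab = fit_vocabulary(texts)
print(vocab)
print(texts_to_count_matrix(texts, vocab))
```

Each sentence becomes a fixed-length vector regardless of its word order, which is exactly the property the frequency model trades away in exchange for simplicity.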

  3. Imbalanced data handling

Most learning algorithms are built to maximize total accuracy, so imbalanced data biases them toward the majority class. In the extreme case, the model may predict only the majority class for binary-class data.

We will use a resampling technique to make the count of each minority class the same as that of the majority class. Here is the original frequency of the different classes in our sample data.

So, through resampling, the minority classes 5, 11, 6, 9, 8, 7, 4, 1, 10, and 2 will all have the same frequency as class 3.
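A minimal sketch of that upsampling step, assuming the data is a list of `(label, text)` pairs (the real project's data layout may differ): each minority class is padded with randomly re-drawn copies of its own rows until it matches the majority-class count.

```python
import random
from collections import Counter

def upsample(data, seed=0):
    """Balance classes by sampling minority rows with replacement."""
    rng = random.Random(seed)
    by_label = {}
    for label, text in data:
        by_label.setdefault(label, []).append((label, text))
    target = max(len(rows) for rows in by_label.values())
    balanced = []
    for rows in by_label.values():
        balanced.extend(rows)
        # add randomly re-drawn copies until this class reaches the target
        balanced.extend(rng.choices(rows, k=target - len(rows)))
    return balanced

data = [(3, "a"), (3, "b"), (3, "c"), (5, "d"), (11, "e")]
print(Counter(label for label, _ in upsample(data)))
# every class ends up with the majority count of 3
```

Sampling with replacement means minority rows appear multiple times; that is the standard trade-off of upsampling versus downsampling the majority class.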

  4. Three-model comparison
  1. )NN model with the word2vec data model

For each training text, we take each word’s word2vec vector, which has a dimension of 200 per word. We then average those vectors to get one word2vec vector per training text as the model’s input. Below is the model diagram generated by Keras.
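The averaging step itself can be sketched as follows, with made-up 3-d toy vectors standing in for the real 200-d pretrained ones; words missing from the vocabulary are simply skipped, which is one common (assumed) way to handle out-of-vocabulary tokens:

```python
# Toy vectors standing in for 200-d glove.twitter word2vec vectors.
vectors = {
    "good":  [0.9, 0.1, 0.2],
    "movie": [0.1, 0.5, 0.4],
}

def average_vector(words, vectors, dim=3):
    """Average the known word vectors of a text into one fixed-size row."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:                      # no known word: fall back to zeros
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]

print(average_vector(["good", "movie", "unknownword"], vectors))
```

Every training text, whatever its length, becomes one fixed-size row, which is what lets a plain feed-forward NN consume it.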


And here is the per-epoch training detail. From the result, we can see that the training loss decreases a lot; however, the validation accuracy does not increase much.


  2. )RNN (LSTM) model with the frequency data model

LSTM is useful when we need to consider the relationships between words in a sentence. The input is the digitized words of each training record. I tried a simple LSTM without the CNN, but the result was not very good.

Below, I add the LSTM into the CNN model as one of the four scoring criteria.


  3. )KNN with the word frequency model

Below is a code excerpt. We construct the training data frequency matrix via the Keras Tokenizer function.

After that, we use scikit-learn’s KNeighborsClassifier for the training.
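What `KNeighborsClassifier` does at prediction time can be sketched in plain Python: find the k training rows nearest to the query (Euclidean distance by default) and take a majority vote over their labels. The toy two-column frequency rows below are invented for illustration:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Majority vote over the k nearest training rows (Euclidean)."""
    dists = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_X, train_y)
    )
    top_labels = [label for _, label in dists[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Toy word-frequency rows: [count("good"), count("bad")]
train_X = [[2, 0], [3, 1], [0, 2], [1, 3]]
train_y = ["pos", "pos", "neg", "neg"]
print(knn_predict(train_X, train_y, [2, 1], k=3))  # → "pos"
```

KNN needs no training phase beyond storing the matrix, which is why the frequency representation plugs into it so directly; the cost is that every prediction scans the whole training set.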

  5. Summary

According to the validation accuracy results, the LSTM model is the better choice. The course project gave me many chances to try different models and to write Python programs that use the pre-trained glove.twitter.27B.200d word2vec vectors to compensate for our limited sample data. I tried many models, such as CNN, NN, and SVM. It was a good journey for the project.
