Text Sentiment Analysis Project with LSTM, CNN
This is a supervised learning project. The text sentiment analysis consists of three parts: text data preprocessing, text data representation, and a comparison of three models (CNN, KNN, LSTM). This is an SJSU course project; I used Python to implement the models and generate the results. Here are the details.
- The data preprocessing
Here is the sample data.
The number on the left side is the target sentiment label, and the text on the right side is the input text. Judging from the samples, the data most likely comes from Twitter-like posts collected by the professor of the course.
The text data contains many tokens that are unnecessary for sentiment analysis. Preprocessing helps reduce the computational effort by decreasing the data dimension (i.e., fewer words per sentence), and it also helps reduce noise in the data. The process includes the following parts:
- )Delete stop words.
A sentence's sentiment does not depend on its subject. For example, "wow, I got a birthday gift" and "Wow, got birthday gift" express the same sentiment from a human perspective, even though they differ in the machine's binary view.
- )Reduce words to their stems.
We can reduce the dimension by merging the different inflected forms of a word into its stem. For example, "swimming is good for health" and "swim is good for health" have the same sentimental meaning.
- )Proofread the sentences.
Spelling errors in the sentences are inevitable, and they create unnecessary extra vocabulary for our data handling.
- )Drop some punctuation.
This may be a point of debate. Some punctuation can carry special meaning; for example, the exclamation mark (!) may contribute to the sentiment. However, the period, the comma, and the semicolon (;) are useless, so we drop all of them.
All of this preprocessing is handled with the TextBlob package. You can use the Python statements below to implement the steps above.
- The text data representation.
This article shows two different types of text data representation: the word2vec model and the word-frequency model.
- )The word2vec model
As we know, the words in our language are just symbols. They have meaning because the human brain maps each symbol onto some meaning; our brains provide the abstract relationships between words and sentences, all the way up to paragraphs and novels.
However, a word in a computer is just binary code (such as UTF-8 or Latin encoding), which is meaningless to the machine if it does not know the relationships between those words. That is the motivation for the word2vec model: it translates each word into a vector, and the relationships between words can then be computed from the vectors, for example as distances. Here is an example of finding the most similar words to "shouldn't".
The example shows that the word2vec model does capture the relationships between words.
- )The word-frequency data model
We count the frequency of each word to form a vector for each sentence. In this article, we use the Keras Tokenizer function to get the result.
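For example, with the Keras Tokenizer (the sample sentences here are illustrative):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["wow got birthday gift", "swim good health", "gift gift good"]

tok = Tokenizer()
tok.fit_on_texts(texts)
# mode="count" gives one term-frequency vector per sentence; columns are
# indexed by the learned vocabulary (column 0 is reserved and unused).
matrix = tok.texts_to_matrix(texts, mode="count")
print(matrix.shape)  # (number of sentences, vocabulary size + 1)
```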
- Imbalanced data handling
Most learning algorithms are built to maximize total accuracy, so imbalanced data biases the algorithm toward the majority class. In the extreme case, the model may predict only the majority class on binary-class data.
We use the resampling technique to make the counts of the minority classes the same as the majority class. Here is the original frequency of each class in our sample data.
So, through resampling, the minority classes 5, 11, 6, 9, 8, 7, 4, 1, 10, and 2 will all end up with the same frequency as class 3.
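A sketch of this upsampling with scikit-learn's resample utility (the toy DataFrame below is illustrative, not the project data):

```python
import pandas as pd
from sklearn.utils import resample

def balance(df):
    # Upsample every class (with replacement) to the size of the largest
    # class -- class 3 in our data.
    max_n = df["label"].value_counts().max()
    parts = [resample(grp, replace=True, n_samples=max_n, random_state=42)
             for _, grp in df.groupby("label")]
    return pd.concat(parts)

toy = pd.DataFrame({"label": [3, 3, 3, 5, 11],
                    "text": ["t1", "t2", "t3", "t4", "t5"]})
balanced = balance(toy)
print(balanced["label"].value_counts())  # every class now has 3 rows
```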
- Comparison of the three models
- )NN model with the word2vec data model
For each training text, we look up each word's word2vec vector, which has 200 dimensions. We then average the word2vec vectors over the text to form the model's input. Below is the model diagram generated by Keras.
And here is the epoch-by-epoch training detail. From the result, we can see that the training loss decreases a lot; however, the validation accuracy does not improve much.
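The averaging step and a classifier of roughly this shape can be sketched as follows; the hidden-layer sizes are assumptions rather than the exact project model, and the 11 output units match the 11 sentiment classes in the sample data:

```python
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

def average_vector(words, wv, dim=200):
    # Mean of the word2vec vectors of the words found in the vocabulary;
    # a zero vector if none of the words are known.
    vecs = [wv[w] for w in words if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Feed-forward classifier on the 200-d averaged vectors.
model = Sequential([
    Input(shape=(200,)),
    Dense(128, activation="relu"),
    Dense(64, activation="relu"),
    Dense(11, activation="softmax"),  # one unit per sentiment class
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```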
- )RNN (LSTM) model with the frequency data model
An LSTM is useful when we need to consider the relationships between the words in a sentence. The input is the digitized word sequence of each training record. I tried a plain LSTM without the CNN, but the result was not very good.
Below, I add the LSTM to the CNN model as one of the four scoring criteria.
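A sketch of such a CNN+LSTM stack (the vocabulary size, padded sequence length, and layer sizes below are assumptions, not the article's exact settings):

```python
from tensorflow.keras import Input
from tensorflow.keras.layers import Conv1D, Dense, Embedding, LSTM, MaxPooling1D
from tensorflow.keras.models import Sequential

MAX_WORDS, MAX_LEN = 5000, 40  # assumed vocabulary size / padded length

# The Conv1D front-end extracts local n-gram features; the LSTM then models
# word order over the pooled feature sequence.
model = Sequential([
    Input(shape=(MAX_LEN,)),
    Embedding(MAX_WORDS, 128),
    Conv1D(64, 5, activation="relu"),
    MaxPooling1D(4),
    LSTM(64),
    Dense(11, activation="softmax"),  # one unit per sentiment class
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```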
- )KNN with the word-frequency model
Below is a code excerpt. We construct the training data's word-frequency matrix via the Keras Tokenizer function.
After that, we use the scikit-learn KNeighborsClassifier for training.
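A runnable sketch of that pipeline (the toy texts and labels are illustrative):

```python
from sklearn.neighbors import KNeighborsClassifier
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["wow got birthday gift", "sad bad day", "gift good day", "bad sad news"]
labels = [1, 0, 1, 0]  # toy binary sentiment labels

tok = Tokenizer()
tok.fit_on_texts(texts)
X = tok.texts_to_matrix(texts, mode="count")  # word-frequency matrix

# Classify a new sentence by the majority label of its 3 nearest neighbours.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, labels)
print(knn.predict(tok.texts_to_matrix(["good gift"], mode="count")))  # [1]
```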
According to the validation accuracy results, the LSTM model is the better approach. The course project gave me many chances to try different models and to write Python programs that use the pre-trained glove.twitter.27B.200d word2vec vectors to compensate for our limited sample data. I tried many models, such as CNN, NN, SVM, and KNN. It was a good journey.