This project was created as a course deliverable for the ITU Data Mining course. The main goal of the project was to try out existing natural language processing (NLP) techniques.
Why sentiment analysis?
The demand for sentiment analysis has grown significantly in recent years, as the rate of data creation keeps accelerating. Reports, reviews and scraped news articles usually contain useful information that can serve different purposes, such as business intelligence and customer analysis. The problem with manual sentiment analysis is that only a small part of a large text is actually useful, and datasets containing thousands of lines of text are impossible for any human being to process. For this purpose, different text mining and natural language processing methods are used. With these methods it becomes possible to filter out the relevant parts of a text, make predictions on its attributes and cluster it by relevance. For our task we took the "Reviews and ratings of airports 2015" dataset from the airlinequality.com website. With this dataset we tried to cluster the text to find meaningful information and, based on that, to predict whether the person recommended the given airport or not. The dataset contained some basic information about each reviewer (country of origin, date and purpose of the flight), the review itself, and ratings for different aspects of the airport such as "wifi" or "shopping".
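As a rough illustration, loading such a dataset might look like the sketch below. The file name and column names ("content", "recommended") are assumptions for the example, not necessarily the actual ones from the project.

```python
import pandas as pd

# Hypothetical file/column names -- the real dataset from airlinequality.com
# may use different ones.
reviews = pd.read_csv("airport_reviews_2015.csv")

# The review text is the input; "recommended" (yes/no) is the label to predict.
texts = reviews["content"].fillna("")
labels = reviews["recommended"]

print(reviews.shape)
print(labels.value_counts())
```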
How did we do it?
In this project, three different approaches were tested: K-means, a random forest classifier and a multilayer perceptron. For the pre-processing step we used stopword removal and a TF-IDF transformation. We also tested different solvers for the multilayer perceptron. In the end we managed to achieve around 70% accuracy.
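A minimal scikit-learn sketch of that pipeline is shown below, reusing the `texts` and `labels` from the loading example above. The exact parameters (number of trees, solver, number of clusters) are illustrative assumptions, not the values used in the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

# Hold out part of the data for evaluation.
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

# Stopword removal + TF-IDF transformation in one step.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(X_train_txt)
X_test = vectorizer.transform(X_test_txt)

# Unsupervised clustering of the TF-IDF vectors (number of clusters is a guess).
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X_train)

# Supervised classifiers for the "recommended" label.
for name, clf in [
    ("Random forest", RandomForestClassifier(n_estimators=100, random_state=42)),
    ("MLP (adam solver)", MLPClassifier(solver="adam", max_iter=300, random_state=42)),
]:
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: accuracy = {acc:.3f}")
```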
Looking back at the project, there are a lot of things we could improve upon. A more advanced word-embedding technique could be used for pre-processing, and it would also be interesting to test more advanced classifiers such as LSTMs. However, this assignment marks the beginning of my machine learning career, when I knew nothing about the techniques and possibilities within the field. For this reason I hold this project very dear to my heart :)
What are the results?
At the time, in our opinion the overall performance of the multilayer perceptron was acceptable. From the different test cases we managed to find the activation function and solver that gave the highest accuracy, and in general the multilayer perceptrons produced respectable accuracy. We also discovered an interesting relation between stopword removal and accuracy: removing stopwords extracted more useful features for predicting positive reviews, but lowered the accuracy on negative ones. These effects could be investigated further in future work using more sophisticated approaches such as convolutional or recurrent neural networks, or unsupervised learning.
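A comparison over activation functions and solvers could be run with a small grid like the one below, reusing the TF-IDF features from the earlier sketch. The grid values are illustrative; the project's actual search may have differed.

```python
from itertools import product
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Try every combination of activation function and solver (illustrative grid).
for activation, solver in product(["relu", "tanh", "logistic"], ["adam", "lbfgs", "sgd"]):
    mlp = MLPClassifier(activation=activation, solver=solver,
                        max_iter=300, random_state=42)
    mlp.fit(X_train, y_train)
    acc = accuracy_score(y_test, mlp.predict(X_test))
    print(f"activation={activation:8s} solver={solver:5s} accuracy={acc:.3f}")
```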
The full project repository can be found here.