THE ERRANT ANALYST
Updated 796 days ago
Comments are collected twice daily through the Reddit API, and the analysis results are updated after each market close. Pre-processing handles duplicates, bot comments and daily summary posts, and applies regular expressions to improve both the identification of ticker mentions and the performance of the machine learning libraries. NLTK, TextBlob and spaCy are used to remove stop words, tokenise, lemmatise and generate sentiment and subjectivity scores. Comments are vectorised with TF-IDF. So far, clustering combined with dimensionality reduction has worked best with three clusters when subjectivity is left out of the features.
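The pre-processing step described above could look roughly like the sketch below. It is not the site's actual code: the regex, the stop list of non-ticker words, the bot account names and the comment fields (`author`, `body`) are all assumptions made for illustration, and it uses only the Python standard library.

```python
import re

# Hypothetical pattern: 2-5 capital letters, optionally prefixed with "$".
# The real pipeline's regex is not published, so this is an assumption.
TICKER_RE = re.compile(r"(?<![A-Za-z0-9$])\$?([A-Z]{2,5})(?![A-Za-z0-9])")
NOT_TICKERS = {"CEO", "USA", "IMO", "YOLO"}  # assumed list of false positives
BOT_AUTHORS = {"AutoModerator"}              # assumed bot account names


def extract_tickers(text):
    """Return unique ticker-like tokens in order of first appearance."""
    found = []
    for match in TICKER_RE.finditer(text):
        symbol = match.group(1)
        if symbol not in NOT_TICKERS and symbol not in found:
            found.append(symbol)
    return found


def preprocess(comments):
    """Drop bot comments and duplicates, then attach ticker mentions."""
    cleaned, seen_bodies = [], set()
    for comment in comments:
        body = " ".join(comment["body"].split())  # normalise whitespace
        key = body.lower()                        # case-insensitive duplicate check
        if comment["author"] in BOT_AUTHORS or key in seen_bodies:
            continue
        seen_bodies.add(key)
        cleaned.append({
            "author": comment["author"],
            "body": body,
            "tickers": extract_tickers(body),
        })
    return cleaned


sample = [
    {"author": "alice", "body": "Buying $GME and AMC today, YOLO"},
    {"author": "AutoModerator", "body": "Daily Discussion Thread for TSLA"},
    {"author": "bob", "body": "buying $gme and amc today,  yolo"},
    {"author": "carol", "body": "The CEO of AAPL spoke"},
]
result = preprocess(sample)
```

With this sample, the bot post and the case-insensitive duplicate are dropped, leaving two comments tagged with their ticker mentions. In a fuller pipeline the cleaned bodies would then go on to tokenisation, lemmatisation, sentiment scoring and TF-IDF vectorisation.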