Starting a new side project
I'm a data engineer by day, typically spending my time creating and supporting ETL-type jobs in both SQL Server and Hadoop environments. I finally found a pet project to work on that will let me stretch some skills, play with stuff I don't have time for at work, and show off a bit.
One of the biggest problems with this search was finding something that simultaneously entertained me, used and expanded the skills I already have, and actually involved big data.
Twitter Sentiment Has Been Done Before
It's true - people do it all the time, in a ton of different ways. So, I knew I didn't want to do straight analysis of a single tweet, or even just a stream of tweets individually.
Instead, I'm going to segment the tweets by topic and track how the sentiment on each topic changes over time. Then, I'm going to collect replies to those tweets and see how the public's feelings react to them. I'm doing this, initially, on a very well-known, controversial public figure who uses Twitter quite a bit, but I plan to make it generic enough that I can apply it to just about anyone with a sufficient amount of data.
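To make the "by topic, over time" idea concrete, here is a minimal sketch in plain Python. It assumes each tweet has already been given a topic label and a numeric sentiment score by some earlier step, neither of which is decided yet, and it buckets by calendar week purely for illustration. The function names are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean
from typing import Dict, Iterable, Tuple


def week_start(day: date) -> date:
    """Return the Monday of the week containing `day`."""
    return day - timedelta(days=day.weekday())


def sentiment_by_topic_and_week(
    rows: Iterable[Tuple[str, date, float]]
) -> Dict[Tuple[str, date], float]:
    """Average the sentiment scores for each (topic, week) bucket.

    Each row is (topic, posted_on, score). The topic labels and scores
    are assumed inputs; how they get produced is a separate problem.
    """
    buckets = defaultdict(list)
    for topic, posted_on, score in rows:
        buckets[(topic, week_start(posted_on))].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}
```

Comparing the averages for one topic across consecutive weeks gives the change in sentiment I'm after. The real version would run in Spark, but the grouping logic should be the same.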
I've also had the idea to do this with other public data, namely presidential speeches and US Supreme Court Opinions. I think that may come at a future time though, as it might require a little more manual data sorting.
Data Structure
Using Twitter's API responses as a baseline, and with the plans I have for this data, I think I need to store the following items for each tweet I get:
- Tweet Id
- Posted datetime
- Author
- Raw Text
- URL
- Retweet Count
- Hashtags (as an array)
- in_reply_to_status_id - the Tweet Id that this is a reply to
- in_reply_to_screen_name
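The field list above could be sketched as a Python dataclass. This is only a rough draft of the record shape: the class name, attribute names, types, and defaults are my own assumptions, not something dictated by Twitter's API.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class TweetRecord:
    """One stored tweet; names and types here are illustrative assumptions."""
    tweet_id: int
    posted_at: datetime
    author: str
    raw_text: str
    url: str
    retweet_count: int = 0
    hashtags: List[str] = field(default_factory=list)
    # Both reply fields stay None when the tweet is not a reply.
    in_reply_to_status_id: Optional[int] = None
    in_reply_to_screen_name: Optional[str] = None

    @property
    def is_reply(self) -> bool:
        """True when this tweet replies to another tweet."""
        return self.in_reply_to_status_id is not None
```

Keeping the two reply fields optional lets original tweets and replies share one record type, which should make it simpler to join replies back to their parent tweets later.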
In Spark, I'll generate the sentiment scores and other analysis, then join everything together using the Spark SQL APIs to create the final dataset.
Pulling this data is going to take quite some time. I found a Twitter dataset to use as an initial download, but it doesn't include the replies, so I'll need to start writing code to fetch those. Since the public Twitter API only makes the last 7 days available, I'll first build the skeleton of what I want to do, then look at either spending a little money to access the paid API or finding another way to get more historical data.
The plan is to pull all of the tweets I need, plus the replies to that user, from the API on a periodic basis, making sure the pulls overlap, then deduplicate and consolidate edits and whatnot in a later step.
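The deduplication step might look something like the sketch below. It assumes each pulled tweet is a dict carrying a `tweet_id` plus a `fetched_at` value recorded at pull time. The `fetched_at` field and the `consolidate` function are hypothetical names of mine, and "keep the most recently fetched copy" is just one possible consolidation rule.

```python
from typing import Dict, Iterable, List


def consolidate(batches: Iterable[List[dict]]) -> List[dict]:
    """Merge overlapping pulls, keeping the newest copy of each tweet.

    `fetched_at` is an assumed field; any values that compare correctly
    (for example ISO-8601 UTC strings in one format) will work.
    """
    latest: Dict[int, dict] = {}
    for batch in batches:
        for tweet in batch:
            seen = latest.get(tweet["tweet_id"])
            # A later fetch replaces an earlier one, so changed counts win.
            if seen is None or tweet["fetched_at"] >= seen["fetched_at"]:
                latest[tweet["tweet_id"]] = tweet
    return sorted(latest.values(), key=lambda t: t["tweet_id"])
```

Because the pulls overlap on purpose, the same tweet will show up in more than one batch. Keying on the Tweet Id collapses those copies, and preferring the later fetch means things like an updated retweet count are carried forward.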