A Quest For Data and The Adventure It Took Me On

Those who know me know that I love messing with data. For some time now, I've operated several collection nodes gathering everything from pastebin scrapes and Twitter "stuff" to open systems on the net, and more.

Over the past weekend, I decided that I wanted to collect data from Twitter via specified keywords. Having done this numerous times before, I decided things needed to be done a bit differently. This time, I was actually going to see my data. I'll try to make this as quick as possible.


The Problem

Perhaps a big problem for someone like me: I could not actually see what my data looked like. I see dashboards and the like posted all the time, and part of me always thought, "I wish I could have that."

Why the hell couldn't I? So I started my project: my project to finally see my work.

Building

Building a Python script to ingest data from Twitter is easy. There are probably thousands of blog posts, talks, and presentations about it. I'm not going to bore you to death with "building a Twitter bot"; instead, I'll discuss the overall flow and idea.

My vision was to ingest data based on a keyword and store it in a DB for later analysis. The tweepy library allows you to set up a stream listener and specify a few filter parameters. Perfect.

For example, one can do the following in Python:

from tweepy import Stream

twitterStream = Stream(auth, listen())  # auth is an OAuthHandler; listen() is a StreamListener subclass
twitterStream.filter(track=['#hacked'], languages=['en'])

That particular filter will yield posts in English using the '#hacked' hashtag. Somewhat useful for stuff. Again, that's just an example. From there, I set up a thread that would insert a variety of parameters into a database. The concept seemed pretty solid and it worked well. Heck, I even found a decent dashboard system that was easy to put behind a reverse proxy and enjoy. Data collected from tweets included username, uuid, location (if specified), sentiment, and when the tweet was observed. All was fine in ProToxin's kingdom until the flood happened.
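For the curious, the ingest side boiled down to roughly this. Consider it a simplified sketch: the SQLite table, the column layout, and the TextBlob-based sentiment scoring are stand-ins for illustration, not necessarily what I actually ran.

import sqlite3
import tweepy
from textblob import TextBlob  # sentiment scorer used here purely as an example

conn = sqlite3.connect('tweets.db')
conn.execute("""CREATE TABLE IF NOT EXISTS tweets
                (username TEXT, tweet_id TEXT, location TEXT,
                 sentiment REAL, observed_at TEXT)""")

class listen(tweepy.StreamListener):
    def on_status(self, status):
        # grab the handful of fields worth keeping and write them out
        conn.execute("INSERT INTO tweets VALUES (?, ?, ?, ?, ?)",
                     (status.user.screen_name,
                      status.id_str,
                      status.user.location,
                      TextBlob(status.text).sentiment.polarity,
                      str(status.created_at)))
        conn.commit()

    def on_error(self, status_code):
        return False  # disconnect on errors such as 420 (rate limited)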

Data Strikes Back

Data is great. A lot of data is even better. A monumental amount of data, however, becomes a challenge. This brilliant author decided that it would be a good idea to monitor for the keyword "trump". That was a mistake. Why? Well, after only being able to process about 25 posts/sec of just English tweets, Python would eventually crash because it could not keep up with the data coming in.

Threads in Python are great until you need true asynchronous tasks and queues, plus the ability to handle A LOT of them. In come Celery and RabbitMQ. Using the two allowed me to create a task for each tweet that handled the database transaction. After a few hours of pleading, crying, and possibly killing my Keurig, it worked! I was able to readily stream tweets, process database transactions, and store tweets.
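A stripped-down version of that idea looks something like this; the module name, broker URL, and table are placeholders for illustration rather than my exact setup:

# tasks.py
from celery import Celery
import sqlite3

app = Celery('tasks', broker='amqp://guest:guest@localhost//')

@app.task
def store_tweet(username, tweet_id, location, sentiment, observed_at):
    # each tweet becomes its own task, so the stream listener never blocks on the DB
    conn = sqlite3.connect('tweets.db')
    conn.execute("INSERT INTO tweets VALUES (?, ?, ?, ?, ?)",
                 (username, tweet_id, location, sentiment, observed_at))
    conn.commit()
    conn.close()

The listener then just calls store_tweet.delay(...) for each status and goes right back to reading the stream, while a pool of Celery workers (started with celery -A tasks worker) drains the RabbitMQ queue at its own pace.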

It worked:

On a side note, y'all need to calm down on the Trump talk ;) ~33 posts/sec on free API access is WAYY TOO HIGH!!

Visualize the Data

Again, this is all awesome and dandy! The primary goal, however, was to see the data. After a few more hours of messing with dashboards, I finally got something going. It was basic and it only showed a few parameters. My vision, however, was slowly becoming a reality:

Cool, a dashboard was slowly coming to life. At this point, I was about done. People soon started asking me about the data and what I got out of it. Unfortunately, with the data visualization I had, I could not answer them. I wanted to be able to fetch more data and deliver those answers. After a bit more time and rearranging, I got something like this:

With this data, I was able to spot a handful of accounts leveraging bot apps, get the average sentiment score across the large set of posts, watch a small stream of posts, and see who the most active users were. While trump was the overall keyword used for this, my plan is to monitor several other topics. One example is aggregating several infosec feeds and seeing the "lay of the land" of what people are talking about. Here is an example of watching several infosec keywords on Twitter and putting them in the dashboard:
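As for how those numbers come out of the database, it's mostly simple aggregate queries over the stored tweets. Against the illustrative SQLite table from earlier, for example:

import sqlite3

conn = sqlite3.connect('tweets.db')

# overall average sentiment across every stored post
avg_sentiment, = conn.execute("SELECT AVG(sentiment) FROM tweets").fetchone()

# ten most active accounts by post count
top_users = conn.execute("SELECT username, COUNT(*) AS posts FROM tweets "
                         "GROUP BY username ORDER BY posts DESC LIMIT 10").fetchall()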

Real World Application

I was not expecting a massive ransomware outbreak in Ukraine (Petya/NotPetya/WHATEVER) on June 27th. That being said, a spike in posts was observed. With my collection, I saw reports of the outbreak around 9:30 EST, while the first Bitcoin transaction was observed around 8:40 EST (https://blockchain.info/address/1Mz7153HMuxXTuR2R1t78mGSdzaAtNbBWX). No complaints on that gap in time.

Here is a cumulative graph of mentions of "Petya" between 9:00 EST and 21:00 EST (June 27th):

Another visualization of "Mentions of Ransomware":

Sure, catching mentions is cool; however, with the ability to pull up posts and specify timelines, I was able to see what people were doing in terms of research and turn it into usable information for my day gig. I must say, it gives you a good feeling when you actually get to use your side projects to help people.

It's one thing to just make dashboards with data. It's another when you can actually use them to answer people's questions and fill a gap. That's probably one of my greatest takeaways from this project... there's a difference between a working dashboard and a dashboard that allows you to work.

End

See, I told you that wouldn't be too bad. I even had pictures! I'm planning on adding more sources, cross-correlating, and more. From code and backend-only work to supplying answers to questions... I'd say this is the beginning of a fun adventure :)


I'd like to thank Jonathan Nichols (@wvualphasoldier) for a ton of help on this project. He helped brainstorm some of the data collection as well as ideas for the sentiment analysis section....not to mention pushing me to actually do this. Thanks again, man. I owe you one.

And, as always, thank you for coming by and reading another quick post!

Cheers,
P.
