Jordan Wright, Olabode Anise, Duo Labs
Social media is a great way to have genuine conversations online, but the sphere is getting filled with bots, spam and attackers.
Not all bots on Twitter are malicious - they could be giving us automated data on earthquakes, git updates, etc. So the research focused on finding bots and then figuring out whether they were malicious.
The goal here is to build a classifier, one that could learn and adapt.
They wanted their research to be reproducible, so they used the official Twitter APIs - though that meant they were rate limited and needed to be as efficient as possible. Working within those limits, they were able to perform about 8.6 million account lookups per day.
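As a rough check on that figure, here is a small arithmetic sketch. The specific limits are assumptions on my part, not something stated in the talk notes: 900 requests per 15-minute window and 100 accounts per lookup request. Under those assumed limits the number works out to roughly the 8.6 million quoted above.

```python
# Assumed limits (not from the talk notes): 900 requests per 15-minute
# window, 100 accounts per lookup request.
MINUTES_PER_DAY = 24 * 60


def lookups_per_day(requests_per_window: int,
                    accounts_per_request: int,
                    window_minutes: int = 15) -> int:
    """Return how many accounts can be looked up in one day."""
    windows_per_day = MINUTES_PER_DAY // window_minutes
    return windows_per_day * requests_per_window * accounts_per_request


if __name__ == "__main__":
    print(lookups_per_day(900, 100))
```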
Twitter's account IDs started as sequential 32-bit unsigned integers, so the researchers started with a random 5% sample of that ID space. The dataset has gaps - closed accounts, etc. They noticed the account IDs went up to very large numbers, and those accounts were created up to 2016. But Twitter then changed to using "Snowflake IDs" - generated by workers, in the same format as other Twitter IDs (tweets, etc).
The Snowflake ID is 63 bits: it starts with a timestamp (41 bits), then a worker number (10 bits), then a sequence number (12 bits). It is very hard to guess these numbers. So, they used the streaming API, which returns a random sample of public statuses (each contains the full user object).
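To make the bit layout concrete, here is a minimal sketch that packs and unpacks an ID using the 41/10/12 split described above. The epoch constant (Twitter's custom epoch, in milliseconds) is not from the talk notes, and the function names are my own. The sketch also hints at why guessing is impractical: you would need the right millisecond, worker and sequence all at once.

```python
from datetime import datetime, timezone

# Twitter's custom epoch in milliseconds (not from the talk notes).
TWITTER_EPOCH_MS = 1288834974657

WORKER_BITS = 10
SEQUENCE_BITS = 12


def encode_snowflake(timestamp_ms: int, worker: int, sequence: int) -> int:
    """Pack the three fields into one integer (illustrative helper)."""
    offset = timestamp_ms - TWITTER_EPOCH_MS
    return (offset << (WORKER_BITS + SEQUENCE_BITS)) | (worker << SEQUENCE_BITS) | sequence


def decode_snowflake(snowflake_id: int) -> dict:
    """Split an ID into timestamp, worker number and sequence number."""
    sequence = snowflake_id & ((1 << SEQUENCE_BITS) - 1)              # low 12 bits
    worker = (snowflake_id >> SEQUENCE_BITS) & ((1 << WORKER_BITS) - 1)  # next 10 bits
    timestamp_ms = (snowflake_id >> (WORKER_BITS + SEQUENCE_BITS)) + TWITTER_EPOCH_MS
    return {
        "timestamp_ms": timestamp_ms,
        "created_at": datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc),
        "worker": worker,
        "sequence": sequence,
    }


if __name__ == "__main__":
    sample = encode_snowflake(TWITTER_EPOCH_MS + 1000, worker=5, sequence=7)
    print(decode_snowflake(sample))
```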
Now - they have a giant dataset :-)
They looked at the last 200 tweets for accounts that had more than 10 tweets and declared English as their language, and then they fetched the original tweets. This data was hard to get - they could only do 1400 requests/day.
They took the approach of starting from known bots and discovering the bot nets they were attached to.
The data they have includes account attributes (how many tweets, are they followed, are they in lists, etc), tweet content (lots of links?), and frequency of tweets.
They examined the entropy of the user name - was it fairly random? Probably a bot. Same for lots of numbers at the beginning or end. They also watched the ratio of followers to following and the number of tweets.
They applied heuristics to the content - like number of hashtags in tweets, number of URLs (could be a bot or a news agency!), number of users @ replied. On behavior - look at how long it takes to reply or retweet, and the unique set of users retweeted. Genuine users would go quiet for periods (like when sleeping).
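The account-level signals above could be computed along these lines. This is only a sketch: the function names are hypothetical, the entropy is character-level Shannon entropy (the talk notes do not say which measure was used), and no thresholds are implied.

```python
import math
import re
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits; higher means more random-looking."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def leading_trailing_digits(name: str) -> tuple:
    """Count the digits at the start and at the end of a screen name."""
    leading = len(re.match(r"\d*", name).group())
    trailing = len(re.search(r"\d*$", name).group())
    return leading, trailing


def follower_ratio(followers: int, following: int) -> float:
    """Followers divided by following, guarding against division by zero."""
    return followers / max(following, 1)


if __name__ == "__main__":
    for name in ("jordan", "xk3j9qz8r2", "amy19840312"):
        print(name, round(shannon_entropy(name), 2), leading_trailing_digits(name))
```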
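A rough sketch of the content and behavior features, assuming tweets are available as plain strings and posting times as datetime objects. The regular expressions are simplistic stand-ins of my own, not the researchers' actual feature extraction.

```python
import re
from datetime import datetime, timedelta

# Deliberately simple patterns; real tweet parsing is messier.
HASHTAG = re.compile(r"#\w+")
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")


def content_features(tweets: list) -> dict:
    """Average hashtags and URLs per tweet, plus the number of distinct users mentioned."""
    n = max(len(tweets), 1)
    return {
        "hashtags_per_tweet": sum(len(HASHTAG.findall(t)) for t in tweets) / n,
        "urls_per_tweet": sum(len(URL.findall(t)) for t in tweets) / n,
        "unique_mentions": len({m.lower() for t in tweets for m in MENTION.findall(t)}),
    }


def longest_quiet_gap(timestamps: list) -> timedelta:
    """Longest pause between consecutive tweets; genuine users tend to have long ones."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return max(gaps, default=timedelta(0))


if __name__ == "__main__":
    times = [datetime(2018, 8, 1, 0), datetime(2018, 8, 1, 1), datetime(2018, 8, 1, 9)]
    print(longest_quiet_gap(times))
```

An account whose longest gap over several days is only a few minutes never seems to sleep, which is the kind of signal the behavior heuristics look for.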
Then we got a Data Science 101 primer :-)
This is where it gets complicated and statistics come into play, along with the reminder that your model is only as good as your data. For example, when they trained with the cryptocurrency bots, the model found 80% of the other spam bots. When reversed, it only caught about 50% of the cryptocurrency bots.
Cryptocurrency giveaway accounts are very problematic - they look legitimate, and they will take your "deposit" and then you will lose your money. They were hard to find, until the researchers realized that there are legitimate accounts out there that have many bots following them. Find those legitimate accounts, then you can find the bots. They also followed "like" behavior, which they used to map relationships. They found mesh and hub/spoke networks, but they were connected by likes.
They also discovered verified accounts that had been taken over and then modified to look like a more active account (like Elon Musk), which adds legitimacy to the cryptocurrency spam.
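The "find the accounts that many bots follow, then look at their other followers" idea might look something like this. It is a sketch only: the dictionary input, the function names and the min_bots threshold are all hypothetical, and candidates would still need to be classified before being called bots.

```python
from collections import Counter


def accounts_followed_by_many_bots(bot_following: dict, min_bots: int = 3) -> dict:
    """Map each account to the number of known bots following it, keeping popular ones.

    bot_following maps a known bot's name to the accounts it follows.
    """
    counts = Counter()
    for followed in bot_following.values():
        counts.update(set(followed))  # count each bot at most once per account
    return {account: n for account, n in counts.items() if n >= min_bots}


def candidate_bots(followers_of_hub: list, known_bots: set) -> set:
    """Followers of a hub account that are not already known bots."""
    return set(followers_of_hub) - set(known_bots)


if __name__ == "__main__":
    following = {
        "bot1": ["hub", "other"],
        "bot2": ["hub"],
        "bot3": ["hub", "other"],
    }
    print(accounts_followed_by_many_bots(following))
    print(candidate_bots(["bot1", "bot9", "alice"], set(following)))
```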
Very interesting research!