Scraping Russian Twitter Trolls With Python, Neo4j, and GraphQL
12 Nov 2017
Last week, as a result of the House Intelligence Select Committee investigation, Twitter released the screen names of 2752 Twitter accounts tied to Russia’s Internet Research Agency that were involved in spreading fake news, presumably with the goal of influencing the 2016 election. In this post we explore how to scrape tweets from cached versions of these accounts’ profile pages, import them into Neo4j for analysis, and build a simple GraphQL API exposing the data.
Russian Twitter Trolls
While Twitter released the screen names and user IDs of these accounts, they did not release any data (such as tweets or follower network information) associated with the accounts. In fact, Twitter has suspended these accounts, which means their tweets have been removed from Twitter.com and are no longer accessible through the Twitter API. Analyzing the tweets made by these accounts is the first step in understanding how social media accounts run by Russia may have been used to influence the US election. So our first step is simply to find potential sources for the data.
Internet Archive
Internet Archive is a non-profit library that provides cached versions of some websites: snapshots of a webpage at a given point in time that can be viewed later. One option for obtaining some of the Russian Troll tweets is to use Internet Archive to find any Twitter user pages that it may have cached.
For example, if we visit web.archive.org/web/2017081… we can see the Twitter page for @TEN_GOP, one of the Russian Troll accounts, which was designed to look like an account associated with the Tennessee Republican Party.
This snapshot page contains several of @TEN_GOP’s most recent tweets (before the snapshot was taken by Internet Archive).
Finding Available Cached Pages
Using the screen names provided by the House Intelligence Committee, we can use Internet Archive’s Wayback API to see if each user’s Twitter profile page was cached by Internet Archive at any point in time. We’ll write a simple Python script that iterates through the list of Russian Troll Twitter accounts, checking the Wayback API for any available cached pages. We do this by making a request to http://archive.org/wayback/available?url=http://twitter.com/TWITTER_SCREEN_NAME_HERE, which returns the URL and timestamp of any cache that was made, if one exists.
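A minimal sketch of that script (the troll_handles.txt input filename and the CSV layout are illustrative; twitter_handle_urls.csv is the file referenced below):

```python
import csv
import requests

WAYBACK_API = "http://archive.org/wayback/available"

# one screen name per line; the filename is an assumption
with open("troll_handles.txt") as f:
    handles = [line.strip() for line in f if line.strip()]

rows = []
for handle in handles:
    # ask the Wayback API whether a cache exists for this user's Twitter page
    resp = requests.get(WAYBACK_API,
                        params={"url": "http://twitter.com/" + handle})
    snapshot = resp.json().get("archived_snapshots", {}).get("closest")
    if snapshot and snapshot.get("available"):
        rows.append([handle, snapshot["url"], snapshot["timestamp"]])

# save the cache URLs for the scraping step below
with open("twitter_handle_urls.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```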
With this, we end up with a file twitter_handle_urls.csv that contains a list of Internet Archive URLs for any of the Russian Troll accounts that were archived by Internet Archive. Unfortunately, we find only just over 100 Russian Troll accounts that were cached. This is just a tiny sample of the overall accounts, but we should still be able to scrape tweets for these 100 users.
Scraping Twitter Profile Pages
Now, we’re ready to scrape the HTML from the Internet Archive caches to extract all the tweet content that we can.
We’ll make use of the BeautifulSoup Python package to help us extract the tweet data from the HTML. First, we’ll use Chrome DevTools to inspect the structure of the HTML and see which elements contain the data we’re looking for:
Since the caches were taken at different times, the structure of the HTML may differ from snapshot to snapshot, so we’ll need to write code that can handle parsing these different formats. We’ve found two versions of the Twitter user page in the caches: one from around 2015, and one used around 2016-2017.
Here is the code for scraping the data for one of the versions. The full code is available here.
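A sketch of the fetching half of that code, using Python 3’s urllib.request; it relies on a parse_tweets helper, sketched after the next paragraph:

```python
import csv
import urllib.request

from bs4 import BeautifulSoup

# read the cache URLs found via the Wayback API in the previous step
with open("twitter_handle_urls.csv") as f:
    cache_urls = [row[1] for row in csv.reader(f)]

def scrape_cached_page(url):
    # fetch a cached profile page from Internet Archive and parse its tweets
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    return parse_tweets(soup)  # parse_tweets is sketched below
```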
BeautifulSoup allows us to select HTML elements by specifying attributes to match against. By inspecting the structure of the HTML page we can see which bits of the tweets are stored in which HTML elements, so we know which ones to grab with BeautifulSoup. We build up an array of tweet objects as we parse all the tweets on the page.
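Here is what that parsing might look like for the newer layout; the CSS classes and data attributes below are the ones visible in DevTools for the ~2016-2017 pages and should be treated as illustrative (the ~2015 layout needs different selectors):

```python
def parse_tweets(soup):
    # build up an array of tweet dicts as we parse all tweets on the page
    tweets = []
    for tweet_div in soup.find_all("div", class_="tweet"):
        text_el = tweet_div.find("p", class_="tweet-text")
        if text_el is None:
            continue
        time_el = tweet_div.find("span", class_="_timestamp")
        tweets.append({
            "tweet_id": tweet_div.get("data-tweet-id"),
            "screen_name": tweet_div.get("data-screen-name"),
            "time": time_el.get("data-time") if time_el else None,
            "text": text_el.get_text(),
            # hashtags and expanded URLs embedded in the tweet body
            "hashtags": [a.get_text()
                         for a in text_el.find_all("a", class_="twitter-hashtag")],
            "links": [a.get("data-expanded-url")
                      for a in text_el.find_all("a", class_="twitter-timeline-link")
                      if a.get("data-expanded-url")],
        })
    return tweets
```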
Once we’ve extracted the tweets, we write them to a JSON file.
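Pulling the pieces above together, we scrape each cached page and dump the results (tweets.json is an assumed filename, reused in the import step below):

```python
import json

# scrape every cached page, collecting all parsed tweets
tweets = []
for url in cache_urls:
    tweets.extend(scrape_cached_page(url))

# write tweets to file
with open("tweets.json", "w") as f:
    json.dump(tweets, f, indent=2)
```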
We end up finding about 1500 tweets from 187 Twitter accounts. This is only a fraction of the tweets sent by the Russian Trolls, but it is still too much data to analyze by reading every tweet. We’ll make use of the Neo4j graph database to help us make sense of the data. Using Neo4j we’ll be able to ask questions such as “What hashtags are used together most frequently?” or “What are the domains of URLs shared in tweets that mention Trump?”
Importing Into Neo4j
Now that we have our scraped tweet data, we’re ready to insert it into Neo4j. We have several options for importing data into Neo4j; here, we’ll do our import by loading the JSON data and passing it as a parameter to a Cypher query, using the Python driver for Neo4j.
We’ll use a simple graph data model, treating Hashtags and Links as nodes in the graph, along with the Tweet and the User who posted it.
[Figure: data model — User, Tweet, Hashtag, and Link nodes]
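A sketch of that import using the 1.x Python driver (neo4j.v1); the Cypher statement follows the data model above, while the connection credentials and exact property names are assumptions:

```python
import json

from neo4j.v1 import GraphDatabase  # neo4j-driver 1.x, current at the time

import_query = """
UNWIND $tweets AS tweet
MERGE (t:Tweet {tweet_id: tweet.tweet_id})
SET t.text = tweet.text, t.time = tweet.time
MERGE (u:User {screen_name: tweet.screen_name})
MERGE (u)-[:POSTED]->(t)
FOREACH (tag IN tweet.hashtags |
  MERGE (h:Hashtag {tag: tag})
  MERGE (t)-[:HAS_TAG]->(h))
FOREACH (url IN tweet.links |
  MERGE (l:Link {url: url})
  MERGE (t)-[:HAS_LINK]->(l))
"""

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "letmein"))  # credentials assumed

with open("tweets.json") as f:
    tweets = json.load(f)

# pass the whole array of tweet objects as a single query parameter
with driver.session() as session:
    session.run(import_query, tweets=tweets)
```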
Graph Queries
Now that we have the data in Neo4j we can write queries to help make sense of what the Russian Trolls were tweeting about.
Interesting Queries
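Here are a few, sketched against the labels and properties used in the import above:

```cypher
// Tweets for @TEN_GOP
MATCH (u:User {screen_name: "TEN_GOP"})-[:POSTED]->(t:Tweet)
RETURN t.text AS tweet;

// What are the most common hashtags
MATCH (h:Hashtag)<-[:HAS_TAG]-(t:Tweet)
RETURN h.tag AS hashtag, count(t) AS num
ORDER BY num DESC LIMIT 10;

// What hashtags are used together most frequently
MATCH (h1:Hashtag)<-[:HAS_TAG]-(t:Tweet)-[:HAS_TAG]->(h2:Hashtag)
WHERE id(h1) < id(h2)
RETURN h1.tag, h2.tag, count(t) AS num
ORDER BY num DESC LIMIT 10;

// Most common domains shared in tweets
MATCH (t:Tweet)-[:HAS_LINK]->(l:Link)
// crude domain extraction: strip the scheme, keep the host segment
WITH split(replace(replace(l.url, "https://", ""), "http://", ""), "/")[0] AS domain
RETURN domain, count(*) AS num
ORDER BY num DESC LIMIT 10;
```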
GraphQL API
In addition to querying Neo4j using Cypher directly, we can also take advantage of the neo4j-graphql integrations to easily build a GraphQL API for our tweets.
First, we define a GraphQL schema.
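A schema along these lines fits the graph data model above; the exact field names are assumptions:

```graphql
type Tweet {
  tweet_id: ID!
  text: String
  time: String
  user: User
  hashtags: [Hashtag]
  links: [Link]
}

type User {
  screen_name: String
  tweets: [Tweet]
}

type Hashtag {
  tag: ID!
  tweets(first: Int): [Tweet]
}

type Link {
  url: String
}

type Query {
  Hashtag(tag: ID, first: Int, offset: Int): [Hashtag]
}
```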
Our GraphQL schema defines the types and fields available in the data, as well as the entry points for our GraphQL service. In this case we have a single entry point, Hashtag, allowing us to search for tweets by hashtag.
With the neo4j-graphql-js integration, the GraphQL schema maps to the graph database model, and arbitrary GraphQL queries are translated into Cypher, allowing anyone to query the data through the GraphQL API without writing Cypher.
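For example, a query like this one (the hashtag value and field selections are illustrative) fetches tweets for a given hashtag:

```graphql
{
  Hashtag(tag: "crime") {
    tag
    tweets(first: 5) {
      text
      user {
        screen_name
      }
    }
  }
}
```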
Implementing the GraphQL server is simply a matter of passing the GraphQL query to the integration function in the resolver:
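A sketch of that resolver; neo4jgraphql generates a single Cypher query from the incoming GraphQL query and returns its results:

```javascript
import { neo4jgraphql } from 'neo4j-graphql-js';

const resolvers = {
  Query: {
    // delegate to neo4j-graphql-js, which translates the GraphQL
    // query into one Cypher query against Neo4j
    Hashtag(object, params, ctx, resolveInfo) {
      return neo4jgraphql(object, params, ctx, resolveInfo);
    }
  }
};

export default resolvers;
```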
React App
One of the advantages of having a GraphQL API is that it makes it very easy to build web and mobile applications that consume the GraphQL service. To make the data easily searchable, we’ve made a simple React web app that allows for searching tweets in Neo4j by hashtag.
Here we’re searching for tweets that contain the hashtag #crime. We can see that a Russian Troll account, @OnlineCleveland, is tweeting fake news about crimes in Ohio, making it seem that more crime is occurring in Cleveland. Why would a Russian Troll account be tweeting about crime in Cleveland leading up to the election? Typically, when voters want a “tough on crime” politician elected, they vote Republican…
In this post we’ve scraped tweet data from Internet Archive, imported it into Neo4j for analysis, built a GraphQL API for exposing the data, and built a simple GRANDstack app that allows anyone to easily search the tweets by hashtag.
While we were only able to find a small fraction of the tweets posted by the Russian Twitter Troll accounts, we will continue to explore options for finding more of the data ;-)
All code is available on Github at github.com/johnymontan….