Twitter Scraping

A while back, I decided to build a little Twitter scraper that collects tweets in real time (with or without a keyword filter) and stores them in a .json file. Below I share my process and code.

The scraper is built on the tweepy and json Python libraries. The imports are given below.

import tweepy
import json
import time
from tweepy import StreamListener
from tweepy import OAuthHandler
from tweepy import API

Now you need to go back to Twitter and find the access keys and tokens associated with your Twitter developer account. You can find detailed directions here. You will end up on a page from which you can copy your keys.

Once we have the credentials, we assign them to appropriately named variables.

# Replace with your Twitter credentials
consumer_key = "xxxxxxxxxxxxxxxxxx"
consumer_secret = "xxxxxxxxxxxxxxxxxx"
access_token_key = "xxxxxxxxxxxxxxxxxx"
access_token_secret = "xxxxxxxxxxxxxxxxxx"

Now initialize your scraper by passing the OAuth details to tweepy's OAuth handler. OAuth is an authorization protocol that lets third-party applications access and use your Twitter account without making you share your password. If you want to know more, go here.

# Pass OAuth details to tweepy's OAuth handler
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token_key, access_token_secret)
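
Before moving on, it can help to confirm that the credentials actually work. Below is a minimal sanity check of my own (not part of the original scraper), assuming tweepy 3.x, where verify_credentials() returns the authenticated user, or False if authentication fails:

# Optional sanity check: confirm the credentials are valid
# (a sketch, assuming tweepy 3.x)
me = tweepy.API(auth).verify_credentials()
if me:
    print("Authenticated as @%s" % me.screen_name)
else:
    print("Authentication failed, check your keys and tokens")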

Now we will create a stream listener class, as given below. A listener handles the tweets and other messages received from the Twitter stream. Copy the following script.

# Creating a new listener class (as taught by Alex Hanna)
from tweepy import StreamListener
class SListener(StreamListener):
    def __init__(self, api = None, fprefix = 'streamer'):
        self.api = api or API()
        self.counter = 0        # tweets written to the current output file
        self.fprefix = fprefix  # prefix for the output filename
        self.output = open('%s_%s.json' % (self.fprefix, time.strftime('%Y%m%d-%H%M%S')), 'w')
 
 
    def on_data(self, data):
        # Dispatch each raw message from the stream by its type
        if 'in_reply_to_status' in data:    # a regular tweet
            self.on_status(data)
        elif 'delete' in data:              # a status deletion notice
            delete = json.loads(data)['delete']['status']
            if self.on_delete(delete['id'], delete['user_id']) is False:
                return False
        elif 'limit' in data:               # a rate limitation notice
            if self.on_limit(json.loads(data)['limit']['track']) is False:
                return False
        elif 'warning' in data:             # a stall warning
            warning = json.loads(data)['warning']
            print("WARNING: %s" % warning['message'])
            return
 
 
    def on_status(self, status):
        # Write one JSON object per line, rolling over to a new
        # file every 20,000 tweets
        self.output.write(status.strip() + '\n')
        self.counter += 1
        if self.counter >= 20000:
            self.output.close()
            self.output = open('%s_%s.json' % (self.fprefix, time.strftime('%Y%m%d-%H%M%S')), 'w')
            self.counter = 0
        return
 
 
    def on_delete(self, status_id, user_id):
        print("Delete notice")
        return
 
 
    def on_limit(self, track):
        print("WARNING: Limitation notice received, tweets missed: %d" % track)
        return
 
 
    def on_error(self, status_code):
        print('Encountered error with status code:', status_code)
        return
 
 
    def on_timeout(self):
        print("Timeout, sleeping for 60 seconds...")
        time.sleep(60)
        return

Now, let’s create the API object using the OAuth credentials saved in auth, and initialize the stream listener class that we prepared.

api = tweepy.API(auth)
 
# Initialize Stream listener
l = SListener()

We can then create the Stream object, using our OAuth handler object and listener object.

# Create your Stream object with authentication
stream = tweepy.Stream(auth, l)

Finally, this scraper can collect tweets in two ways. If I want to scrape without searching for a particular keyword, I can use sample() to capture a random sample of everything being tweeted in real time.

stream.sample()
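
Note that sample() blocks and keeps streaming until the connection drops or the process stops. If you want to stop the stream cleanly, one option (my own sketch, assuming an interactive run you can interrupt with Ctrl+C) is to wrap the call:

# Stop the stream cleanly on Ctrl+C (a sketch, not part of the original scraper)
try:
    stream.sample()
except KeyboardInterrupt:
    stream.disconnect()  # close the connection to Twitter
    l.output.close()     # flush and close the current output file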

And if I want to look for a specific keyword, I use a filter to sift through the stream and keep only the tweets that mention that keyword, e.g. covid.

# Filter Twitter Streams to capture data by the keywords:
stream.filter(track=['covid'])
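
Both sample() and filter() keep writing tweets to the timestamped .json file until stopped. Since on_status above writes one JSON object per line, you can load the saved tweets back with plain json. Below is a minimal sketch; the filename is hypothetical, as yours will carry the timestamp of your run:

import json

tweets = []
with open('streamer_20200101-000000.json') as infile:  # hypothetical filename
    for line in infile:
        line = line.strip()
        if line:  # skip any blank lines
            tweets.append(json.loads(line))

print('%d tweets loaded' % len(tweets))
print(tweets[0]['text'])  # text of the first tweet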

We get a stream of tweets flowing into our .json file in real time!


Geetanjali Bihani
PhD Candidate