Getting a Sample of NFL related Tweets

Downloading NFL Tweets 

This blog post outlines how we will use Tweepy to build a dataset of NFL-related tweets. This is just one component of a larger project.

___________________________________________

Installing Tweepy

___________________________________________ 

Tweepy is an easy-to-use Python library for accessing the Twitter API. Its client classes expose the Twitter RESTful API methods; each method accepts various parameters and returns a response. Using the API, you can programmatically access things like tweets, spaces, and lists. 
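
As a quick illustration of what that looks like in practice, here is a minimal sketch that fetches a user's profile with an authenticated client. The bearer token is a placeholder, and authentication is covered in detail later in this post.

import tweepy

# Placeholder token; see the authentication section below for how to obtain one.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Look up a user by handle and print their display name.
user = client.get_user(username="NFL")
print(user.data.name)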

To install Tweepy, run the following command:

pip install tweepy

More information about other ways to install Tweepy can be found here.

___________________________________________

Setting up Twitter API Access

___________________________________________

The Twitter API can be used to programmatically retrieve and analyze Twitter data. To be able to call the API within the scripts we write, the first thing we need to do is set up a Twitter developer project. 

Setting up a Twitter Developer Project

To do so, follow these steps:

  1. In order to get started with the new Twitter API, you need a developer account. If you do not have one yet, you can sign up for one.
  2. Next, in the developer portal, create a new Project. Once the project is created, you will need to connect it to an App. An App is just a container for the API keys you need in order to make HTTP requests to the Twitter API.

Once you have created a developer project, your developer dashboard should look something like this. 

API Access Levels

The API has varying levels of access, where the level of access a user has dictates the amount they can use the API and the types of information they can access. 

At the time of writing, there are three levels of access for the API. I highlight some of the key differences below. 

  • Essential
    • Developer accounts come standard with this level of access
    • Retrieve up to 500k Tweets per month
    • Only supports app-only and user context authentication methods
  • Elevated: 
    • Retrieve up to 2 million Tweets per month
    • Supports all authentication methods 
  • Academic Research:
    • Retrieve up to 10 million Tweets per month
    • Access to full tweet archive & advanced filter operators
    • Supports all authentication methods 

For this project, we will request elevated access, which will enable us to collect more tweets for our training data. Academic Research access would be nice, but it is hard to get. API users can apply for elevated access within the Twitter developer portal.

___________________________________________

Using Tweepy to Download Tweets

___________________________________________ 

At this point, we have the correct access level and our project set up, so we are ready to use Tweepy to download tweets for our project. The completed extraction script can be found here, but we will walk through each step in greater detail below. As a disclaimer, this script does not do much error handling; you can modify it to handle errors more gracefully if you wish. 
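
For example, a minimal sketch of what that might look like, assuming Tweepy v4's exception classes, is to wrap the API call in a try/except block:

import tweepy as tw

try:
    # Placeholder token purely for illustration.
    client = tw.Client(bearer_token="YOUR_BEARER_TOKEN")
    response = client.search_recent_tweets("#nfl lang:en", max_results=10)
except tw.errors.TooManyRequests:
    # We hit the rate limit; back off and retry later.
    pass
except tw.errors.TweepyException as err:
    # Catch-all for other Tweepy/API errors.
    print(f"Request failed: {err}")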

Required Imports

For this script, we will import the following required packages: 

import tweepy as tw
import pandas as pd
import logging
import os
import datetime

 

Tweepy Authentication

Next, we need to point our script to the App we set up in the previous steps. To do this, we will use app-only authentication. You can find your BEARER_TOKEN within the Twitter developer portal. I stored this token in an environment variable so it can easily interact with GitHub Actions in the next step of this project. If you have never worked with environment variables before, this article provides a helpful overview of how to use them within Python scripts. 

If your BEARER_TOKEN is stored within your environment variables, you can access it within your script in the following way.  Additionally, we set up some basic logging here so we can see what the script is doing as it runs. 

BEARER_TOKEN = os.getenv("BEARER_TOKEN")

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
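
As a small addition that is not part of the original script, you could also fail fast if the token is missing, which makes a misconfigured environment easier to spot:

# Not in the original script: stop early if the token was not found in the environment.
if BEARER_TOKEN is None:
    raise RuntimeError("BEARER_TOKEN environment variable is not set.")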

 

Now, we need to initialize our Tweepy client. This client will authenticate our script using the BEARER_TOKEN. We will do this via a function titled initalize_client. You can also see we use the logger to confirm the client was set up correctly. This function returns an authenticated Tweepy client that will enable our script to interact with the API. 

def initalize_client() -> tw.Client:
    """Create a Tweepy client authenticated with the bearer token."""
    client = tw.Client(bearer_token=BEARER_TOKEN)
    logger.info(" Client authenticated and initialized.")
    return client

 

Searching Twitter

Now that our client has been set up, we will pass it to a function titled search. The signature for this function is def search(client: tw.Client, query: str, max_results: int, limit: int). We will briefly describe each of these inputs below. 

  • client: The authenticated Twitter client we will use to execute our query. This is created in the initalize_client function.
  • query: The Twitter search query. With elevated access, we have the ability to use some basic operators to filter the results of our query. 
  • max_results: Maximum number of tweets to return per page 
  • limit: Total number of tweets to return (number of pages to query = total number of tweets divided by max results per page). 

There are multiple pieces to our search function:

  • Paginator
  • Tweet filter 
  • Writing to dataframe

We will look at each of these components in more detail. 

Paginator

We will use the paginator to search recent tweets using our search criteria. Specifically, the lines of code look like this: 

tweets = tw.Paginator(
    client.search_recent_tweets,
    query,
    tweet_fields=["context_annotations", "created_at"],
    max_results=max_results,
).flatten(limit=limit)

 

We tell our client to hit the search recent tweets endpoint, passing a query that specifies what we want to search for. We limit our search to max_results tweets per page and collect limit tweets in total. Additionally, we pull in some extra information available for tweet objects (created_at, context_annotations). 

For this project, the query we pass to the paginator is #nfl -is:reply -is:retweet lang:en -has:media. This could probably be improved if you really wanted to spend time here. We use some basic operators to exclude replies and retweets, ensure our tweets are in English, and exclude media (GIFs, videos, etc.).
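
If you wanted to experiment with the query, the search syntax supports additional operators such as OR grouping. The variation below is purely illustrative and is not the query used in this project.

# Illustrative variation (not used in the project): broaden the search to multiple hashtags.
query = "(#nfl OR #nflfootball) -is:reply -is:retweet lang:en -has:media"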

The paginator returns a generator of tweet objects.

Tweet Filtering

Tweet objects come with some built-in context annotations we can use to eliminate tweets that seem unrelated. In our example, since we are looking for tweets related to the NFL, we limit our results to tweets with relevant annotations. 
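
Each entry in tweet.context_annotations pairs a domain with an entity. The dictionary below is a rough sketch of that shape; the values are illustrative rather than copied from a real response.

# Rough shape of a single context annotation (values are illustrative).
example_annotation = {
    "domain": {"id": "11", "name": "Sport", "description": "Types of sports"},
    "entity": {"id": "123456", "name": "NFL Football"},
}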

We iterate over each tweet object in our paginator result, and append any relevant tweet to a list so that we can store our results in a dataframe. 

relevant_tweets = []
created_at = []

for tweet in tweets:
    # Skip tweets that have no context annotations at all.
    if not tweet.context_annotations:
        continue
    for context_annotation in tweet.context_annotations:
        if "entity" in context_annotation:
            # The entity name is lowercased before comparison, so every
            # item in this list must be lowercase as well.
            if context_annotation["entity"]["name"].lower() in [
                "nfl",
                "nfl football",
                "gambling",
                "american football",
                "sports betting",
            ]:
                relevant_tweets.append(tweet.text)
                created_at.append(tweet.created_at)

 

Storing Results in a Pandas DataFrame

To make things easy, we will have our search function return a pandas dataframe containing the tweet and the date the tweet was created. 

df = pd.DataFrame(relevant_tweets)
df.columns = ["tweet"]
df["created_at"] = created_at
df.drop_duplicates(inplace=True)
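
As an aside, an equivalent and slightly more compact construction (not the one used in the script) builds both columns in a single step:

# Alternative construction (not used in the original script): build both columns at once.
df = pd.DataFrame({"tweet": relevant_tweets, "created_at": created_at}).drop_duplicates()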

 

A Complete Search Function

Putting it all together, our search function looks like this: 

def search(client: tw.Client, query: str, max_results: int, limit: int) -> pd.DataFrame:
    # Page through recent tweets matching the query.
    tweets = tw.Paginator(
        client.search_recent_tweets,
        query,
        tweet_fields=["context_annotations", "created_at"],
        max_results=max_results,
    ).flatten(limit=limit)

    logger.info(" Tweets returned.")

    relevant_tweets = []
    created_at = []

    for tweet in tweets:
        # Skip tweets that have no context annotations at all.
        if not tweet.context_annotations:
            continue
        for context_annotation in tweet.context_annotations:
            if "entity" in context_annotation:
                # The entity name is lowercased before comparison, so every
                # item in this list must be lowercase as well.
                if context_annotation["entity"]["name"].lower() in [
                    "nfl",
                    "nfl football",
                    "gambling",
                    "american football",
                    "sports betting",
                ]:
                    relevant_tweets.append(tweet.text)
                    created_at.append(tweet.created_at)

    logger.info(" Relevant tweets extracted")

    # Store the results in a DataFrame and drop duplicate tweets.
    df = pd.DataFrame(relevant_tweets)
    df.columns = ["tweet"]
    df["created_at"] = created_at
    df.drop_duplicates(inplace=True)

    logger.info(" Sent relevant tweets to DF.")

    return df

The resulting dataframe will look something like this. 

___________________________________________

Executing the Script

___________________________________________

Finally, we combine all of the functions detailed above to run our script. 

  • initalize_client establishes our connection to the Twitter API.
  • search returns a dataframe of the tweets of interest.
  • We write the dataframe to a parquet file in the data folder of this repo. As I will cover in a future blog post, we will use GitHub Actions to run this script daily and create a new parquet file for each day. With that in mind, we name the file tweets_<date>.parquet.

if __name__ == "__main__":
    working_dir = os.getcwd()
    data_dir = os.path.join(working_dir, "nfl_tweets/data/")

    logger.info(" Starting script.")

    client = initalize_client()

    relevant_tweets = search(
        client,
        "#nfl -is:reply -is:retweet lang:en -has:media",
        max_results=100,
        limit=3000,
    )

    file_name = "tweets_" + str(datetime.date.today()).replace("-", "_") + ".parquet"

    if os.path.exists(data_dir):
        relevant_tweets.to_parquet(os.path.join(data_dir, file_name))
        logger.info(
            f" Wrote {relevant_tweets.shape[0]} rows and {relevant_tweets.shape[1]} columns to {file_name}"
        )
    else:
        os.mkdir(data_dir)
        logger.info(f" Created {data_dir}")
        relevant_tweets.to_parquet(os.path.join(data_dir, file_name))
        logger.info(
            f" Wrote {relevant_tweets.shape[0]} rows and {relevant_tweets.shape[1]} columns to {file_name}"
        )
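
If you would rather not duplicate the write logic across the two branches, a common alternative (not what the original script does) is to ensure the directory exists up front with os.makedirs and then write once:

# Alternative (not in the original script): create the directory if needed, then write once.
os.makedirs(data_dir, exist_ok=True)
relevant_tweets.to_parquet(os.path.join(data_dir, file_name))
logger.info(f" Wrote {relevant_tweets.shape[0]} rows to {file_name}")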

 

Feel free to reach out with any questions using the contact page or by hitting me up on any of my social links!