Downloading NFL Tweets
This blog post outlines how we will use tweepy to develop a dataset for our NFL tweets project. This is just one component of a larger project.
___________________________________________
Installing Tweepy
___________________________________________
Tweepy is an easy-to-use Python library for accessing the Twitter API. Its API class exposes methods for the entire Twitter RESTful API; each method accepts various parameters and returns a response. Using the API, you can programmatically access things like tweets, spaces, and lists.
To install tweepy, you can use the following command: pip install tweepy
More information about other ways to install tweepy can be found here.
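Once installed, a quick way to confirm the library is importable and check which version you have:
import tweepy

# Print the installed version to confirm the install worked
print(tweepy.__version__)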
___________________________________________
Setting up Twitter API Access
___________________________________________
The Twitter API can be used to programmatically retrieve and analyze Twitter data. In order to call the API from the scripts we write, the first thing we need to do is set up a Twitter developer project.
Setting up a Twitter Developer Project
To do so, follow these steps:
- In order to get started with the new Twitter API, you need a developer account. If you do not have one yet, you can sign up for one.
- Next, in the developer portal, create a new Project. Once the project is created, you will need to connect it to an App. An App is just a container for the API keys you need in order to make HTTP requests to the Twitter API.
Once you have created a developer project, your developer dashboard should look something like this.
API Access Levels
The API has varying levels of access, where the level of access a user has dictates the amount they can use the API and the types of information they can access.
At the time of writing this article, there are 3 levels of access for the API. I highlight some of the key differences below.
- Essential:
- Developer accounts come standard with this level of access
- Retrieve up to 500k Tweets per month
- Only supports app-only and user context authentication methods
- Elevated:
- Retrieve up to 2 million Tweets per month
- Supports all authentication methods
- Academic Research:
- Retrieve up to 10 million Tweets per month
- Access to full tweet archive & advanced filter operators
- Supports all authentication methods
For this project, we will request elevated access. This will enable us to collect more tweets for our training data. Academic Research access would be nice, but it is hard to get. API users can apply for elevated access within the Twitter developer portal.
___________________________________________
Using Tweepy to Download Tweets
___________________________________________
At this point, we have the correct access level and our project is set up, so we are ready to use tweepy to download tweets for our project. The completed extraction script can be found here, but we will walk through each step in greater detail below. As a disclaimer, this script is not written to handle errors gracefully; you can modify it to add better error handling if you wish.
Required Imports
For this script, we will import the following required packages:
import tweepy as tw
import pandas as pd
import logging
import os
import datetime
Tweepy Authentication
Next, we need to point our script to the App we set up in the previous steps. To do this, we will use app-only authentication. You can find your BEARER_TOKEN within the Twitter developer portal. I stored this token within my environment variables so it can easily interact with GitHub Actions in the next step of this project. If you have never worked with environment variables before, this article provides a helpful overview of how to use environment variables within Python scripts.
If your BEARER_TOKEN is stored within your environment variables, you can access it within your script in the following way. Additionally, we set up some basic logging here so we can see what the script is doing as it runs.
BEARER_TOKEN = os.getenv("BEARER_TOKEN")
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
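If the token is missing from the environment, the client will still be constructed but any request made with it will fail with an authentication error. To fail fast, you could add an optional guard like this (my addition, not part of the original script):
# Optional: stop early with a clear message if the token was never set
if BEARER_TOKEN is None:
    raise RuntimeError("BEARER_TOKEN is not set in the environment.")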
Now, we need to initialize our Tweepy client. This client will authenticate our script using the BEARER_TOKEN. We will do this via a function titled initalize_client. You can also see we use the logger to confirm the client was set up correctly. This function returns an authenticated tweepy client that will enable our script to interact with the API.
def initalize_client() -> tw.Client:
    client = tw.Client(bearer_token=BEARER_TOKEN)
    logger.info(" Client authenticated and initialized.")
    return client
Searching Twitter
Now that our client has been set up, we will pass it to a function titled search. The signature for this function is def search(client: tw.Client, query: str, max_results: int, limit: int). We will briefly describe each of these inputs below.
- client: Authenticated Twitter client we will use to execute our query. This is created in the initalize_client function.
- query: The Twitter search query. Using elevated access, we have the ability to use some basic operators. These operators can be used to filter the results of our Twitter query.
- max_results: Maximum number of tweets to return per page.
- limit: Total number of tweets to return (the number of pages to query is the total number of tweets divided by the max results per page; see the quick arithmetic below).
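To make the relationship between these two parameters concrete, here is the rough page arithmetic for the values used in the final script at the end of this post:
max_results = 100  # tweets returned per page
limit = 3000       # total tweets to collect

# The paginator will request roughly limit / max_results pages of results
pages_needed = limit // max_results
print(pages_needed)  # 30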
There are multiple pieces to our search function:
- Paginator
- Tweet filter
- Writing to dataframe
We will look at each of these components in more detail.
Paginator
We will use the paginator to search recent tweets using our search criteria. Specifically, the lines of code will look like this:
tweets = tw.Paginator(
    client.search_recent_tweets,
    query,
    tweet_fields=["context_annotations", "created_at"],
    max_results=max_results,
).flatten(limit=limit)
We tell our client to hit the search recent tweets endpoint. We pass a query to the endpoint, specifying what we want to search for. We limit our search to max_results per page, and collect limit tweets in total. Additionally, we pull in some extra fields available on tweet objects (created_at, context_annotations).
For this project, the query we pass to the paginator is #nfl -is:reply -is:retweet lang:en -has:media. This could probably be improved if you really wanted to spend time here. We use some basic operators to exclude replies and retweets, ensure our tweets are in English, and exclude media (GIFs, videos, etc.).
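For illustration, here are a couple of hypothetical query variants built from the same operator syntax. They are not used in this project, but show how the filter could be tightened or broadened:
# Query used in this post
query = "#nfl -is:reply -is:retweet lang:en -has:media"

# Hypothetical variants (not used in the project)
query_no_links = "#nfl -is:reply -is:retweet lang:en -has:media -has:links"
query_broader = "(#nfl OR #nflfootball) -is:retweet lang:en"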
The paginator will return a generator of tweet objects.
Tweet Filtering
Tweet objects come with some built-in context annotations we can use to eliminate tweets from our query results that seem unrelated. In our example, since we are looking for tweets related to the NFL, we limit our results to tweets with relevant annotations.
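Each context annotation is a dictionary with "domain" and "entity" entries. Roughly, a single annotation looks like the example below (the field values here are illustrative placeholders, not real API output):
example_annotation = {
    # Illustrative placeholder values, not real API output
    "domain": {"id": "...", "name": "Sports League", "description": "..."},
    "entity": {"id": "...", "name": "NFL"},
}

# The filter below only looks at the (lowercased) entity name
print(example_annotation["entity"]["name"].lower())  # "nfl"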
We iterate over each tweet object in our paginator result, and append any relevant tweet to a list so that we can store our results in a dataframe.
relevant_tweets = []
created_at = []
for tweet in tweets:
    if len(tweet.context_annotations) <= 0:
        pass
    else:
        for context_annotation in tweet.context_annotations:
            if "entity" in context_annotation:
                # Compare against lowercase names, since the entity name is lowercased
                if context_annotation["entity"]["name"].lower() in [
                    "nfl",
                    "nfl football",
                    "gambling",
                    "american football",
                    "sports betting",
                ]:
                    relevant_tweets = relevant_tweets + [tweet.text]
                    created_at = created_at + [tweet.created_at]
Storing Results in Pandas Dataframe
To make things easy, we will have our search function return a pandas dataframe containing the tweet and the date the tweet was created.
df = pd.DataFrame(relevant_tweets)
df.columns = ["tweet"]
df["created_at"] = created_at
df.drop_duplicates(inplace=True)
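An equivalent and slightly more direct way to build the same frame is to pass both columns to the constructor at once; either form should produce the same result here:
df = pd.DataFrame({"tweet": relevant_tweets, "created_at": created_at})
df.drop_duplicates(inplace=True)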
A Complete Search Function
Putting it all together, our search function now looks like this:
def search(client: tw.Client, query: str, max_results: int, limit: int) -> pd.DataFrame:
    tweets = tw.Paginator(
        client.search_recent_tweets,
        query,
        tweet_fields=["context_annotations", "created_at"],
        max_results=max_results,
    ).flatten(limit=limit)
    logger.info(" Tweets returned.")
    relevant_tweets = []
    created_at = []
    for tweet in tweets:
        if len(tweet.context_annotations) <= 0:
            pass
        else:
            for context_annotation in tweet.context_annotations:
                if "entity" in context_annotation:
                    if context_annotation["entity"]["name"].lower() in [
                        "nfl",
                        "nfl football",
                        "gambling",
                        "american football",
                        "sports betting",
                    ]:
                        relevant_tweets = relevant_tweets + [tweet.text]
                        created_at = created_at + [tweet.created_at]
    logger.info(" Relevant tweets extracted")
    df = pd.DataFrame(relevant_tweets)
    df.columns = ["tweet"]
    df["created_at"] = created_at
    df.drop_duplicates(inplace=True)
    logger.info(" Sent relevant tweets to DF.")
    return df
The resulting dataframe will look something like this.
___________________________________________
Executing the Script
___________________________________________
Finally, we combine all of the functions detailed above to run our script.
- initalize_client establishes our connection to the Twitter API.
- search returns a dataframe of tweets of interest.
- We write the dataframe to a parquet file in the data folder of this repo. As I will detail in a future blog post, we will use GitHub Actions to run this script daily and create a new parquet file for each day. With that in mind, we name the file tweets_<date>.parquet.
if __name__ == "__main__":
    working_dir = os.getcwd()
    data_dir = os.path.join(working_dir, "nfl_tweets/data/")
    logger.info(" Starting script.")
    client = initalize_client()
    relevant_tweets = search(
        client,
        "#nfl -is:reply -is:retweet lang:en -has:media",
        max_results=100,
        limit=3000,
    )
    file_name = "tweets_" + str(datetime.date.today()).replace("-", "_") + ".parquet"
    if os.path.exists(data_dir):
        relevant_tweets.to_parquet(os.path.join(data_dir, file_name))
        logger.info(
            f" Wrote {relevant_tweets.shape[0]} rows and {relevant_tweets.shape[1]} columns to {file_name}"
        )
    else:
        os.mkdir(data_dir)
        logger.info(f" Created {data_dir}")
        relevant_tweets.to_parquet(os.path.join(data_dir, file_name))
        logger.info(
            f" Wrote {relevant_tweets.shape[0]} rows and {relevant_tweets.shape[1]} columns to {file_name}"
        )
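After a run, you can spot-check the output with pandas. The path and date below are hypothetical and simply follow the directory layout and tweets_<date>.parquet naming pattern used in the script:
import pandas as pd

# Hypothetical file name; the date portion follows the tweets_<date>.parquet pattern
df = pd.read_parquet("nfl_tweets/data/tweets_2022_10_01.parquet")
print(df.shape)
print(df.head())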
Feel free to reach out with any questions using the contact page or hitting me up on any of my social links!