I am converting json_response to a dataframe using the following commands:
import pandas as pd

df = pd.DataFrame(columns=["created_at", "username", "description", "tweet_id"])  # an empty dataframe to save data
data_nested = pd.json_normalize(json_response['data'])
df_temp = data_nested[["created_at", "username", "description"]].copy()
df = pd.concat([df, df_temp], ignore_index=True)
df.reset_index(inplace=True, drop=True)
The following is a sample json_response:
{
  "data": [
    {
      "created_at": "2020-01-01T12:24:45.000Z",
      "description": "This is a sample description",
      "id": "12345678",
      "name": "Sample Name",
      "username": "sample_name"
    }
  ],
  "meta": {
    "next_token": "sample_token",
    "result_count": 1
  }
}
This response is the result of querying the "Retweeted_by" endpoint of Twitter API v2. I am trying to save the tweet_id against each response in the loop (to know which resulting rows correspond to which requested tweet_id) by doing df['tweet_id'] = tweet_id. I understand that with this approach the last tweet_id overwrites every other value in the column.
I also tried the following, using the index:
idx = df["username"].last_valid_index()
if pd.isnull(idx) or idx is None:
    df.loc[0, "tweet_id"] = tweet_id
else:
    df.loc[idx + 1, "tweet_id"] = tweet_id
But this fails as well, because if result_count in json_response is greater than 1, it saves tweet_id at the next row and leaves the previous ones as NaN.
Can someone please suggest a solution? Thank you.
Answer:
Based on our exchange in the comments, here is my proposed solution:
import pandas as pd
import requests

tweet_id_list = [1, 2, 3]  # a list of all of your tweet ids

# Here you loop through each id and get its retweets.
# You could make this async, but I would be careful since token limits are very
# strict on Twitter. They can disable it if you go over the limit a lot.
all_dfs = []
for tweet_id in tweet_id_list:
    # "url/..." is a placeholder for the real endpoint (retweeted_by is a GET
    # endpoint); add your own auth headers to the request.
    response = requests.get(f"url/{tweet_id}")
    json_response = response.json()
    temp_df = pd.DataFrame.from_records(json_response["data"])
    temp_df["tweet_id"] = tweet_id  # same value for every row of temp_df
    all_dfs.append(temp_df)

# If you then want one big table with all the retweets and tweet_ids,
# simply do:
df = pd.concat(all_dfs, ignore_index=True)
Just a bit of explanation.
You create a dataframe (temp_df) holding the retweets for each tweet_id, and you add an extra column to it called tweet_id. When you assign a single value to a dataframe column, it propagates to every row of that dataframe.
You then collect the dataframes for each tweet_id into a list, all_dfs.
After you exit the loop you are left with a list of dataframes. If you want one big table, you concatenate them as I have shown in the code.
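If you want to check the idea without calling the API, here is a minimal sketch that runs offline. The fake_responses dict and its tweet ids ("111", "222") are made up for illustration and only mimic the shape of the sample json_response from the question; in real code each value would come from the request inside the loop.

```python
import pandas as pd

# Hypothetical stand-ins for the API responses, keyed by the tweet_id that was
# queried. The shape mirrors the sample json_response from the question.
fake_responses = {
    "111": {
        "data": [
            {"created_at": "2020-01-01T12:24:45.000Z", "description": "first",
             "id": "1", "name": "A", "username": "user_a"},
            {"created_at": "2020-01-02T08:00:00.000Z", "description": "second",
             "id": "2", "name": "B", "username": "user_b"},
        ],
        "meta": {"result_count": 2},
    },
    "222": {
        "data": [
            {"created_at": "2020-01-03T09:30:00.000Z", "description": "third",
             "id": "3", "name": "C", "username": "user_c"},
        ],
        "meta": {"result_count": 1},
    },
}

all_dfs = []
for tweet_id, json_response in fake_responses.items():
    temp_df = pd.json_normalize(json_response["data"])
    temp_df = temp_df[["created_at", "username", "description"]].copy()
    # Assigning a scalar fills every row of this per-tweet dataframe,
    # no matter how many retweeters came back (result_count > 1 included).
    temp_df["tweet_id"] = tweet_id
    all_dfs.append(temp_df)

# One table with a fresh 0..n-1 index
df = pd.concat(all_dfs, ignore_index=True)
print(df[["username", "tweet_id"]])
```

Because tweet_id is set on temp_df before concatenating, no row ends up as NaN and earlier ids are never overwritten, which were the two problems described in the question.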