I am facing the following problem. My Dataframe is as follows,
I want to create 3 dataset from this dataframe,
- Response column stays and need context with the first string, so Tweet1, Tweet3 ,Tweet6,Tweet7 and Tweet11
- Response column stays and need context with the first and second string, so Tweet1,Tweet2, Tweet3, Tweet4,Tweet6, Tweet7, Tweet8,Tweet11 and Tweet12.
- Response column stays and need context with the first, second and third string, so, Tweet1,Tweet2, Tweet3,Tweet4,Tweet5,Tweet6,Tweet7,Tweet8,Tweet9,Tweet11 and Tweet12
All the tweets in the context column are in a list as shown above and they are separated using a comma.
I appreciate your repsonse and comments.
CodePudding user response:
Based on your new information, I will now mimic the reading of the json file like this::
import pandas as pd
from io import StringIO
file_as_str="""
[
{"label":1, "response" : "resp_exmaple1", "context": ["tweet1,with comma", "tweet2"]},
{"label":0, "response" : "resp_exmaple2", "context": ["tweet3", "tweet4", "tweet5"]},
{"label":1, "response" : "resp_exmaple3", "context": ["tweet6, with comma"]},
{"label":1, "response" : "resp_exmaple4", "context": ["tweet7", "Tweet8", "Tweet9", "Tweet10"]},
{"label":0, "response" : "resp_exmaple5", "context": ["tweet11", "Tweet12"]}
]
"""
tweets_json = StringIO(file_as_str)
The above string is only to mimic reading from file like this:
tweets = pd.read_json(tweets_json, orient='records')
If the structure is indeed is like my example, you should give orient='records', but if it is different you may need to pick another scheme. The dataframe now looks like:
label response context
0 1 resp_exmaple1 [tweet1,with comma, tweet2]
1 0 resp_exmaple2 [tweet3, tweet4, tweet5]
2 1 resp_exmaple3 [tweet6, with comma]
3 1 resp_exmaple4 [tweet7, Tweet8, Tweet9, Tweet10]
4 0 resp_exmaple5 [tweet11, Tweet12]
The difference is that the context column now contains lists of strings, so the comma's dont matter. Now you can easily make a selection of maximum number of tweets like this:
context = tweets["context"]
max_tweets = 2
new_context = list()
for tweet_list in context:
n_selection = min(len(tweet_list), max_tweets)
tweets_selection = tweet_list[:n_selection]
new_context.append(tweets_selection)
tweets["context"] = new_context
The result looks like
label response context
0 1 resp_exmaple1 [tweet1,with comma, tweet2]
1 0 resp_exmaple2 [tweet3, tweet4]
2 1 resp_exmaple3 [tweet6, with comma]
3 1 resp_exmaple4 [tweet7, Tweet8]
4 0 resp_exmaple5 [tweet11, Tweet12]