Create a dataframe using specific strings in a column from a parent dataframe

I am facing the following problem. My Dataframe is as follows,

I want to create 3 datasets from this dataframe:

The Response column stays, and the Context column keeps only the first string: Tweet1, Tweet3, Tweet6, Tweet7 and Tweet11.
The Response column stays, and the Context column keeps the first and second strings: Tweet1, Tweet2, Tweet3, Tweet4, Tweet6, Tweet7, Tweet8, Tweet11 and Tweet12.
The Response column stays, and the Context column keeps the first, second and third strings: Tweet1, Tweet2, Tweet3, Tweet4, Tweet5, Tweet6, Tweet7, Tweet8, Tweet9, Tweet11 and Tweet12.

All the tweets in the Context column are stored in a list, as shown above, and are separated by commas.

I appreciate your response and comments.

CodePudding user response:

Based on your new information, I will now mimic reading the JSON file like this:

import pandas as pd
from io import StringIO

file_as_str="""
[
{"label":1, "response" : "resp_exmaple1", "context": ["tweet1,with comma", "tweet2"]},
{"label":0, "response" : "resp_exmaple2", "context": ["tweet3", "tweet4", "tweet5"]},
{"label":1, "response" : "resp_exmaple3", "context": ["tweet6, with comma"]},
{"label":1, "response" : "resp_exmaple4", "context": ["tweet7", "Tweet8", "Tweet9", "Tweet10"]},
{"label":0, "response" : "resp_exmaple5", "context": ["tweet11", "Tweet12"]}
]
"""
tweets_json = StringIO(file_as_str)

The string above only stands in for a real file. Reading it works the same way as reading from a file:

tweets = pd.read_json(tweets_json, orient='records')

If the structure is indeed like my example, you should pass orient='records'; if it is different, you may need to pick another orient scheme. The dataframe now looks like:

   label       response                            context
0      1  resp_exmaple1        [tweet1,with comma, tweet2]
1      0  resp_exmaple2           [tweet3, tweet4, tweet5]
2      1  resp_exmaple3               [tweet6, with comma]
3      1  resp_exmaple4  [tweet7, Tweet8, Tweet9, Tweet10]
4      0  resp_exmaple5                 [tweet11, Tweet12]

The difference is that the context column now contains lists of strings, so commas inside a tweet don't matter. Now you can easily select a maximum number of tweets per row like this:

context = tweets["context"]

max_tweets = 2  # maximum number of tweets to keep per row
new_context = []

for tweet_list in context:
    # Slicing past the end of a list is safe, so shorter lists are kept whole
    tweets_selection = tweet_list[:max_tweets]
    new_context.append(tweets_selection)

# Overwrite the column with the truncated lists
tweets["context"] = new_context

The result looks like:

   label       response                      context
0      1  resp_exmaple1  [tweet1,with comma, tweet2]
1      0  resp_exmaple2             [tweet3, tweet4]
2      1  resp_exmaple3         [tweet6, with comma]
3      1  resp_exmaple4             [tweet7, Tweet8]
4      0  resp_exmaple5           [tweet11, Tweet12]
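
The question asks for three separate datasets (first tweet only, first two, first three), while the loop above overwrites the context column in place for a single limit. A minimal sketch of one way to get all three at once is shown here. It rebuilds the sample rows by hand instead of calling pd.read_json, and the helper name truncate_context and the datasets dictionary are made up for illustration; they are not part of the original answer.

```python
import pandas as pd

# Sample rows copied from the example above; real data would come from pd.read_json.
tweets = pd.DataFrame({
    "label": [1, 0, 1, 1, 0],
    "response": ["resp_exmaple1", "resp_exmaple2", "resp_exmaple3",
                 "resp_exmaple4", "resp_exmaple5"],
    "context": [
        ["tweet1,with comma", "tweet2"],
        ["tweet3", "tweet4", "tweet5"],
        ["tweet6, with comma"],
        ["tweet7", "Tweet8", "Tweet9", "Tweet10"],
        ["tweet11", "Tweet12"],
    ],
})


def truncate_context(df, max_tweets):
    """Return a copy of df whose context lists keep at most max_tweets items."""
    # Slicing past the end of a list is safe, so no explicit length check is needed.
    shortened = df["context"].apply(lambda tweet_list: tweet_list[:max_tweets])
    # assign() returns a new dataframe, so the parent dataframe is not modified.
    return df.assign(context=shortened)


# One dataframe per context length, keyed by the number of tweets kept.
datasets = {n: truncate_context(tweets, n) for n in (1, 2, 3)}

print(datasets[1])
print(datasets[3])
```

Because each call returns a new dataframe, the parent tweets dataframe keeps its full context lists and can be reused for any other limit.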