pandas: how to duplicate a value for every substring in a column

Time:10-13

I have a pandas DataFrame as follows:

import pandas as pd

df = pd.DataFrame({'text': ['set an alarm for [time : two hours from now]','wake me up at [time : nine am] on [date : friday]','check email from [person : john]']})
print(df)

original dataframe

                                                text
0       set an alarm for [time : two hours from now]
1  wake me up at [time : nine am] on [date : friday]
2                   check email from [person : john]

I would like to repeat the brackets and the label (date, time, or person) for every word inside a bracketed chunk whenever the chunk holds more than one word. The desired output is:

desired output:

                                                new_text                                
0       set an alarm for [time : two] [time : hours] [time : from] [time : now]        
1  wake me up at [time : nine] [time : am] on [date : friday]  
2                   check email from [person : john]

I have so far tried to separate the lists from the original column, but do not know how to continue.

df['separated_list'] = df.text.str.split(r"\s(?![^[]*])|[|]").apply(lambda x: [y for y in x if '[' in y])

CodePudding user response:

You can use a regex with a custom function as replacement:

df['new_text'] = df.text.str.replace(
  r"\[([^\[\]]*?)\s*:\s*([^\[\]]*)\]",
  lambda m: ' '.join([f'[{m.group(1)} : {x}]'
                      for x in m.group(2).split()]), # new chunk for each word
  regex=True)

output:

                                                text                                                                 new_text
0       set an alarm for [time : two hours from now]  set an alarm for [time : two] [time : hours] [time : from] [time : now]
1  wake me up at [time : nine am] on [date : friday]               wake me up at [time : nine] [time : am] on [date : friday]
2                   check email from [person : john]                                         check email from [person : john]
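
To see what the replacement callable does without pandas in the way, here is a minimal sketch that applies the same pattern to plain strings with `re.sub`. The names `PATTERN`, `expand_labels`, and `repl` are made up for illustration and are not part of the answer above.

```python
import re

# Same pattern as in the answer: group 1 is the label, group 2 is the value.
PATTERN = re.compile(r"\[([^\[\]]*?)\s*:\s*([^\[\]]*)\]")

def expand_labels(text):
    """Repeat the label once for each whitespace-separated word of the value."""
    def repl(m):
        label, value = m.group(1), m.group(2)
        # One '[label : word]' chunk per word, joined by single spaces.
        return ' '.join(f'[{label} : {word}]' for word in value.split())
    return PATTERN.sub(repl, text)

examples = [
    'set an alarm for [time : two hours from now]',
    'wake me up at [time : nine am] on [date : friday]',
    'check email from [person : john]',
]
results = [expand_labels(t) for t in examples]
for r in results:
    print(r)
```

One thing to keep in mind: a chunk with an empty value, such as `[time : ]`, yields no words, so the callable replaces it with an empty string.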


CodePudding user response:

Find the bracketed chunks using lookbehind and lookahead assertions, match the contents between the brackets with repeated character classes, then split the contents on the colon:

import re
import pandas as pd

df = pd.DataFrame({'text': ['set an alarm for [time : two hours from now]','wake me up at [time : nine am] on [date : friday]','check email from [person : john]']})
data = df['text']
for item in data:
    print(item)
    # match "label : value" between '[' and ']'; the lookarounds keep the brackets out of the match
    matches = re.findall(r'(?<=\[)[\w\s]+:[\w\s]+(?=\])', item)
    for match in matches:
        # split into the label part and the value part
        parts = match.split(":")
        print(parts)

output:

set an alarm for [time : two hours from now]
['time ', ' two hours from now']
wake me up at [time : nine am] on [date : friday]
['time ', ' nine am']
['date ', ' friday']
check email from [person : john]
['person ', ' john']
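
The loop above stops at printing the label and value parts. One possible way to carry it through to the desired `new_text` column is sketched below. It assumes that labels and values contain only word characters and whitespace, and the helper names `expand_chunk` and `expand_text` are hypothetical.

```python
import re
import pandas as pd

def expand_chunk(chunk):
    """Turn 'time : nine am' into '[time : nine] [time : am]'."""
    label, _, value = chunk.partition(":")
    label = label.strip()  # drop the stray spaces left around the colon
    return " ".join(f"[{label} : {word}]" for word in value.split())

def expand_text(text):
    # Lookaround pattern: matches the contents between '[' and ']' only.
    for chunk in re.findall(r"(?<=\[)[\w\s]+:[\w\s]+(?=\])", text):
        # Swap the original bracketed chunk for its expanded form.
        text = text.replace(f"[{chunk}]", expand_chunk(chunk))
    return text

df = pd.DataFrame({'text': ['set an alarm for [time : two hours from now]',
                            'wake me up at [time : nine am] on [date : friday]',
                            'check email from [person : john]']})
df['new_text'] = df['text'].apply(expand_text)
print(df['new_text'].tolist())
```

Because the lookarounds exclude the brackets from the match, the sketch adds them back when building the string to replace.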