I have a pandas DataFrame as follows:
import pandas as pd
df = pd.DataFrame({'text': ['set an alarm for [time : two hours from now]','wake me up at [time : nine am] on [date : friday]','check email from [person : john]']})
print(df)
original dataframe
text
0 set an alarm for [time : two hours from now]
1 wake me up at [time : nine am] on [date : friday]
2 check email from [person : john]
I would like to split each bracketed list so that every word inside it gets its own bracket with the same label (date, time, or person) whenever the list contains more than one word. So the desired output is:
desired output:
new_text
0 set an alarm for [time : two] [time : hours] [time : from] [time : now]
1 wake me up at [time : nine] [time : am] on [date : friday]
2 check email from [person : john]
So far I have tried to separate the lists from the original column, but I do not know how to continue:
df['separated_list'] = df.text.str.split(r"\s(?![^[]*])|[|]").apply(lambda x: [y for y in x if '[' in y])
CodePudding user response:
You can use a regex with a custom function as replacement:
df['new_text'] = df.text.str.replace(
    r"\[([^\[\]]*?)\s*:\s*([^\[\]]*)\]",
    lambda m: ' '.join([f'[{m.group(1)} : {x}]'
                        for x in m.group(2).split()]),  # new chunk for each word
    regex=True)
output:
   text                                               new_text
0  set an alarm for [time : two hours from now]       set an alarm for [time : two] [time : hours] [time : from] [time : now]
1  wake me up at [time : nine am] on [date : friday]  wake me up at [time : nine] [time : am] on [date : friday]
2  check email from [person : john]                   check email from [person : john]
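To see what the replacement function receives, here is a minimal sketch of the same idea on a plain string with re.sub. It assumes the same bracket pattern as above; the names TAG, expand_tag and expand_text are made up for this illustration, and the named groups label and words stand in for groups 1 and 2.

```python
import re

# Hypothetical names for this sketch: TAG, expand_tag, expand_text.
# "label" is the text before the colon, "words" is everything after it.
TAG = re.compile(r"\[(?P<label>[^\[\]]*?)\s*:\s*(?P<words>[^\[\]]*)\]")


def expand_tag(m):
    """Turn one '[label : w1 w2]' match into '[label : w1] [label : w2]'."""
    label = m.group("label")
    return " ".join(f"[{label} : {word}]" for word in m.group("words").split())


def expand_text(text):
    """Apply expand_tag to every bracketed tag found in text."""
    return TAG.sub(expand_tag, text)


print(expand_text("wake me up at [time : nine am] on [date : friday]"))
# wake me up at [time : nine] [time : am] on [date : friday]
```

Note that a tag with nothing after the colon would be replaced by an empty string in this sketch, since there are no words to build chunks from.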
CodePudding user response:
Find the [] pairs using a lookbehind and a lookahead, match the contents between them with repeated character classes, then split the contents on ":" to get the label and the value:
import re
import pandas as pd

df = pd.DataFrame({'text': ['set an alarm for [time : two hours from now]','wake me up at [time : nine am] on [date : friday]','check email from [person : john]']})
#print(df)
data = df['text']
for item in data:
    print(item)
    # text between "[" and "]" of the form "label : value"
    matches = re.findall(r'(?<=\[)[\w\s]+:[\w\s]+(?=\])', item)
    for match in matches:
        parts = match.split(":")  # [label, value], surrounding spaces are kept
        print(parts)
output:
set an alarm for [time : two hours from now]
['time ', ' two hours from now']
wake me up at [time : nine am] on [date : friday]
['time ', ' nine am']
['date ', ' friday']
check email from [person : john]
['person ', ' john']
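The loop above stops at the label and value parts. One possible way to get from there to the desired output is sketched below; it assumes the same lookaround pattern, and the names TAG_BODY and expand_item are hypothetical helpers, not part of the answer.

```python
import re

# Assumed pattern: the text between "[" and "]", of the form "label : words".
TAG_BODY = re.compile(r"(?<=\[)[\w\s]+:[\w\s]+(?=\])")


def expand_item(item):
    """Rebuild the string with one [label : word] chunk per word."""
    for match in TAG_BODY.findall(item):
        # The pattern allows exactly one colon, so split gives two parts.
        label, words = (part.strip() for part in match.split(":"))
        chunks = " ".join(f"[{label} : {word}]" for word in words.split())
        # Swap the original bracketed tag for its expanded form.
        item = item.replace(f"[{match}]", chunks)
    return item


print(expand_item("set an alarm for [time : two hours from now]"))
# set an alarm for [time : two] [time : hours] [time : from] [time : now]
```

With the dataframe from the question, this could be applied per row with df['new_text'] = df['text'].apply(expand_item).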