I'm trying to split a column of strings. I would like to split the string in each cell on 'juice', but keep 'juice' if it is not in the last substring.
Example: df['value'] looks like this:
1. applejuice, orangejuice, juice, applejuice, pineapple juice, berriesjuice
2. carrotjuice, juice, pinapple juice, water, berriesjuice, juice
My desired output in the new column df['value2'] would look like this:
1. [applejuice, orangejuice, juice], [applejuice, pineapple ,juice], [berriesjuice]
2. [carrotjuice, juice], [pinapple juice], [water, berriesjuice, juice]
CodePudding user response:
It's unclear why you need a DataFrame, but start by splitting on commas, then iterate and check whether each string is equal to 'juice'. Every time it is, close the current group and start a new one.
import re

lines = [
    'applejuice, orangejuice, juice, applejuice, pineapple juice, juice, berriesjuice',
    'carrotjuice, juice, pinapple juice, water, berriesjuice, juice'
]

def getSections(line):
    # Split on commas, swallowing any whitespace that follows them
    strings = re.split(r',\s*', line)
    sections = []
    section = []
    for x in strings:
        section.append(x)
        if x == 'juice':
            # A standalone 'juice' closes the current group
            sections.append(section)
            section = []
    # Keep whatever is left after the last standalone 'juice'
    if len(section) > 0:
        sections.append(section)
    return sections

for s in map(getSections, lines):
    print(s)
Output:
[['applejuice', 'orangejuice', 'juice'], ['applejuice', 'pineapple juice', 'juice'], ['berriesjuice']]
[['carrotjuice', 'juice'], ['pinapple juice', 'water', 'berriesjuice', 'juice']]
From this list of lists, you can build a DataFrame if you want one.
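For the DataFrame step, something like the following might work. It is only a sketch: it repeats getSections in a condensed form so the block runs on its own, and the column names value and value2 are taken from the question. Each cell of value2 ends up holding a plain Python list of lists.

```python
import re
import pandas as pd

def getSections(line):
    """Split on commas and close a group after each standalone 'juice'."""
    sections, section = [], []
    for x in re.split(r',\s*', line):
        section.append(x)
        if x == 'juice':
            sections.append(section)
            section = []
    if section:
        sections.append(section)
    return sections

lines = [
    'applejuice, orangejuice, juice, applejuice, pineapple juice, juice, berriesjuice',
    'carrotjuice, juice, pinapple juice, water, berriesjuice, juice',
]

# One row per input string; value2 holds the grouped result for that row.
df = pd.DataFrame({
    'value': lines,
    'value2': [getSections(line) for line in lines],
})

print(df['value2'].tolist())
```

Because the cells are Python objects rather than native pandas types, vectorised string methods will not apply to value2; it is mostly a container at that point.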
CodePudding user response:
Applying this function to the value column should do the job. It starts by splitting on ', ' (note the space), and then starts a new sublist every time it encounters 'juice' on its own.
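def separate(string):
    substrings = [[]]
    for x in string.split(', '):
        substrings[-1].append(x)
        if x == 'juice':
            # A standalone 'juice' ends the current sublist
            substrings.append([])
    # Drop the empty sublist left over when the string ends with 'juice'
    if not substrings[-1]:
        substrings.pop()
    return substrings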
import pandas as pd

df = pd.DataFrame({'value': [
    'applejuice, orangejuice, juice, applejuice, pineapple juice, juice, berriesjuice',
    'carrotjuice, juice, pinapple juice, water, berriesjuice, juice'
]})

# Apply separate to each string in the column
df['value2'] = df['value'].apply(separate)
I'm not sure about speed though.
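If speed matters, one rough way to check is to time the function on a plain list with timeit. This is only a sketch under assumptions: the 10,000-row workload is made up, and the separate shown here is a variant that drops the trailing empty sublist, so adjust it to match whatever version you actually use.

```python
import timeit

def separate(string):
    substrings = [[]]
    for x in string.split(', '):
        substrings[-1].append(x)
        if x == 'juice':
            substrings.append([])
    # Variant assumption: remove the empty sublist left after a final 'juice'
    if not substrings[-1]:
        substrings.pop()
    return substrings

sample = 'carrotjuice, juice, pinapple juice, water, berriesjuice, juice'
rows = [sample] * 10_000  # hypothetical workload, not real data

# Time five full passes over the rows and report the average per pass
passes = 5
elapsed = timeit.timeit(lambda: [separate(s) for s in rows], number=passes)
print(f'{elapsed / passes:.4f} s per pass over {len(rows)} rows')
```

The number will vary by machine, so treat it as a relative measure when comparing against other approaches rather than as a benchmark.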