Loop over a string and add the delimiter back to the substring

I'm trying to split a column of strings. I would like to split the string in each cell on 'juice', but keep that 'juice' at the end of the substring it closes. The last substring only ends with 'juice' if the original string does.

Example: df['value'] looks like below:

1. applejuice, orangejuice, juice, applejuice, pineapple juice, berriesjuice
2. carrotjuice, juice, pinapple juice, water, berriesjuice, juice

My output in the new column df['value2'] would look like this:

1. [applejuice, orangejuice, juice], [applejuice, pineapple ,juice], [berriesjuice]
2. [carrotjuice, juice], [pinapple juice], [water, berriesjuice, juice]

CodePudding user response：

It's unclear why you need a DataFrame for this, but start by splitting on commas, then iterate over the pieces and close the current section whenever a piece is exactly 'juice'.

import re

lines = [
    'applejuice, orangejuice, juice, applejuice, pineapple juice, juice, berriesjuice',
    'carrotjuice, juice, pinapple juice, water, berriesjuice, juice'
]

def getSections(line):
    # Split on commas, swallowing any whitespace that follows each comma
    strings = re.split(r',\s*', line)

    sections = []
    section = []
    for x in strings:
        section.append(x)
        if x == 'juice':
            # A standalone 'juice' closes the current section
            sections.append(section)
            section = []
    # Keep whatever is left after the last standalone 'juice'
    if section:
        sections.append(section)

    return sections

for s in map(getSections, lines):
    print(s)

[['applejuice', 'orangejuice', 'juice'], ['applejuice', 'pineapple juice', 'juice'], ['berriesjuice']]
[['carrotjuice', 'juice'], ['pinapple juice', 'water', 'berriesjuice', 'juice']]

From a list of lists like this, you can build a DataFrame if you want one.
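As a rough sketch of that last step, assuming pandas is installed and the strings live in a column named value as in the question: getSections is repeated here only so the snippet runs on its own, and the names df and first_row are illustrative. It shows two options, storing each row's list of sections in a new column, and expanding one row's sections into a small DataFrame of their own.

```python
import re
import pandas as pd

def getSections(line):
    # Same logic as the function above: a standalone 'juice' closes a section
    sections, section = [], []
    for x in re.split(r',\s*', line):
        section.append(x)
        if x == 'juice':
            sections.append(section)
            section = []
    if section:
        sections.append(section)
    return sections

df = pd.DataFrame({'value': [
    'applejuice, orangejuice, juice, applejuice, pineapple juice, juice, berriesjuice',
    'carrotjuice, juice, pinapple juice, water, berriesjuice, juice',
]})

# Option 1: keep one list of sections per cell in a new column
df['value2'] = df['value'].map(getSections)

# Option 2: one section per row for a single input string;
# shorter sections are padded with missing values
first_row = pd.DataFrame(getSections(df.loc[0, 'value']))

print(df['value2'].tolist())
print(first_row)
```

Which layout is preferable depends on what happens next: the nested-list column keeps one row per original string, while the expanded form is easier to filter or export.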

CodePudding user response：

Applying this function to the value column should do the job. It starts by splitting on ', ' (note the space), and then makes new sublists every time it encounters 'juice' on its own.

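def separate(string):
    substrings = [[]]
    for x in string.split(', '):
        substrings[-1].append(x)
        if x == 'juice':
            # A standalone 'juice' closes the current sublist
            substrings.append([])
    # If the string ended with 'juice', drop the empty sublist left over
    if not substrings[-1]:
        substrings.pop()
    return substrings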

import pandas as pd
df = pd.DataFrame({'value' : [
    'applejuice, orangejuice, juice, applejuice, pineapple juice, juice, berriesjuice', 
    'carrotjuice, juice, pinapple juice, water, berriesjuice, juice'
    ]})
# Series.apply has no axis parameter; extra keywords would be passed on to separate
df['value2'] = df['value'].apply(separate)

I'm not sure about speed though.
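One way to get a rough feel for the speed is to time the function on a larger input with the standard-library timeit module. The sketch below is only illustrative: the benchmark data is just the two sample strings repeated, the row count and number of passes are arbitrary, and separate is restated (with a guard that drops a trailing empty sublist) so the snippet runs on its own. It times the pure-Python function over a list, not the pandas apply call itself.

```python
import timeit

def separate(string):
    substrings = [[]]
    for x in string.split(', '):
        substrings[-1].append(x)
        if x == 'juice':
            substrings.append([])
    # Drop the empty sublist left behind when the string ends with 'juice'
    if not substrings[-1]:
        substrings.pop()
    return substrings

# Hypothetical benchmark data: the two sample strings repeated 5000 times
rows = [
    'applejuice, orangejuice, juice, applejuice, pineapple juice, juice, berriesjuice',
    'carrotjuice, juice, pinapple juice, water, berriesjuice, juice',
] * 5000

passes = 5
elapsed = timeit.timeit(lambda: [separate(r) for r in rows], number=passes)
print(f"{elapsed / passes:.4f} s per pass over {len(rows)} rows")
```

The absolute numbers depend on the machine, so they are mainly useful for comparing alternatives on the same data.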