Equivalent of str.split() for a list of strings?

Given a list of strings, I want to split it into a list of lists of strings around every string that contains a delimiting substring, similar to how str.split() works on a single string. In particular, the delimiting strings should be removed from the output.

This is a similar problem to both of the following questions:

However, these do not remove the delimiting string and address slightly different scenarios.

I worked from these answers to build the two solutions posted below as answers. I'm guessing there is a built-in or a common library function for this that I am not aware of.

Does anyone know a better way to go about this, or have any suggestions to improve my current solutions?

Example Input:

s_input = \
["Action1,A,200,5",
"Phase1,B,100,1,2000",
"Action1,C,300,5",
"Phase2,B,100,1,500",
"Action1,C,400,5",
"Action2,C,10,5",
"Action3,C,10,5"]

Example Usage:

substring = 'Phase'
s_lines_split = split_listofstrings_bysubstring(s_input, substring)

Example Output:

s_lines_split = \
    [["Action1,A,200,5"],
    ["Action1,C,300,5"],
    ["Action1,C,400,5", "Action2,C,10,5", "Action3,C,10,5"]]

CodePudding user response：

A Basic Python Approach:

def split_listofstrings_bysubstring(s_list: list, substring: str) -> list:
    """
    Takes a list of strings and splits it into a list of sublists, using every
    string that contains the substring as a delimiter. Delimiters are dropped.
    """

    split_list = [[]]

    for line in s_list:
        if substring in line:  # current line is a delimiter, so never keep it
            if split_list[-1]:  # only start a new sublist if the current one has content
                split_list.append([])
            continue
        split_list[-1].append(line)

    return split_list

This is useful because it needs no imports, and you can easily add parameters to make the function more general, e.g.:

def split_listofstrings_bysubstring(s_list: list, substring: str, delim_type: str='contains') -> list:
    """
    Takes a list of strings and splits it into a list of sublists, using every
    string that matches the substring as a delimiter. Delimiters are dropped.
    delim_type: 'contains', 'startswith', 'endswith'
    """

    split_list = [[]]

    if delim_type == 'contains':
        for line in s_list:
            if substring in line:  # current line is a delimiter
                if split_list[-1]:
                    split_list.append([])  # start a new sublist
                continue
            split_list[-1].append(line)

    elif delim_type == 'startswith':  # elif, so 'contains' does not fall through to the else below
        for line in s_list:
            if line.startswith(substring):  # current line is a delimiter
                if split_list[-1]:
                    split_list.append([])  # start a new sublist
                continue
            split_list[-1].append(line)

    elif delim_type == 'endswith':
        for line in s_list:
            if line.endswith(substring):  # current line is a delimiter
                if split_list[-1]:
                    split_list.append([])  # start a new sublist
                continue
            split_list[-1].append(line)

    else:
        raise ValueError(f"Parameter passed to delim_type was not recognised. Passed: {delim_type}")

    return split_list
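
The three branches differ only in the test applied to each line, so one way to generalise further is to pass that test in as a function. The following is a minimal sketch under that assumption; split_list_by_predicate and is_delimiter are hypothetical names, not part of the code above. Unlike the loop above, this sketch never emits an empty sublist.

```python
from typing import Callable, List


def split_list_by_predicate(
    s_list: List[str], is_delimiter: Callable[[str], bool]
) -> List[List[str]]:
    """Split s_list into sublists, dropping every line for which is_delimiter is true.

    Hypothetical helper for illustration; empty sublists are never emitted.
    """
    split_list: List[List[str]] = []
    current: List[str] = []
    for line in s_list:
        if is_delimiter(line):
            if current:  # close the sublist collected so far
                split_list.append(current)
                current = []
            continue  # the delimiter itself is not kept
        current.append(line)
    if current:  # keep whatever follows the last delimiter
        split_list.append(current)
    return split_list


s_input = [
    "Action1,A,200,5",
    "Phase1,B,100,1,2000",
    "Action1,C,300,5",
    "Phase2,B,100,1,500",
    "Action1,C,400,5",
    "Action2,C,10,5",
    "Action3,C,10,5",
]

by_contains = split_list_by_predicate(s_input, lambda s: "Phase" in s)
by_prefix = split_list_by_predicate(s_input, lambda s: s.startswith("Phase"))
by_suffix = split_list_by_predicate(s_input, lambda s: s.endswith(",5"))
```

With this shape the caller decides what counts as a delimiter (a regex match, a set lookup, and so on) without the function needing a new delim_type value for each case.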

Itertools Groupby approach

Using itertools groupby with a helper class, as originally mentioned in python splitting list by keyword. This method in its current form is not as neat, because you have to add an additional step to remove the delimiting string. Note that I did consider simply dropping the first element of each list, but this fails in the case where the delimiting string is not the very first line.

from itertools import groupby
class GroupbyHelper(object):

    def __init__(self, val):
        self.val = val
        self.i = 0

    def __call__(self, val):
        self.i += (self.val in val)  # running count of delimiter lines seen so far
        return self.i

def split_listofstrings_bysubstring_groupby(s_list: list, substring: str) -> list:
    """
    Takes a list of strings and splits it into several lists of lists delimited by the string containing a substring
    """
    s_lines_split = [list(g) for k, g in groupby(s_list, key=GroupbyHelper(substring))]
    s_lines_split_clean = list()
    for line in s_lines_split:
        if substring in line[0]:
            s_lines_split_clean.append(line[1:])
        else:
            s_lines_split_clean.append(line)
    return s_lines_split_clean
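
To see why the clean-up step is needed, it helps to look at the keys the helper produces. The sketch below assumes the helper is meant to keep a running count of delimiter lines, so the key changes at each delimiter and every group after the first starts with one. RunningCount is a hypothetical stand-in name used only for this illustration.

```python
from itertools import groupby


class RunningCount:
    """Hypothetical stand-in for the helper: counts delimiter lines seen so far."""

    def __init__(self, substring):
        self.substring = substring
        self.count = 0

    def __call__(self, line):
        # bool is added as 0 or 1, so the count steps up on each delimiter line
        self.count += self.substring in line
        return self.count


s_input = [
    "Action1,A,200,5",
    "Phase1,B,100,1,2000",
    "Action1,C,300,5",
    "Phase2,B,100,1,500",
    "Action1,C,400,5",
    "Action2,C,10,5",
    "Action3,C,10,5",
]

# The key assigned to each line, in order.
counter = RunningCount("Phase")
keys = [counter(line) for line in s_input]

# groupby starts a new group whenever the key changes, i.e. at each delimiter,
# so the delimiter line ends up as the first element of its group.
groups = [list(g) for _, g in groupby(s_input, key=RunningCount("Phase"))]

print(keys)
for group in groups:
    print(group)
```

The first group has no delimiter at its head because the input does not start with one, which is why blindly dropping index 0 of every group is not safe.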

Execution Time Comparison

Comparing these two approaches on the example input, both yield the expected answer. However, the groupby approach takes about 2.2x the execution time (avg 3.5us per loop) of the basic approach (avg 1.6us per loop). Execution time test code: https://pastebin.com/8Z5Kfybz
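
For anyone who wants to reproduce a comparison like this locally, here is a rough sketch using the standard timeit module. It is not the pastebin code; split_basic, split_groupby, RunningCount and n_loops are hypothetical names and condensed stand-ins for the two approaches, and the numbers will depend on your hardware.

```python
import timeit
from itertools import groupby

s_input = [
    "Action1,A,200,5",
    "Phase1,B,100,1,2000",
    "Action1,C,300,5",
    "Phase2,B,100,1,500",
    "Action1,C,400,5",
    "Action2,C,10,5",
    "Action3,C,10,5",
]


def split_basic(s_list, substring):
    """Condensed stand-in for the loop-based approach."""
    split_list = [[]]
    for line in s_list:
        if substring in line:
            if split_list[-1]:
                split_list.append([])
            continue
        split_list[-1].append(line)
    return split_list


class RunningCount:
    """Key function that counts delimiter lines seen so far (assumed helper behaviour)."""

    def __init__(self, substring):
        self.substring = substring
        self.count = 0

    def __call__(self, line):
        self.count += self.substring in line
        return self.count


def split_groupby(s_list, substring):
    """Condensed stand-in for the groupby approach with the clean-up step."""
    groups = [list(g) for _, g in groupby(s_list, key=RunningCount(substring))]
    return [g[1:] if substring in g[0] else g for g in groups]


# Check that both give the same answer before timing them.
assert split_basic(s_input, "Phase") == split_groupby(s_input, "Phase")

n_loops = 20_000
for func in (split_basic, split_groupby):
    total = timeit.timeit(lambda: func(s_input, "Phase"), number=n_loops)
    print(f"{func.__name__}: {total / n_loops * 1e6:.2f} us per loop")
```

On an input this small the per-call overhead dominates, so it may be worth repeating the measurement on a list closer to the real data size before drawing conclusions.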

Personally I prefer the basic approach for its simplicity and extensibility. Thoughts?

CodePudding user response：

Based on Mark's answer, I built out the function below, which seems to me to be the best approach. It takes 1.2x the execution time of the basic approach (an average of 1.8us per loop on my hardware), which I think is a fair trade-off for the improved readability.

See the itertools.groupby() documentation for an explanation of how the grouping works.

from itertools import groupby

def split_listofstrings_bysubstring(s_list: list, substring: str, delim_type: str='contains') -> list:
    """
    Takes a list of strings and splits it into several lists of lists delimited by the string containing a substring
    delim_type: 'contains', 'startswith', 'endswith'
    """

    if delim_type == 'contains':
        return [list(g) for test, g in groupby(s_list, key=lambda s: substring in s) if not test]
    
    elif delim_type == 'startswith':
        return [list(g) for test, g in groupby(s_list, key=lambda s: s.startswith(substring)) if not test]
        
    elif delim_type == 'endswith':
        return [list(g) for test, g in groupby(s_list, key=lambda s: s.endswith(substring)) if not test]
        
    else:
        raise ValueError(f"Parameter passed to delim_type was not recognised. Passed: {delim_type}")
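
One behaviour worth knowing: groupby merges consecutive delimiter lines into a single group and the `if not test` filter discards it, so this approach never produces empty sublists. That differs from str.split() with an explicit separator, which keeps empty strings for leading, trailing and repeated separators. A small sketch of the difference, where split_on is a hypothetical short form of the 'contains' branch:

```python
from itertools import groupby


def split_on(s_list, substring):
    """Hypothetical short form of the 'contains' branch above."""
    return [list(g) for test, g in groupby(s_list, key=lambda s: substring in s) if not test]


# Leading, consecutive and trailing delimiter lines.
lines = ["Phase0", "a", "Phase1", "Phase2", "b", "c", "Phase3"]

list_result = split_on(lines, "Phase")
print(list_result)
# [['a'], ['b', 'c']]

# str.split() with an explicit separator keeps the empty pieces instead.
str_result = "|a||b,c|".split("|")
print(str_result)
# ['', 'a', '', 'b,c', '']
```

If the empty groups matter for your data (for example, to know that two delimiters were adjacent), the loop-based version is easier to adapt than the groupby one.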