Subset a list in python on pre-defined string

Time:01-16

I have some extremely large lists of character strings that I need to parse. I need to break them into smaller lists on a pre-defined separator string. I figured out a way to do it, but I worry that it will not be performant on my real data. Is there a better way to do this?

My goal is to turn this list:

['a', 'b', 'string_to_split_on', 'c', 'd', 'e', 'f', 'g', 'string_to_split_on', 'h', 'i', 'j', 'k', 'string_to_split_on']

Into this list:

[['a', 'b'], ['c', 'd', 'e', 'f', 'g'], ['h', 'i', 'j', 'k']]

What I tried:

# List that replicates my data.  `string_to_split_on` is a fixed character string I want to break my list up on 
my_list = ['a', 'b', 'string_to_split_on', 'c', 'd', 'e', 'f', 'g', 'string_to_split_on', 'h', 'i', 'j', 'k', 'string_to_split_on']

# Inspect List
print(my_list)

# Create empty lists to store data in
new_list = []
good_letters = []

# Iterate over each string in the list
for i in my_list:

    # If the string is the separator, append the collected data to new_list and reset `good_letters`
    if i == 'string_to_split_on':
        new_list.append(good_letters)
        good_letters = []
    else:
        # Otherwise, append the letter to the list of good letters
        good_letters.append(i)



# I just like printing things this way because it's easy to read
for item in new_list:
    print(item)
    print('-'*100)


### Output
['a', 'b', 'string_to_split_on', 'c', 'd', 'e', 'f', 'g', 'string_to_split_on', 'h', 'i', 'j', 'k', 'string_to_split_on']
['a', 'b']
----------------------------------------------------------------------------------------------------
['c', 'd', 'e', 'f', 'g']
----------------------------------------------------------------------------------------------------
['h', 'i', 'j', 'k']
----------------------------------------------------------------------------------------------------

CodePudding user response:

You can also use one line of code:

original_list = ['a', 'b', 'string_to_split_on', 'c', 'd', 'e', 'f', 'g', 'string_to_split_on', 'h', 'i', 'j', 'k', 'string_to_split_on']
split_string = 'string_to_split_on'

new_list = [sublist.split() for sublist in ' '.join(original_list).split(split_string) if sublist]
print(new_list)
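
One caveat with the join/split one-liner: it round-trips the data through a single string, so it only reproduces the original elements when none of them contain whitespace (or the separator as a substring). A small sketch with made-up sample data (the `items` list is hypothetical, not from the question) shows the difference from a grouping approach:

```python
import itertools

split_string = 'string_to_split_on'
# Hypothetical sample data: one element contains a space
items = ['new york', 'b', split_string, 'c']

# Join/split: 'new york' is broken into two elements
via_join = [s.split() for s in ' '.join(items).split(split_string) if s]

# groupby: elements are compared as whole items, so they stay intact
via_groupby = [list(g) for k, g in itertools.groupby(items, lambda x: x != split_string) if k]

print(via_join)     # [['new', 'york', 'b'], ['c']]
print(via_groupby)  # [['new york', 'b'], ['c']]
```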

This approach avoids building one large intermediate string and compares whole elements, so it is better suited to large data sets:

import itertools

new_list = [list(j) for k, j in itertools.groupby(original_list, lambda x: x != split_string) if k]
print(new_list)

[['a', 'b'], ['c', 'd', 'e', 'f', 'g'], ['h', 'i', 'j', 'k']]
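
If the real data is too large to hold a second copy in memory, the loop from the question can also be written as a generator. This is only a sketch: `split_on` is a hypothetical helper name, not something from the standard library. Unlike the original loop, it also emits a trailing group when the list does not end with the separator. Note that two adjacent separators produce an empty sublist here, whereas the `groupby` version skips it.

```python
def split_on(items, separator):
    """Yield sublists of `items` delimited by `separator` (hypothetical helper)."""
    group = []
    for item in items:
        if item == separator:
            # Separator reached: hand back the group collected so far
            yield group
            group = []
        else:
            group.append(item)
    # Emit leftover items if the input did not end with a separator
    if group:
        yield group


original_list = ['a', 'b', 'string_to_split_on', 'c', 'd', 'e', 'f', 'g',
                 'string_to_split_on', 'h', 'i', 'j', 'k', 'string_to_split_on']

for sublist in split_on(original_list, 'string_to_split_on'):
    print(sublist)
```

Because the groups are produced one at a time, each one can be processed and discarded before the next is built. Wrap the call in `list(...)` if the full nested list is needed.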