I have some extremely large lists of character strings I need to parse. I need to break them into smaller lists based on a pre-defined character string, and I figured out a way to do it, but I worry that this will not be performant on my real data. Is there a better way to do this?
My goal is to turn this list:
['a', 'b', 'string_to_split_on', 'c', 'd', 'e', 'f', 'g', 'string_to_split_on', 'h', 'i', 'j', 'k', 'string_to_split_on']
Into this list:
[['a', 'b'], ['c', 'd', 'e', 'f', 'g'], ['h', 'i', 'j', 'k']]
What I tried:
# List that replicates my data. `string_to_split_on` is a fixed character string I want to break my list up on
my_list = ['a', 'b', 'string_to_split_on', 'c', 'd', 'e', 'f', 'g', 'string_to_split_on', 'h', 'i', 'j', 'k', 'string_to_split_on']
# Inspect List
print(my_list)
# Create empty lists to store dat ain
new_list = []
good_letters = []
# Iterate over each string in the list
for i in my_list:
# If the string is the seporator, append data to new_list, reset `good_letters` and move to the next string
if i == 'string_to_split_on':
new_list.append(good_letters)
good_letters = []
continue
# Append letter to the list of good letters
else:
good_letters.append(i)
# I just like printing things thay because its easy to read
for item in new_list:
print(item)
print('-'*100)
### Output
['a', 'b', 'string_to_split_on', 'c', 'd', 'e', 'f', 'g', 'string_to_split_on', 'h', 'i', 'j', 'k', 'string_to_split_on']
['a', 'b']
----------------------------------------------------------------------------------------------------
['c', 'd', 'e', 'f', 'g']
----------------------------------------------------------------------------------------------------
['h', 'i', 'j', 'k']
----------------------------------------------------------------------------------------------------
CodePudding user response:
You can also use one line of code:
original_list = ['a', 'b', 'string_to_split_on', 'c', 'd', 'e', 'f', 'g', 'string_to_split_on', 'h', 'i', 'j', 'k', 'string_to_split_on']
split_string = 'string_to_split_on'
new_list = [sublist.split() for sublist in ' '.join(original_list).split(split_string) if sublist]
print(new_list)
This approach is more efficient when dealing with large data set:
import itertools
new_list = [list(j) for k, j in itertools.groupby(original_list, lambda x: x != split_string) if k]
print(new_list)
[['a', 'b'], ['c', 'd', 'e', 'f', 'g'], ['h', 'i', 'j', 'k']]