How to split a list of strings using other strings as a reference

Time:02-25

I need to fix a small error (60k records out of a full set of 2M items) that crept in when the dataset was stacked into a fairly big file: somehow 2 consecutive records were merged together. Here are just a few examples:

list1= ['^ZZ.LAAA.IS','ELJ.ISELK.IS','NID.ISNIE.IS','KMH.DUKNC.DU']
list2= ['.LA','.LK','.IS','.DU']

I was thinking of splitting every record of list1 right after the end of whichever record of list2 occurs inside it. Records in list1 have different lengths, while the ones in list2 are about 2-3 characters long.

Example: the 2nd record of list1 is ELJ.ISELK.IS. The 3rd record of list2 (.IS) occurs in its middle, so splitting right after it gives the 2 new records that I need: ELJ.IS and ELK.IS

I intended to use regex, with something like:

for i in list1:
    for j in list2:
       new1 = re.sub(r'. (j)','',i)
       new2 = \1

but I'm unable to combine the statements properly. Maybe there is a built-in function or another way to achieve the task?

CodePudding user response:

The following re.findall approach seems to be working here:

import re

list1 = ['ZZ.LAAA.IS','ELJ.ISELK.IS','NID.ISNIE.IS','KMH.DUKNC.DU']
list2 = ['.LA','.LK','.IS','.DU']
# Escape each suffix and join them into one alternation
regex = r'^(.*?(?:' + r'|'.join([re.escape(x) for x in list2]) + r'))(.*)$'

for item in list1:
    parts = re.findall(regex, item)
    print(parts)

This prints:

[('ZZ.LA', 'AA.IS')]
[('ELJ.IS', 'ELK.IS')]
[('NID.IS', 'NIE.IS')]
[('KMH.DU', 'KNC.DU')]
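
If the goal is a repaired list rather than printed tuples, the same pattern can be applied to each record and the parts flattened. This is only a minimal sketch: the helper name split_merged and the extra sample record 'ABC.IS' are hypothetical, and it assumes that records which were not merged should pass through unchanged.

```python
import re

# 'ABC.IS' is an added, hypothetical record that was never merged.
list1 = ['ZZ.LAAA.IS', 'ELJ.ISELK.IS', 'NID.ISNIE.IS', 'KMH.DUKNC.DU', 'ABC.IS']
list2 = ['.LA', '.LK', '.IS', '.DU']

# Same pattern as in the answer, compiled once for reuse.
pattern = re.compile(
    r'^(.*?(?:' + '|'.join(re.escape(x) for x in list2) + r'))(.*)$'
)

def split_merged(record):
    """Split a record after its first known suffix (hypothetical helper)."""
    match = pattern.match(record)
    if match is None:
        # No known suffix at all: keep the record as it is.
        return [record]
    # Drop the empty second group that an unmerged record produces.
    return [part for part in match.groups() if part]

fixed = [part for record in list1 for part in split_merged(record)]
print(fixed)
```

Note that this splits each record only once, at the first suffix found, so a record made of three or more merged items would need a further pass.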

To be clear, here is the regex being used:

^
(.*?                         match and capture in group 1 until reaching
    (?:\.LA|\.LK|\.IS|\.DU)  the nearest .LA etc.
)
(.*)                         match and capture in group 2
$
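
The breakdown above can also be kept inside the code itself with re.VERBOSE, which ignores whitespace in the pattern and allows comments. A small sketch, assuming the same list2; the names suffixes and verbose_regex are only illustrative:

```python
import re

list2 = ['.LA', '.LK', '.IS', '.DU']
# re.escape protects the dots, so '.IS' only matches a literal '.IS'.
suffixes = '|'.join(re.escape(x) for x in list2)

verbose_regex = re.compile(rf'''
    ^
    (.*?                # group 1: as few characters as possible,
        (?:{suffixes})  # up to and including the nearest suffix
    )
    (.*)                # group 2: the rest of the string
    $
''', re.VERBOSE)

print(verbose_regex.findall('ELJ.ISELK.IS'))
```

This should behave the same as the one-line pattern; only the layout differs.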