How to remove all occurrences of a substring in a list of lists

Time:02-15

I have a list of lists of strings in Python, read from a .DAT file, that looks like this:

datContent = [['\x00\x00\x00\x00\x00\x00NGDUID\x00\x00\x00\x00\x00C\SAMPLEx00\x00\x00\x00', 'x00\x00\x00\x00NGDUID\x00\x00\x00\x00\x00C\SAMPLE2x00\x00\x00\x00'],
['\x00\x00x00\x00CY\x0059British', 'Columbia', '/', 'Colombie-Britannique\x00\x00\x00\', '\x00\x00\x00\x00212TroisRivieres-Montreal\x00\x00\x00\x00\'], 
...] #Sublist contains strings

I am trying to clean datContent so that every \x00 term is removed. This is what I have tried so far:

for i in range(len(datContent)):
      datContent[i]=[s.replace("\\x00\\", "") for s in datContent[i]] 

This code doesn't seem to remove those terms. Ideally, I want a list of lists containing every element except the x00 ones:

datContent=[['NGDUID', 'SAMPLE', 'NGDUID', 'SAMPLE2'], ['CY', '59BritishColumbia/Colombie-Britannique', 'TroisRivieres-Montreal'], ...]

When I loop over a sublist and print each element, the output looks correct:

for i in datContent[0]:
      print(i) #this prints the correct elements (skips every x00 element)

Any suggestions?

CodePudding user response:

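The strings contain actual NUL characters ('\x00'), not the literal text \x00, so the pattern to replace is '\x00' itself. A step-by-step approach builds a new list of lists as follows: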

datContent = [['\x00\x00\x00\x00\x00\x00NGDUID\x00\x00\x00\x00\x00C\SAMPLEx00\x00\x00\x00', '\x00\x00\x00\x00NGDUID\x00\x00\x00\x00\x00C\SAMPLE2x00\x00\x00\x00'], [
    '\x00\x00\x00\x00CY\x0059British', 'Columbia', '/', 'Colombie-Britannique\x00\x00\x00', '\x00\x00\x00\x00212TroisRivieres-Montreal\x00\x00\x00\x00']]

newDatContent = []
for row in datContent:
    newRow = []
    for string in row:
        newRow.append(string.replace('\x00', ''))
    newDatContent.append(newRow)

print(newDatContent)

Output:

[['NGDUIDC\\SAMPLEx00', 'NGDUIDC\\SAMPLE2x00'], ['CY59British', 'Columbia', '/', 'Colombie-Britannique', '212TroisRivieres-Montreal']]
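
The attempt in the question most likely fails because "\\x00\\" is the five-character literal text backslash, x, 0, 0, backslash, which never occurs in data that holds real NUL characters. Printing looked correct presumably because many terminals display nothing for NUL. The following is a minimal sketch of that difference, using a short made-up row (not the asker's actual file contents), together with the list-comprehension form of the loop above:

```python
# Hypothetical sample row containing real NUL characters.
row = ['\x00\x00NGDUID\x00', 'Columbia\x00\x00']

nul = '\x00'         # a single character: NUL
literal = "\\x00\\"  # five characters: backslash, x, 0, 0, backslash

# Searching for the literal text finds nothing, so the row is unchanged.
unchanged = [s.replace(literal, '') for s in row]

# Searching for the NUL character itself strips it.
cleaned = [s.replace(nul, '') for s in row]

print(len(nul), len(literal))
print(unchanged == row)
print(cleaned)
```

For a full list of lists, the same idea can be written as [[s.replace('\x00', '') for s in row] for row in datContent].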

CodePudding user response:

To get a list of lists containing every element except the x00 ones, you need to treat the x00 pattern as a delimiter.

A step-by-step approach using the re module:

import re

def convertDat(datContent):
  result = []
  for dat in datContent:
    # Convert the sublist to its string representation; NUL characters
    # appear there as the literal text \x00
    dat = str(dat)
    # Remove the list delimiter characters: , ' [ ] and spaces
    dat = re.sub(r"[,'\[\] ]", r"", dat)
    # Replace the x00 patterns (and stray backslashes) with a comma delimiter
    dat = re.sub(r"\\x00C\\|\\x00|x00|\\", r",", dat)
    # Split on the delimiter to rebuild the list
    dat = dat.split(",")
    # Drop empty strings
    dat = list(filter(None, dat))
    result.append(dat)
  return result

newContent = convertDat(datContent)
print(newContent)

Output:

[['NGDUID', 'SAMPLE', 'NGDUID', 'SAMPLE2'], ['CY', '59BritishColumbia/Colombie-Britannique', '212TroisRivieres-Montreal']]
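
Since str() followed by stripping commas, quotes and spaces would also remove those characters from the data itself, a possible variant is to join each row and split directly on runs of NUL. This is only a sketch under assumptions: the function name split_on_nul and the sample row are made up for illustration, and it does not handle the C\ prefix or the bare x00 text that appear in the question's first sublist.

```python
import re

def split_on_nul(rows):
    """Join each row into one string, then split it on runs of NUL characters."""
    result = []
    for row in rows:
        joined = ''.join(row)
        # re.split leaves empty strings at the edges; drop them.
        fields = [field for field in re.split(r'\x00+', joined) if field]
        result.append(fields)
    return result

# Hypothetical row modelled on the second sublist in the question.
sample = [['\x00\x00CY\x0059British', 'Columbia', '/',
           'Colombie-Britannique\x00\x00', '\x00\x00212TroisRivieres-Montreal\x00']]
print(split_on_nul(sample))
```

Here the NUL runs act as field separators, so pieces that were split across list items (such as '59British', 'Columbia', '/') end up merged into a single field.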