How can I remove words from a vector of strings?

Time:04-12

I'm new to Python and I need to remove part of each file name in this vector.

I have been trying something like:

for x in documents:
   x.replace("Sint", "")

But I'm not able to do it all at once.

I have this vector:

documents = ['SintEstatuto1009908_17032016.rtf.txt', 'SintEstatuto16545345_15042016.rtf.txt', 'Estatuto124452336145_02052016.rtf.txt', 'SintEstatuto1645649_04042014.rtf.txt', 'MartEstatuto2592451_20072011.rtf.txt', 'Estatuto77845645858_29645615.rtf.txt', 'Estatuto149453456678_2547042016.rtf.txt', 'BrewEstatuto128634565661_14042014.rtf.txt', 'MartEstatuto11454536186_26022014.rtf.txt', 'MartEstatuto1635456456462_09042016.rtf.txt', 'SintEstatuto64565468987_22012015.rtf.txt', 'ColdEstatuto9645668602_18042016.rtf.txt', 'SintEstatuto1374534196_26032013.rtf.txt', 'SintEstatuto12964456455654040_22122008.rtf.txt', 'SintEstatuto1559914_27042016.rtf.txt', 'SintEstatuto145645152097_24042015.rtf.txt', 'MartEstatuto01064590027_21082015.rtf.txt', 'SintEstatuto1060307_04032016.rtf.txt', 'SintEstatuto8404454566046_18102014.rtf.txt', 'ColdEstatuto123545345921_30042013.rtf.txt', 'BrewEstatuto45656456791_07032015.rtf.txt', 'BrewEstatuto129754345353_29042011.rtf.txt', 'MartEstatuto1526456924_14062016.rtf.txt', 'MartEstatuto1524536924_03042014.rtf.txt', 'SintEstatuto80233287_20032016.rtf.txt', 'SintEstatuto1604998_23032015.rtf.txt', 'SintEstatuto4295435438890_22112013.rtf.txt', 'BrewEstatuto991778678639_24042014.rtf.txt', 'BrewEstatuto1330354387_1045343082011.rtf.txt']

And I want to remove these words:

names = ['Sint', 'Mart', 'Cold', 'Brew']

So I want this result:

documents = ['Estatuto1009908_17032016.rtf.txt', 'Estatuto16545345_15042016.rtf.txt', 'Estatuto124452336145_02052016.rtf.txt', 'Estatuto1645649_04042014.rtf.txt', 'Estatuto2592451_20072011.rtf.txt', 'Estatuto77845645858_29645615.rtf.txt', 'Estatuto149453456678_2547042016.rtf.txt', 'Estatuto128634565661_14042014.rtf.txt', 'Estatuto11454536186_26022014.rtf.txt', 'Estatuto1635456456462_09042016.rtf.txt', 'Estatuto64565468987_22012015.rtf.txt', 'Estatuto9645668602_18042016.rtf.txt', 'Estatuto1374534196_26032013.rtf.txt', 'Estatuto12964456455654040_22122008.rtf.txt', 'Estatuto1559914_27042016.rtf.txt', 'Estatuto145645152097_24042015.rtf.txt', 'Estatuto01064590027_21082015.rtf.txt', 'Estatuto1060307_04032016.rtf.txt', 'Estatuto8404454566046_18102014.rtf.txt', 'Estatuto123545345921_30042013.rtf.txt', 'Estatuto45656456791_07032015.rtf.txt', 'Estatuto129754345353_29042011.rtf.txt', 'Estatuto1526456924_14062016.rtf.txt', 'Estatuto1524536924_03042014.rtf.txt', 'Estatuto80233287_20032016.rtf.txt', 'Estatuto1604998_23032015.rtf.txt', 'Estatuto4295435438890_22112013.rtf.txt', 'Estatuto991778678639_24042014.rtf.txt', 'Estatuto1330354387_1045343082011.rtf.txt']

How can I do it?

CodePudding user response:

You could build a regex alternation of the keywords to remove, then use re.sub:

import re

names = ['Sint', 'Mart', 'Cold', 'Brew']
# Anchor at the start of the string so only a leading name is removed.
regex = r'^(?:' + r'|'.join(names) + r')'
documents = ['SintEstatuto1009908_17032016.rtf.txt', 'SintEstatuto16545345_15042016.rtf.txt', 'Estatuto124452336145_02052016.rtf.txt', 'SintEstatuto1645649_04042014.rtf.txt', 'MartEstatuto2592451_20072011.rtf.txt', 'Estatuto77845645858_29645615.rtf.txt', 'Estatuto149453456678_2547042016.rtf.txt', 'BrewEstatuto128634565661_14042014.rtf.txt', 'MartEstatuto11454536186_26022014.rtf.txt', 'MartEstatuto1635456456462_09042016.rtf.txt', 'SintEstatuto64565468987_22012015.rtf.txt', 'ColdEstatuto9645668602_18042016.rtf.txt', 'SintEstatuto1374534196_26032013.rtf.txt', 'SintEstatuto12964456455654040_22122008.rtf.txt', 'SintEstatuto1559914_27042016.rtf.txt', 'SintEstatuto145645152097_24042015.rtf.txt', 'MartEstatuto01064590027_21082015.rtf.txt', 'SintEstatuto1060307_04032016.rtf.txt', 'SintEstatuto8404454566046_18102014.rtf.txt', 'ColdEstatuto123545345921_30042013.rtf.txt', 'BrewEstatuto45656456791_07032015.rtf.txt', 'BrewEstatuto129754345353_29042011.rtf.txt', 'MartEstatuto1526456924_14062016.rtf.txt', 'MartEstatuto1524536924_03042014.rtf.txt', 'SintEstatuto80233287_20032016.rtf.txt', 'SintEstatuto1604998_23032015.rtf.txt', 'SintEstatuto4295435438890_22112013.rtf.txt', 'BrewEstatuto991778678639_24042014.rtf.txt', 'BrewEstatuto1330354387_1045343082011.rtf.txt']
output = [re.sub(regex, '', x) for x in documents]
print(output)

This prints:

['Estatuto1009908_17032016.rtf.txt', 'Estatuto16545345_15042016.rtf.txt',
 'Estatuto124452336145_02052016.rtf.txt', ..., 'Estatuto1330354387_1045343082011.rtf.txt']
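
If the names could ever contain regex metacharacters such as '.' or '+', a slightly more defensive variant escapes each name with re.escape and compiles the pattern once. This is only a sketch under that assumption; the helper name strip_prefix and the shortened sample list are illustrative and not part of the original answer.

```python
import re

names = ['Sint', 'Mart', 'Cold', 'Brew']

# Escape each name so that any special characters are matched literally,
# then anchor the alternation at the start of the string.
pattern = re.compile(r'^(?:' + '|'.join(map(re.escape, names)) + r')')

def strip_prefix(filename):
    # Hypothetical helper: remove at most one leading name from the filename.
    return pattern.sub('', filename)

# Shortened sample taken from the question's list.
sample = [
    'SintEstatuto1009908_17032016.rtf.txt',
    'Estatuto124452336145_02052016.rtf.txt',
    'BrewEstatuto128634565661_14042014.rtf.txt',
]
print([strip_prefix(x) for x in sample])
```

Because the pattern is anchored with ^, a name that appears later in a file name is left untouched.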

CodePudding user response:

One option is to use str.removeprefix (available in Python 3.9 and later):

from functools import reduce
out = [reduce(lambda x, y: x.removeprefix(y), names, item) for item in documents]

The same code with an explicit loop:

out = []
for item in documents:
    for name in names:
        item = item.removeprefix(name)
    out.append(item)
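
As for why the loop in the question seems to do nothing: Python strings are immutable, so x.replace(...) returns a new string and leaves x unchanged; the result has to be stored somewhere. The sketch below shows one way to do that on interpreters older than 3.9, where str.removeprefix does not exist, using str.startswith and slicing. The function name remove_first_prefix is hypothetical, and unlike the loop above it strips at most one name per file.

```python
def remove_first_prefix(text, prefixes):
    # Hypothetical helper: strip the first matching prefix, if any.
    for prefix in prefixes:
        if text.startswith(prefix):
            return text[len(prefix):]
    return text

names = ['Sint', 'Mart', 'Cold', 'Brew']
# Shortened sample taken from the question's list.
documents = [
    'SintEstatuto1009908_17032016.rtf.txt',
    'Estatuto124452336145_02052016.rtf.txt',
    'MartEstatuto2592451_20072011.rtf.txt',
]

# Rebind the name to a new list; the original strings are never modified.
documents = [remove_first_prefix(d, names) for d in documents]
print(documents)
```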