Replace everything after the last occurrence of the given keywords in a string

Time:08-01

I need to clean a large number of documents by removing the redundant parts using a list of keywords.

[key1, key2, key3, ..., key20]

These keywords (anchor words) can appear multiple times at arbitrary positions in each document. I need to remove everything after the last keyword occurrence, that is, the one closest to the end of the string. Here are a few examples:

example1 = "aaa key1 bbb key5 ccc key3 ddd eee"

For this example, "ddd eee" should be removed because it is after the last keyword (key3).

example2 = "aaaa key10 vvv key5 nnnn key4 mnb bnn"

In this case, "mnb bnn" should be removed because it comes after key4, which is the last keyword.

example3 = "rrrr key10 bbbb key8 nnnn key6"

Nothing should be removed in this case.

To solve this problem, I tried the following regular expression:

s = re.sub('(key1|key2|key3|...)(.*?)$', r'\1', s)

I used lazy matching, expecting it to replace only the shortest possible tail, but it didn't work.
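
For reference, here is a minimal check of what the lazy pattern actually does. It uses a shortened, hypothetical three-key alternation in place of the elided keyword list. The regex engine looks for the leftmost match, so the alternation locks onto the first key in the string, and the lazy group then has to stretch all the way to `$` for the match to succeed.

```python
import re

s = "aaa key1 bbb key5 ccc key3 ddd eee"

# Hypothetical short alternation standing in for the full keyword list.
# The match starts at the FIRST key found; (.*?) is lazy, but it still
# must reach the end of the string because of the $ anchor.
lazy_result = re.sub(r'(key1|key3|key5)(.*?)$', r'\1', s)
print(lazy_result)  # aaa key1
```

So laziness shortens a group only relative to where the match started. It does not move the start of the match to a later key.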

CodePudding user response:

You were on the right track. We can solve your problem by greedily matching and capturing all content up to the last key, then replacing the whole match with the first capture group followed by the key.

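import re

inp = ["aaa key1 bbb key5 ccc key3 ddd eee", "aaaa key10 vvv key5 nnnn key4 mnb bnn", "rrrr key10 bbbb key8 nnnn key6"]
keys = ["key1", "key2", "key3", "key4", "key5", "key6", "key7", "key8", "key9", "key10"]
# Alternation of all keys, wrapped in word boundaries so "key1" does not match inside "key10"
regex = r'\b(' + r'|'.join(keys) + r')\b'
# Greedy (.*) consumes as much as possible, so the key captured in group 2 is the last one
output = [re.sub(r'(.*) ' + regex + r'.*', r'\1 \2', x) for x in inp]
print(output)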

This prints:

['aaa key1 bbb key5 ccc key3',
 'aaaa key10 vvv key5 nnnn key4',
 'rrrr key10 bbbb key8 nnnn key6']
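
If the documents are large or span several lines, a variant that avoids the greedy `.*` may be worth considering. The sketch below is an alternative, not the method from the answer above: it scans for every key with `finditer` and slices the text at the end of the last match. The helper name `trim_after_last_key` is hypothetical, and the sketch assumes each key is a single word that begins and ends with a word character, so that `\b` behaves as intended.

```python
import re

def trim_after_last_key(text, keys):
    """Return text cut off right after the last occurrence of any key."""
    # re.escape guards against keys that contain regex metacharacters.
    alternation = "|".join(re.escape(k) for k in keys)
    pattern = re.compile(r"\b(?:" + alternation + r")\b")
    last_end = None
    for match in pattern.finditer(text):
        last_end = match.end()  # keep overwriting; the final value is the last key
    # No key found: leave the text unchanged.
    return text if last_end is None else text[:last_end]

keys = ["key1", "key2", "key3", "key4", "key5", "key6", "key7", "key8", "key9", "key10"]
examples = [
    "aaa key1 bbb key5 ccc key3 ddd eee",
    "aaaa key10 vvv key5 nnnn key4 mnb bnn",
    "rrrr key10 bbbb key8 nnnn key6",
]
cleaned = [trim_after_last_key(x, keys) for x in examples]
print(cleaned)
```

Because it slices rather than substitutes, this version also works when the last key sits at the very start of the string or when the text contains newlines, two cases the `(.*) ` prefix with a default `.` would not cover.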