Substring replacement based on a condition (Python)

Time:12-07

I have a dataframe with a column containing strings (sentences). These strings contain many camel-cased abbreviations. A separate dictionary maps each abbreviation to its long form.

For example:

Dictionary: {'ShFrm':'Shortform', 'LgFrm':'Longform' ,'Auto':'Automatik'}

The dataframe column has text like this (for simplicity, each list entry is one row in the dataframe): ['ShFrmLgFrm should be replaced Automatically','Auto', 'AutoLgFrm']

If I simply replace using the dictionary, all replacements are correct except that 'Automatically' in the first text is converted to 'Automatikmatically'.
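The behaviour described above can be reproduced with plain str.replace, which knows nothing about word boundaries. This is a minimal sketch using the dictionary and the first row from the example; the helper name naive_replace is made up for illustration.

```python
d = {'ShFrm': 'Shortform', 'LgFrm': 'Longform', 'Auto': 'Automatik'}

def naive_replace(text, mapping):
    # Hypothetical helper: replaces every occurrence of each key,
    # regardless of the characters around it.
    for short, long_form in mapping.items():
        text = text.replace(short, long_form)
    return text

result = naive_replace('ShFrmLgFrm should be replaced Automatically', d)
print(result)
# => ShortformLongform should be replaced Automatikmatically
```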

I tried using regex patterns as the dictionary keys, with the condition that the abbreviation is replaced only if it has a space, the start of the string, or a lowercase letter before it, and a capital letter, a space, or the end of the sentence after it: '(?:^|[a-z])ShFrm(?:[^A-Z]|$)'. However, this also replaces the characters immediately before and after the abbreviation.
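The surrounding characters disappear because a non-capturing group such as (?:[^A-Z]|$) still consumes whatever it matches, so re.sub replaces those characters together with the abbreviation. A small sketch, using two input strings invented for illustration (they are not from the original data):

```python
import re

pattern = r'(?:^|[a-z])ShFrm(?:[^A-Z]|$)'

# The trailing space is matched by [^A-Z], becomes part of the match,
# and is therefore dropped by the substitution.
consumed_after = re.sub(pattern, 'Shortform', 'ShFrm should')
print(consumed_after)   # => Shortformshould

# The leading "x" is matched by [a-z] and is lost as well.
consumed_both = re.sub(pattern, 'Shortform', 'xShFrm y')
print(consumed_both)    # => Shortformy
```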

Could you please help me modify the regex so that it matches an abbreviation only if it is preceded by a lowercase letter, a space, or the start of a word, and followed by a capital letter, a space, or the end of a word, and so that only the abbreviation itself is replaced, not the characters before and after it?

CodePudding user response:

You need to build an alternation-based regex from the dictionary keys and use a lambda expression as the replacement argument.

See the following Python demo:

import re

d = {'ShFrm':'Shortform', 'LgFrm':'Longform' ,'Auto':'Automatik'}
col = ['ShFrmLgFrm should be replaced Automatically','Auto', 'AutoLgFrm']
# Lookarounds assert the context without consuming it; re.escape keeps the keys literal
rx = r'(?:\b|(?<=[a-z]))(?:{})(?=[A-Z]|\b)'.format("|".join(map(re.escape, d)))
# => (?:\b|(?<=[a-z]))(?:ShFrm|LgFrm|Auto)(?=[A-Z]|\b)
# The lambda looks up each matched abbreviation in the dictionary
print([re.sub(rx, lambda x: d[x.group()], v) for v in col])
# => ['ShortformLongform should be replaced Automatically', 'Automatik', 'AutomatikLongform']
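One further caveat: if one key is a prefix of another, the order of the alternatives matters, because the regex engine takes the first alternative that satisfies the lookahead. The sketch below sorts keys longest first. The helper name build_pattern and the extra key 'Sh' are hypothetical and only serve to show the effect.

```python
import re

def build_pattern(mapping, longest_first=True):
    # Hypothetical helper: escape each key and optionally try longer keys first.
    keys = sorted(mapping, key=len, reverse=True) if longest_first else list(mapping)
    alternation = "|".join(re.escape(k) for k in keys)
    return re.compile(r'(?:\b|(?<=[a-z]))(?:' + alternation + r')(?=[A-Z]|\b)')

# 'Sh' is an invented key that is a prefix of 'ShFrm'.
d = {'Sh': 'Short', 'ShFrm': 'Shortform', 'LgFrm': 'Longform'}
text = 'ShFrmLgFrm and Sh'

def expand(m):
    # Replacement callback: look the matched abbreviation up in the dictionary.
    return d[m.group()]

unsorted_out = build_pattern(d, longest_first=False).sub(expand, text)
sorted_out = build_pattern(d).sub(expand, text)
print(unsorted_out)  # => ShortFrmLongform and Short
print(sorted_out)    # => ShortformLongform and Short
```

In the unsorted case 'Sh' matches first because the following 'F' satisfies the (?=[A-Z]) lookahead, leaving 'Frm' behind.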

In Pandas, you would use it like this, where col is the name of the column to update (not the list from the demo above):

df[col] = df[col].str.replace(rx, lambda x: d[x.group()], regex=True)
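For completeness, here is a self-contained sketch of the Pandas variant. The column name 'text' and the dataframe contents are assumptions made for this example; substitute your own. Note that Series.str.replace accepts a callable replacement only when regex=True.

```python
import re
import pandas as pd

d = {'ShFrm': 'Shortform', 'LgFrm': 'Longform', 'Auto': 'Automatik'}
rx = r'(?:\b|(?<=[a-z]))(?:{})(?=[A-Z]|\b)'.format("|".join(map(re.escape, d)))

# 'text' is an assumed column name; each list entry is one row.
df = pd.DataFrame({'text': ['ShFrmLgFrm should be replaced Automatically',
                            'Auto',
                            'AutoLgFrm']})

# The callable receives a match object and returns the long form.
df['text'] = df['text'].str.replace(rx, lambda m: d[m.group()], regex=True)
print(df['text'].tolist())
```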


CodePudding user response:

You can use lookaround assertions, which check the text before or after the main expression without including it in the match. Ideally the pattern would be:

(?<=\b|[a-z])(ShFrm|LgFrm|Auto)(?=[A-Z]|\b)

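That expresses your requirements directly. However, Python's re module only supports fixed-width lookbehind, and (?<=\b|[a-z]) mixes a zero-width alternative with a one-character one, so re rejects it. We can switch to a negative lookbehind instead, which is slightly more permissive because it only rules out a preceding capital letter: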

import re

rx = r"(?<![A-Z])(ShFrm|LgFrm|Auto)(?=[A-Z]|\b)"
re.findall(rx, "['ShFrmLgFrm should be replaced Automatically','Auto', 'AutoLgFrm']")

Out: ['ShFrm', 'LgFrm', 'Auto', 'Auto', 'LgFrm']
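The negative-lookbehind pattern and the \b-based pattern from the other answer agree on the sample data but differ on some edge cases, for example when the abbreviation follows a digit or an underscore. A short comparison on inputs invented for illustration:

```python
import re

keys = 'ShFrm|LgFrm|Auto'
boundary_rx = re.compile(r'(?:\b|(?<=[a-z]))(?:' + keys + r')(?=[A-Z]|\b)')
negative_rx = re.compile(r'(?<![A-Z])(?:' + keys + r')(?=[A-Z]|\b)')

# Invented sample inputs, not from the original question.
samples = ['AutoLgFrm', 'XAuto', '2Auto', '_Auto']
comparison = {s: (boundary_rx.findall(s), negative_rx.findall(s)) for s in samples}
for s, (with_boundary, with_negative) in comparison.items():
    print(s, with_boundary, with_negative)
# => AutoLgFrm ['Auto', 'LgFrm'] ['Auto', 'LgFrm']
# => XAuto [] []
# => 2Auto [] ['Auto']
# => _Auto [] ['Auto']
```

Which behaviour is preferable depends on whether digits and underscores should count as part of the preceding word in your data.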
