Regex to remove dots only after short words and not after long words

Time:02-18

I would like to remove dots from abbreviations in a Pandas dataframe, but leave dots that follow longer words untouched. So 'l.t.d.' and 'ltd.' should both result in 'ltd', but 'longword.' should remain the same.

The regex I have now is (?:\b\w{1,3})(\.). I want to replace only what group 1 captures (the dot) with an empty string. How can I tell str.replace(r'(?:\b\w{1,3})(\.)', '') to replace only that capturing group and not the whole match?

CodePudding user response:

You can use

df['col'] = df['col'].str.replace(r'\b([a-zA-Z]{1,3})\.', r'\1', regex=True)
# Or, to match any Unicode letter rather than only ASCII letters:
df['col'] = df['col'].str.replace(r'\b([^\W\d_]{1,3})\.', r'\1', regex=True)

Details:

  • \b - word boundary
  • ([^\W\d_]{1,3}) - Group 1 (\1): one, two or three letters
  • \. - a dot.

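str.replace always replaces the whole match, so instead of targeting only the dot, the pattern captures the letters to keep and the replacement writes them back: the \1 in the replacement refers to the Group 1 value.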
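
To check the pattern outside pandas, here is a minimal sketch using only the standard re module on the three strings from the question. The names ABBREV_DOT and strip_abbrev_dots are made up for this illustration and are not part of pandas or re.

```python
import re

# Same pattern as the Unicode-aware variant above:
# word boundary, 1 to 3 letters (captured as group 1), then a literal dot.
ABBREV_DOT = re.compile(r'\b([^\W\d_]{1,3})\.')

def strip_abbrev_dots(text: str) -> str:
    # Hypothetical helper: replace each whole match with group 1 only,
    # which drops the dot that followed the short word.
    return ABBREV_DOT.sub(r'\1', text)

for sample in ['l.t.d.', 'ltd.', 'longword.']:
    print(repr(sample), '->', repr(strip_abbrev_dots(sample)))
# 'l.t.d.' -> 'ltd'
# 'ltd.' -> 'ltd'
# 'longword.' -> 'longword.'
```

'longword.' is left alone because \b only matches at the start of the word, and from there at most three letters can be consumed before a dot is required.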

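Note that you should pass regex=True to Series.str.replace explicitly. In pandas versions before 2.0, omitting it triggers the warning "FutureWarning: The default value of regex will change from True to False in a future version". From pandas 2.0 on, the default is regex=False, so without the argument the pattern is treated as a literal string and nothing is replaced.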
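
The effect of the regex argument can be seen in a small sketch, assuming pandas is installed. The Series s and its values are made up for illustration, taken from the examples in the question.

```python
import pandas as pd

# Illustrative data only: the three strings from the question.
s = pd.Series(['l.t.d.', 'ltd.', 'longword.'])
pattern = r'\b([a-zA-Z]{1,3})\.'

# regex=True: the pattern is compiled as a regular expression.
with_regex = s.str.replace(pattern, r'\1', regex=True)

# regex=False: the pattern is searched for as a literal substring,
# which occurs in none of the values, so nothing changes.
literal = s.str.replace(pattern, r'\1', regex=False)

print(with_regex.tolist())  # ['ltd', 'ltd', 'longword.']
print(literal.tolist())     # ['l.t.d.', 'ltd.', 'longword.']
```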
