I would like to remove dots from abbreviations in a Pandas dataframe but not if the dots are in between longer words. So 'l.t.d.' and 'ltd.' should result in 'ltd' but 'longword.' should remain the same.
The regex I now have is (?:\b\w{1,3})(\.). From this regex, I want to replace the result in group 1 by an empty string. How can I tell str.replace(r'(?:\b\w{1,3})(\.)', '') to consider only that capturing group (group 1)?
Answer:
You can use
df['col'] = df['col'].str.replace(r'\b([a-zA-Z]{1,3})\.', r'\1', regex=True)
## Or, to account for any Unicode letters:
df['col'] = df['col'].str.replace(r'\b([^\W\d_]{1,3})\.', r'\1', regex=True)
Details:
\b - word boundary
([^\W\d_]{1,3}) - Group 1 (\1): one, two or three letters
\. - a dot.
The \1 in the replacement refers to the Group 1 value.
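As a quick check of the pattern outside pandas, here is a minimal sketch using Python's standard re module on the three strings from the question. The helper name strip_abbrev_dots and the samples list are illustrative only, not part of the original answer.

```python
import re

# Same pattern as in the answer: one to three letters at a word boundary, then a dot.
ABBREV_DOT = re.compile(r'\b([^\W\d_]{1,3})\.')

def strip_abbrev_dots(text):
    # Keep the letters captured in Group 1 and drop the dot that follows them.
    return ABBREV_DOT.sub(r'\1', text)

samples = ['l.t.d.', 'ltd.', 'longword.']
results = [strip_abbrev_dots(s) for s in samples]
print(results)  # ['ltd', 'ltd', 'longword.']
```

Note that any word of one to three letters followed by a dot matches, so a short word at the end of a sentence (for example "is.") would lose its dot as well.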
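Note you should provide the regex=True argument to Series.str.replace to avoid the warning described in FutureWarning: The default value of regex will change from True to False in a future version. In pandas 2.0 and later the default is already regex=False, so without this argument the pattern is treated as a literal string and nothing is replaced.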
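To see what the regex argument changes, the following sketch runs the same call both ways on a small made-up DataFrame. The column name 'col' follows the answer; the sample values come from the question, and the variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col': ['l.t.d.', 'ltd.', 'longword.']})
pattern = r'\b([a-zA-Z]{1,3})\.'

# regex=True: the pattern is compiled as a regular expression.
as_regex = df['col'].str.replace(pattern, r'\1', regex=True).tolist()

# regex=False: the pattern is searched for as a literal string,
# which does not occur in any of these values, so nothing changes.
as_literal = df['col'].str.replace(pattern, r'\1', regex=False).tolist()

print(as_regex)    # ['ltd', 'ltd', 'longword.']
print(as_literal)  # ['l.t.d.', 'ltd.', 'longword.']
```

Passing regex explicitly keeps the behaviour the same regardless of the installed pandas version.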