How to remove numbers from string but keep specific groups of numbers?

Time:04-23

I want to use a Python regular expression to remove numbers from a string but keep the numbers 754 and 1231, since they refer to tax code Section 754 and Section 1231. For example, I have the text data below:

test="""Dividends 9672
Dividends 9680
Interest Income
Ordinary Dividends
Royalties
Capital Gain Distributions
Income from Blackstone
Ordinary Income
Rental Income
Long Term Capital Gain
Short Term Capital Gain
1231 Gain
Section 754 Stock Basis Adjustment - 2015
M-1 Section 754 Stock Basis Adjustment - 2015
Section 754 Stock Basis Adjustment - 2018
M-1 Section 754 Stock basis adjustment - 2018
"""

and I want the output to be:

Dividends
Dividends
Interest Income
Ordinary Dividends
Royalties
Capital Gain Distributions
Income from Blackstone
Ordinary Income
Rental Income
Long Term Capital Gain
Short Term Capital Gain
1231 Gain
Section 754 Stock Basis Adjustment
M- Section 754 Stock Basis Adjustment
Section 754 Stock Basis Adjustment
M- Section 754 Stock basis adjustment

My attempt is:

import re
test=re.sub(r'[^(754)(1231)A-Za-z]','',test)
print(test)

but it doesn't treat 754 or 1231 as whole groups. Every character inside the brackets counts individually, so of the digits only 0, 6, 8 and 9 are removed.

CodePudding user response:

You can use

re.sub(r'(754|1231)|[^A-Za-z\s]', r'\1', text)

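Here, (754|1231) matches either 754 or 1231 and captures it into Group 1, while [^A-Za-z\s] matches any character other than an ASCII letter or any Unicode whitespace. Every match is replaced with the Group 1 value, so a captured number is put back into the string and any other matched character is dropped (when Group 1 did not take part in the match, \1 expands to an empty string in Python 3.5 and later).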
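As a rough illustration of the replacement described above, here is a minimal sketch run on a few lines from the question's data. The variable names `pattern`, `cleaned` and `lines` are only illustrative, and the `rstrip()` tidy-up is an addition that is not part of the answer.

```python
import re

# Sample lines taken from the question's data.
text = """Dividends 9672
1231 Gain
Section 754 Stock Basis Adjustment - 2015
M-1 Section 754 Stock Basis Adjustment - 2018
"""

# Group 1 captures the codes to keep; the second alternative matches
# any other character that is neither an ASCII letter nor whitespace.
pattern = re.compile(r'(754|1231)|[^A-Za-z\s]')

# When the second alternative matches, Group 1 is unset and \1 expands
# to an empty string (Python 3.5+), so that character is dropped.
cleaned = pattern.sub(r'\1', text)

# Optional tidy-up (an addition, not part of the answer): strip the
# trailing spaces left behind where numbers and hyphens were removed.
lines = [line.rstrip() for line in cleaned.splitlines()]
print("\n".join(lines))
```

Note that the hyphen in "M-1" is removed as well, because it is neither a letter nor whitespace, so this sketch yields "M Section 754 ..." rather than the "M- Section 754 ..." shown in the question's desired output.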

Note: if the numbers must match only as whole numbers (not as part of a longer digit sequence), use digit boundaries:

re.sub(r'(?<!\d)(754|1231)(?!\d)|[^A-Za-z\s]', r'\1', text)
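To see what the digit boundaries change, the sketch below compares both patterns on a made-up input in which 17540 merely contains the digits 754. The sample string and the names `loose` and `strict` are hypothetical and chosen only for this comparison.

```python
import re

# The two patterns from the answer, without and with digit boundaries.
loose = r'(754|1231)|[^A-Za-z\s]'
strict = r'(?<!\d)(754|1231)(?!\d)|[^A-Za-z\s]'

# Hypothetical input: 17540 is a different number that contains 754.
sample = "Section 754 ref 17540"

# Without boundaries, the 754 inside 17540 is kept.
loose_result = re.sub(loose, r'\1', sample)

# With boundaries, 754 is kept only when no digit precedes or follows it,
# so every digit of 17540 is removed.
strict_result = re.sub(strict, r'\1', sample)

print(repr(loose_result))   # 'Section 754 ref 754'
print(repr(strict_result))  # 'Section 754 ref '
```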