I am trying to remove all kinds of bullet points in different formats. These are the cases that I have:
c.2 Employed population below international poverty line, by sex and age (%) Age: 15
b.1 Employed population below international poverty line, by sex and age (%) Age: 15
a.1 Employed population below international poverty line, by sex and age (%) Age: 15
1. Employed population below international poverty line, by sex and age (%) Age: 15
1.2 Employed population below international poverty line, by sex and age (%) Age: 15
1.1.1 Employed population below international poverty line, by sex and age (%) Age: 15
5.6.2 (S.1.C.1) Employed population below international poverty line, by sex and age (%) Age: 15
5.6.2 (S.3) Employed population below international poverty line, by sex and age (%) Age: 15
5.6.2 (S.4.C.13) Employed population below international poverty line, by sex and age (%) Age: 15
I want a regex that removes the bullet points no matter what form they are in, leaving only: Employed population below international poverty line, by sex and age (%) Age: 15
I tried to use:
^(?:\d+\.)+\d*\s*
It works, but it only detects prefixes like 1. or 1.2 or 1.1.1. That was all I needed in the beginning, so it was correct then, but my input has now changed to the cases above.
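To make the limitation concrete, here is a minimal sketch that runs the pattern above against a few of the cases. The shortened sample strings and the names old_pattern, samples and results are only for illustration.

```python
import re

# The pattern from the attempt above: purely numeric, dot-separated prefixes.
old_pattern = re.compile(r"^(?:\d+\.)+\d*\s*")

# Shortened versions of the cases (illustrative only).
samples = [
    "1. Employed population",
    "1.1.1 Employed population",
    "c.2 Employed population",
    "5.6.2 (S.3) Employed population",
]

results = [old_pattern.sub("", s) for s in samples]
for before, after in zip(samples, results):
    print(f"{before!r} -> {after!r}")
```

The numeric prefixes are removed, but the letter prefix (c.2) is left untouched and the parenthesised part (S.3) stays in the text.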
Thank you in advance. Side note: I use Python 3.
CodePudding user response:
^[a-z\d+]\.(\d+)?\.?(\d+)?(\s\(.*\)\s)?\s
This should catch the bullet prefixes in your example. Here is a demo: https://regex101.com/r/sj4PgN/2
CodePudding user response:
You can use
^(?:[a-z]|\d+)(?:\.\d+)*\.?\s*(?:\([^()]*\)\s*)?
Explanation
^  Start of string
(?:[a-z]|\d+)  Match either a single char a-z or 1+ digits
(?:\.\d+)*  Optionally repeat a dot followed by 1+ digits
\.?  Match an optional dot
\s*  Match optional whitespace chars
(?:\([^()]*\)\s*)?  Optionally match a part (...) followed by optional whitespace chars
In the replacement use an empty string.
If the part between the parentheses always follows a specific pattern, namely an uppercase char A-Z followed by a dot and digit(s), optionally repeated, you can match it more strictly:
^(?:[a-z]|\d+)(?:\.\d+)*\.?\s*(?:\([A-Z]\.\d+(?:\.[A-Z]\.\d+)*\)\s*)?
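As a rough comparison of the two variants, the sketch below applies both to a line from the question and to a made-up line whose parentheses hold ordinary text. The input "1.2 (provisional) ..." is hypothetical and not from the question; the names loose and strict are for illustration only.

```python
import re

# General variant: any (...) group directly after the bullet is removed.
loose = re.compile(r"^(?:[a-z]|\d+)(?:\.\d+)*\.?\s*(?:\([^()]*\)\s*)?")
# Stricter variant: only groups shaped like (S.1) or (S.4.C.13) are removed.
strict = re.compile(
    r"^(?:[a-z]|\d+)(?:\.\d+)*\.?\s*(?:\([A-Z]\.\d+(?:\.[A-Z]\.\d+)*\)\s*)?"
)

code_line = "5.6.2 (S.4.C.13) Employed population"
note_line = "1.2 (provisional) Employed population"  # hypothetical input

loose_code = loose.sub("", code_line)
strict_code = strict.sub("", code_line)
loose_note = loose.sub("", note_line)
strict_note = strict.sub("", note_line)

print(loose_code, strict_code, loose_note, strict_note, sep="\n")
```

Both variants strip the coded group, but only the general one also removes (provisional). The stricter one is the safer choice if parentheses at the start of the text can carry real content.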
Example
import re

# Matches a leading bullet such as "c.2", "1.", "1.1.1" or "5.6.2 (S.1.C.1)"
pattern = r"^(?:[a-z]|\d+)(?:\.\d+)*\.?\s*(?:\([^()]*\)\s*)?"

s = ("c.2 Employed population below international poverty line, by sex and age (%) Age: 15 \n"
     "b.1 Employed population below international poverty line, by sex and age (%) Age: 15 \n"
     "a.1 Employed population below international poverty line, by sex and age (%) Age: 15 \n"
     "1. Employed population below international poverty line, by sex and age (%) Age: 15 \n"
     "1.2 Employed population below international poverty line, by sex and age (%) Age: 15 \n"
     "1.1.1 Employed population below international poverty line, by sex and age (%) Age: 15 \n"
     "5.6.2 (S.1.C.1) Employed population below international poverty line, by sex and age (%) Age: 15 \n"
     "5.6.2 (S.3) Employed population below international poverty line, by sex and age (%) Age: 15 \n"
     "5.6.2 (S.4.C.13) Employed population below international poverty line, by sex and age (%) Age: 15 ")

# re.MULTILINE makes ^ match at the start of every line, not only the start of the string
result = re.sub(pattern, "", s, flags=re.MULTILINE)

if result:
    print(result)
Output
Employed population below international poverty line, by sex and age (%) Age: 15
Employed population below international poverty line, by sex and age (%) Age: 15
Employed population below international poverty line, by sex and age (%) Age: 15
Employed population below international poverty line, by sex and age (%) Age: 15
Employed population below international poverty line, by sex and age (%) Age: 15
Employed population below international poverty line, by sex and age (%) Age: 15
Employed population below international poverty line, by sex and age (%) Age: 15
Employed population below international poverty line, by sex and age (%) Age: 15
Employed population below international poverty line, by sex and age (%) Age: 15
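If the labels arrive one string at a time (for example from a list or a DataFrame column) rather than as one multi-line string, the same pattern can be wrapped in a small helper. This is only a sketch: strip_bullet, BULLET and the sample list are hypothetical names, and the trailing strip() is an extra assumption about unwanted whitespace.

```python
import re

# Same pattern as above, compiled once; no re.MULTILINE needed per string.
BULLET = re.compile(r"^(?:[a-z]|\d+)(?:\.\d+)*\.?\s*(?:\([^()]*\)\s*)?")

def strip_bullet(label: str) -> str:
    """Remove one leading bullet prefix and surrounding whitespace."""
    return BULLET.sub("", label, count=1).strip()

text = "Employed population below international poverty line, by sex and age (%) Age: 15"
labels = [f"c.2 {text} ", f"1. {text}", f"5.6.2 (S.1.C.1) {text}"]
cleaned = [strip_bullet(label) for label in labels]
print(cleaned)

# Caveat: a single lowercase letter or a number counts as a bullet,
# so text without a bullet that starts this way loses its first token.
unbulleted = strip_bullet("age 15")
print(repr(unbulleted))
```

Because of the caveat shown in the last lines, this is best applied only to strings that are known to start with a bullet.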