In pandas I have a DataFrame column called TermNew containing the following lowercase strings (please ignore the bullet points; I was having trouble formatting):
TermNew
999 years from 1/01/2001
999 years (less 20 days) from 20/11/2000
99 years from 1/10/1979
999 years from 1/01/1992
I'm trying to extract all letter characters (a-z only; no digits, whitespace, parentheses or slashes) from TermNew into a new column, Termtext, with these expected results:
Termtext
- yearsfrom
- yearslessdaysfrom
- yearsfrom
- yearsfrom
I've tried the following, but it only returns the letters up to the first whitespace:
leaseterm1['Termtext'] = leaseterm1['TermNew'].str.extract(r"([a-z]+)")
Outputs
years
years
years
years
In regex101 I can use the global flag to match all the letters correctly.
Questions
1/ Is this a problem with str.extract only finding the first match, or
2/ Is this a regex problem, because I haven't included any form of global search past the first whitespace?
Any suggestions gratefully received. Thanks
CodePudding user response:
It is easier to remove every character that is not a-z:
leaseterm1['Termtext'] = leaseterm1['TermNew'].str.replace(r"[^a-z]+", "", regex=True)
Output
TermNew Termtext
0 999 years from 1/01/2001 yearsfrom
1 999 years (less 20 days) from 20/11/2000 yearslessdaysfrom
2 99 years from 1/10/1979 yearsfrom
3 999 years from 1/01/1992 yearsfrom
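For anyone who wants to try the replace approach end to end, here is a minimal self-contained sketch. The sample DataFrame is rebuilt from the strings in the question, and regex=True is passed explicitly because the default for str.replace became regex=False in pandas 2.0.

```python
import pandas as pd

# Sample data rebuilt from the question
leaseterm1 = pd.DataFrame({
    "TermNew": [
        "999 years from 1/01/2001",
        "999 years (less 20 days) from 20/11/2000",
        "99 years from 1/10/1979",
        "999 years from 1/01/1992",
    ]
})

# Delete every run of characters that is not a lowercase letter.
# regex=True is required for the pattern to be treated as a regex in pandas 2.x.
leaseterm1["Termtext"] = leaseterm1["TermNew"].str.replace(r"[^a-z]+", "", regex=True)

print(leaseterm1["Termtext"].tolist())
```

Note that [^a-z] would also strip uppercase letters, so this relies on the strings already being lowercase, as stated in the question.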
CodePudding user response:
You can use str.extractall() instead and aggregate the results of the multiple matches, as follows:
leaseterm1['Termtext'] = leaseterm1['TermNew'].str.extractall(r"([a-z]+)").groupby(level=0).agg(''.join)
or use GroupBy.sum for aggregation:
leaseterm1['Termtext'] = leaseterm1['TermNew'].str.extractall(r"([a-z]+)").groupby(level=0).sum(numeric_only=False)
Result:
print(leaseterm1)
TermNew Termtext
0 999 years from 1/01/2001 yearsfrom
1 999 years (less 20 days) from 20/11/2000 yearslessdaysfrom
2 99 years from 1/10/1979 yearsfrom
3 999 years from 1/01/1992 yearsfrom
Regarding your questions:
The official documentation for str.extract() says:
"For each subject string in the Series, extract groups from the first match of regular expression pat."
So str.extract() extracts the first match only. If you want every match, you should use str.extractall() instead. Its documentation says:
"For each subject string in the Series, extract groups from all matches of regular expression pat."
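The difference between the two methods can be seen side by side in a small sketch. The Series name terms and the variable names are only for illustration; the strings are two of the rows from the question.

```python
import pandas as pd

terms = pd.Series([
    "999 years from 1/01/2001",
    "999 years (less 20 days) from 20/11/2000",
])

# str.extract: one row per input string, holding only the first match
first_only = terms.str.extract(r"([a-z]+)")

# str.extractall: one row per match, indexed by (original row, match number)
all_matches = terms.str.extractall(r"([a-z]+)")

# Join the matches back into one string per original row (index level 0)
joined = all_matches[0].groupby(level=0).agg("".join)

print(first_only[0].tolist())
print(joined.tolist())
```

So the answer to question 1 is yes: the pattern itself is fine, and it is str.extract that stops after the first match.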