pandas str extract use regex to select all letters from a string only returning 1st match

In Pandas I have a dataframe column called TermNew containing the following lowercase strings (please ignore bullet points - I was having trouble formatting)

TermNew

999 years from 1/01/2001
999 years (less 20 days) from 20/11/2000
99 years from 1/10/1979
999 years from 1/01/1992

Im trying to extract all letter characters (a-z only, no digits, no whitespace, no () or /) from TermNew into a new column Termtext with these expected outcomes

Termtext

yearsfrom
yearslessdaysfrom
yearsfrom
yearsfrom

Ive tried the following but its only returning letters up to the first white space i.e.

leaseterm1['Termtext'] = leaseterm1['TermNew'].str.extract(r"([a-z] )")

Outputs

years
years
years
years

In regex101 I can use the global flag to match all letters correctly See example

Questions

1/ Is this a problem with str extract only finding the first match or

2/ Is this a regex problem - I havent included any form of global search past the 1st whitespace?

Any suggestions gratefully received. Thanks

CodePudding user response：

It is easier to replace all characters other than a-z

leaseterm1['Termtext'] = leaseterm1['TermNew'].str.replace(r"[^a-z] ", "")

Output

                                    TermNew           Termtext
0                  999 years from 1/01/2001          yearsfrom
1  999 years (less 20 days) from 20/11/2000  yearslessdaysfrom
2                   99 years from 1/10/1979          yearsfrom
3                  999 years from 1/01/1992          yearsfrom

CodePudding user response：

You can use str.extractall() instead and aggregate the results of multiple matches, as follows:

leaseterm1['Termtext'] = leaseterm1['TermNew'].str.extractall(r"([a-z] )").groupby(level=0).agg(''.join)

or use GroupBy.sum for aggregation:

leaseterm1['Termtext'] = leaseterm1['TermNew'].str.extractall(r"([a-z] )").groupby(level=0).sum(numeric_only=False)

Result:

print(leaseterm1)

                                    TermNew           Termtext
0                  999 years from 1/01/2001          yearsfrom
1  999 years (less 20 days) from 20/11/2000  yearslessdaysfrom
2                   99 years from 1/10/1979          yearsfrom
3                  999 years from 1/01/1992          yearsfrom

Regarding to your questions:

As you can see from the official doc of str.extract()

For each subject string in the Series, extract groups from the first match of regular expression pat.

str.extract() extracts the first match only.

If you want to extract for multiple matches, you should use str.extractall() instead.

For str.extractall():

For each subject string in the Series, extract groups from all matches of regular expression pat.