Home > Net >  pandas str extract use regex to select all letters from a string only returning 1st match - str extr
pandas str extract use regex to select all letters from a string only returning 1st match - str extr

Time:10-27

In Pandas I have a dataframe column called TermNew containing the following lowercase strings (please ignore bullet points - I was having trouble formatting)

TermNew

  • 999 years from 1/01/2001

  • 999 years (less 20 days) from 20/11/2000

  • 99 years from 1/10/1979

  • 999 years from 1/01/1992

Im trying to extract all letter characters (a-z only, no digits, no whitespace, no () or /) from TermNew into a new column Termtext with these expected outcomes

Termtext

  • yearsfrom
  • yearslessdaysfrom
  • yearsfrom
  • yearsfrom

Ive tried the following but its only returning letters up to the first white space i.e.

leaseterm1['Termtext'] = leaseterm1['TermNew'].str.extract(r"([a-z] )")

Outputs

  • years

  • years

  • years

  • years

In regex101 I can use the global flag to match all letters correctly See example

Questions

1/ Is this a problem with str extract only finding the first match or

2/ Is this a regex problem - I havent included any form of global search past the 1st whitespace?

Any suggestions gratefully received. Thanks

CodePudding user response:

It is easier to replace all characters other than a-z

leaseterm1['Termtext'] = leaseterm1['TermNew'].str.replace(r"[^a-z] ", "")

Output

                                    TermNew           Termtext
0                  999 years from 1/01/2001          yearsfrom
1  999 years (less 20 days) from 20/11/2000  yearslessdaysfrom
2                   99 years from 1/10/1979          yearsfrom
3                  999 years from 1/01/1992          yearsfrom

CodePudding user response:

You can use str.extractall() instead and aggregate the results of multiple matches, as follows:

leaseterm1['Termtext'] = leaseterm1['TermNew'].str.extractall(r"([a-z] )").groupby(level=0).agg(''.join)

or use GroupBy.sum for aggregation:

leaseterm1['Termtext'] = leaseterm1['TermNew'].str.extractall(r"([a-z] )").groupby(level=0).sum(numeric_only=False)

Result:

print(leaseterm1)

                                    TermNew           Termtext
0                  999 years from 1/01/2001          yearsfrom
1  999 years (less 20 days) from 20/11/2000  yearslessdaysfrom
2                   99 years from 1/10/1979          yearsfrom
3                  999 years from 1/01/1992          yearsfrom

Regarding to your questions:

As you can see from the official doc of str.extract()

For each subject string in the Series, extract groups from the first match of regular expression pat.

str.extract() extracts the first match only.

If you want to extract for multiple matches, you should use str.extractall() instead.

For str.extractall():

For each subject string in the Series, extract groups from all matches of regular expression pat.

  • Related