Home > Software engineering >  pandas extract two capture groups get all
pandas extract two capture groups get all

Time:03-07

I am just not quite getting this regex. Can someone help please.

Start with this:

s = pd.Series(['Up to £48,000 a year', '£50,000 - £60,000 a year', '£11.86 - £14.69 an hour'])

0        Up to £48,000 a year
1    £50,000 - £60,000 a year
2     £11.86 - £14.69 an hour
3                   no salary

First try (missing £48,000)

s = pd.Series(['Up to £48,000 a year', '£50,000 - £60,000 a year', '£11.86 - £14.69 an hour', 'no salary'])
s.str.extract(r'(\d\d[.,]\d\d\d?).*(\d\d[.,]\d\d\d?)')

    0       1
0   NaN     NaN
1   50,000  60,000
2   11.86   14.69
3   NaN     NaN

Second try (missing all in second column)

s = pd.Series(['Up to £48,000 a year', '£50,000 - £60,000 a year', '£11.86 - £14.69 an hour', 'no salary'])
s.str.extract(r'(\d\d[.,]\d\d\d?).*(\d\d[.,]\d\d\d?)?')

    0       1
0   48,000  NaN
1   50,000  NaN
2   11.86   NaN
3   NaN     NaN

How do I get this?

    0       1
0   48,000  NaN
1   50,000  60,000
2   11.86   14.69
3   NaN     NaN

CodePudding user response:

Try Series.str.extractall.

It is possible to find all capturing group specified in the RegEx pattern and reshape after using unstack.

s.str.extractall(r'([.\d][.,\d]*)').unstack()
match       0       1
0      48,000     NaN
1      50,000  60,000
2       11.86   14.69

You can test it using regex101

CodePudding user response:

A not very elegant approach that does this specific job

vals = s.values
rgx = '[A-Za-z£-]' # instead of finding the digits, get rid of eerything else
vals = list(map(lambda x: re.sub(rgx,'',x).strip(), vals))
vals = list(map(lambda x: re.split(r'\s ',x),vals))
vals = list(map(lambda x: x if len(x) == 2 else [x[0],np.NAN],vals))
s = pd.DataFrame(vals)
print(s)

        0       1
0  48,000     NaN
1  50,000  60,000
2   11.86   14.69

Or something like this:

df = pd.DataFrame(s.values)
rgx = '[A-Za-z£-]'
df[0] = df[0].str.replace(rgx,'', regex = True)
df = df[0].str.split(r'\s ',expand = True)
df.pop(df.columns[-1])

None of them are ideal for a more general case, but if it works...

  • Related