I am just not quite getting this regex. Can someone help please.
Start with this:
s = pd.Series(['Up to £48,000 a year', '£50,000 - £60,000 a year', '£11.86 - £14.69 an hour'])
0 Up to £48,000 a year
1 £50,000 - £60,000 a year
2 £11.86 - £14.69 an hour
3 no salary
First try (missing £48,000)
s = pd.Series(['Up to £48,000 a year', '£50,000 - £60,000 a year', '£11.86 - £14.69 an hour', 'no salary'])
s.str.extract(r'(\d\d[.,]\d\d\d?).*(\d\d[.,]\d\d\d?)')
0 1
0 NaN NaN
1 50,000 60,000
2 11.86 14.69
3 NaN NaN
Second try (missing all in second column)
s = pd.Series(['Up to £48,000 a year', '£50,000 - £60,000 a year', '£11.86 - £14.69 an hour', 'no salary'])
s.str.extract(r'(\d\d[.,]\d\d\d?).*(\d\d[.,]\d\d\d?)?')
0 1
0 48,000 NaN
1 50,000 NaN
2 11.86 NaN
3 NaN NaN
How do I get this?
0 1
0 48,000 NaN
1 50,000 60,000
2 11.86 14.69
3 NaN NaN
CodePudding user response:
Try Series.str.extractall
.
It is possible to find all capturing group specified in the RegEx pattern and reshape after using unstack
.
s.str.extractall(r'([.\d][.,\d]*)').unstack()
match 0 1
0 48,000 NaN
1 50,000 60,000
2 11.86 14.69
You can test it using regex101
CodePudding user response:
A not very elegant approach that does this specific job
vals = s.values
rgx = '[A-Za-z£-]' # instead of finding the digits, get rid of eerything else
vals = list(map(lambda x: re.sub(rgx,'',x).strip(), vals))
vals = list(map(lambda x: re.split(r'\s ',x),vals))
vals = list(map(lambda x: x if len(x) == 2 else [x[0],np.NAN],vals))
s = pd.DataFrame(vals)
print(s)
0 1
0 48,000 NaN
1 50,000 60,000
2 11.86 14.69
Or something like this:
df = pd.DataFrame(s.values)
rgx = '[A-Za-z£-]'
df[0] = df[0].str.replace(rgx,'', regex = True)
df = df[0].str.split(r'\s ',expand = True)
df.pop(df.columns[-1])
None of them are ideal for a more general case, but if it works...