Say I have a dataframe :
df = pd.DataFrame({'date' : ['Nov 21, 2000','Sep 08, 2000','Nov, 1999','Nov 26, 1998','Apr, 2002','2001','Mar 26, 1998', 'Oct 07, 1999', 'Nov 20, 1995', 'Mar 15, 1996']})
date
0 Nov 21, 2000
1 Sep 08, 2000
2 Nov, 1999
3 Nov 26, 1998
4 Apr, 2002
5 2001
6 Mar 26, 1998
7 Oct 07, 1999
8 Nov 20, 1995
9 Mar 15, 1996
And I have a regex which has multiple patterns
p = '(\\d{4})|\\s(\\d{2}),|([aA-zZ] )'
df['date'].str.extractall(p).apply(lambda x: x,axis=0)
I get the values but NaN
values for other corresponding rows
0 1 2
match
0 0 NaN NaN Nov
1 NaN 21 NaN
2 2000 NaN NaN
1 0 NaN NaN Sep
1 NaN 08 NaN
2 2000 NaN NaN
2 0 NaN NaN Nov
1 1999 NaN NaN
3 0 NaN NaN Nov
1 NaN 26 NaN
2 1998 NaN NaN
4 0 NaN NaN Apr
1 2002 NaN NaN
5 0 2001 NaN NaN
6 0 NaN NaN Mar
1 NaN 26 NaN
2 1998 NaN NaN
7 0 NaN NaN Oct
1 NaN 07 NaN
2 1999 NaN NaN
8 0 NaN NaN Nov
1 NaN 20 NaN
2 1995 NaN NaN
9 0 NaN NaN Mar
1 NaN 15 NaN
2 1996 NaN NaN
Expected Output :
0 1 2
0 2000 21 Nov
1 2000 08 Sep
2 1999 NaN Nov
3 1998 26 Nov
4 2002 NaN Apr
... and so on
For each of the row I need to get 3 values, if the first group is not present, it should return me NA in place of that.
df['date'].str.extractall(p).apply(lambda x: ','.join(x.dropna()),axis=1).groupby(level=0).apply(', '.join)
This gives me
0 Nov, 21, 2000
1 Sep, 08, 2000
2 Nov, 1999
3 Nov, 26, 1998
4 Apr, 2002
5 2001
6 Mar, 26, 1998
7 Oct, 07, 1999
8 Nov, 20, 1995
9 Mar, 15, 1996
dtype: object
CodePudding user response:
The [aA-zZ]
regex part is not matching only letters, the letter matching regex is [a-zA-Z]
.
Besides, you are joining matches with a comma into single strings, while you want to get three separate columns.
You can use a fixed pattern of yours and merge the groups by using
import pandas as pd
df = pd.DataFrame({'date' : ['Nov 21, 2000','Sep 08, 2000','Nov, 1999','Nov 26, 1998','Apr, 2002','2001','Mar 26, 1998', 'Oct 07, 1999', 'Nov 20, 1995', 'Mar 15, 1996']})
p = r'(\d{4})|\s(\d{2}),|([a-zA-Z] )'
df['date'].str.extractall(p).groupby(level=0).first()
Output:
0 1 2
0 2000 21 Nov
1 2000 08 Sep
2 1999 None Nov
3 1998 26 Nov
4 2002 None Apr
5 2001 None None
6 1998 26 Mar
7 1999 07 Oct
8 1995 20 Nov
9 1996 15 Mar
That is, after you collect all matches, group them by first level and use first()
to get the dataframe.
CodePudding user response:
You can try
p = r'([a-zA-Z]{3})? ?(\d{2})?,? ?(\d{4})'
out = df['date'].str.extractall(p)
print(out)
0 1 2
match
0 0 Nov 21 2000
1 0 Sep 08 2000
2 0 Nov NaN 1999
3 0 Nov 26 1998
4 0 Apr NaN 2002
5 0 NaN NaN 2001
6 0 Mar 26 1998
7 0 Oct 07 1999
8 0 Nov 20 1995
9 0 Mar 15 1996