Home > Blockchain >  Pandas : Apply regex with multiple patterns and expand the results along with nan values
Pandas : Apply regex with multiple patterns and expand the results along with nan values

Time:07-21

Say I have a dataframe :

df = pd.DataFrame({'date' : ['Nov 21, 2000','Sep 08, 2000','Nov, 1999','Nov 26, 1998','Apr, 2002','2001','Mar 26, 1998', 'Oct 07, 1999', 'Nov 20, 1995', 'Mar 15, 1996']})
           date
0  Nov 21, 2000
1  Sep 08, 2000
2     Nov, 1999
3  Nov 26, 1998
4     Apr, 2002
5          2001
6  Mar 26, 1998
7  Oct 07, 1999
8  Nov 20, 1995
9  Mar 15, 1996

And I have a regex which has multiple patterns

p = '(\\d{4})|\\s(\\d{2}),|([aA-zZ] )'
df['date'].str.extractall(p).apply(lambda x: x,axis=0)

I get the values but NaN values for other corresponding rows

            0    1    2
  match                
0 0       NaN  NaN  Nov
  1       NaN   21  NaN
  2      2000  NaN  NaN
1 0       NaN  NaN  Sep
  1       NaN   08  NaN
  2      2000  NaN  NaN
2 0       NaN  NaN  Nov
  1      1999  NaN  NaN
3 0       NaN  NaN  Nov
  1       NaN   26  NaN
  2      1998  NaN  NaN
4 0       NaN  NaN  Apr
  1      2002  NaN  NaN
5 0      2001  NaN  NaN
6 0       NaN  NaN  Mar
  1       NaN   26  NaN
  2      1998  NaN  NaN
7 0       NaN  NaN  Oct
  1       NaN   07  NaN
  2      1999  NaN  NaN
8 0       NaN  NaN  Nov
  1       NaN   20  NaN
  2      1995  NaN  NaN
9 0       NaN  NaN  Mar
  1       NaN   15  NaN
  2      1996  NaN  NaN

Expected Output :

         0    1    2
                
0       2000  21  Nov
1       2000  08  Sep
2       1999  NaN Nov
3       1998  26  Nov
4       2002  NaN Apr
... and so on

For each of the row I need to get 3 values, if the first group is not present, it should return me NA in place of that.

df['date'].str.extractall(p).apply(lambda x: ','.join(x.dropna()),axis=1).groupby(level=0).apply(', '.join)

This gives me

    0    Nov, 21, 2000
    1    Sep, 08, 2000
    2        Nov, 1999
    3    Nov, 26, 1998
    4        Apr, 2002
    5             2001
    6    Mar, 26, 1998
    7    Oct, 07, 1999
    8    Nov, 20, 1995
    9    Mar, 15, 1996
    dtype: object

CodePudding user response:

The [aA-zZ] regex part is not matching only letters, the letter matching regex is [a-zA-Z].

Besides, you are joining matches with a comma into single strings, while you want to get three separate columns.

You can use a fixed pattern of yours and merge the groups by using

import pandas as pd
df = pd.DataFrame({'date' : ['Nov 21, 2000','Sep 08, 2000','Nov, 1999','Nov 26, 1998','Apr, 2002','2001','Mar 26, 1998', 'Oct 07, 1999', 'Nov 20, 1995', 'Mar 15, 1996']})

p = r'(\d{4})|\s(\d{2}),|([a-zA-Z] )'
df['date'].str.extractall(p).groupby(level=0).first()

Output:

      0     1     2
0  2000    21   Nov
1  2000    08   Sep
2  1999  None   Nov
3  1998    26   Nov
4  2002  None   Apr
5  2001  None  None
6  1998    26   Mar
7  1999    07   Oct
8  1995    20   Nov
9  1996    15   Mar

That is, after you collect all matches, group them by first level and use first() to get the dataframe.

CodePudding user response:

You can try

p = r'([a-zA-Z]{3})? ?(\d{2})?,? ?(\d{4})'
out = df['date'].str.extractall(p)
print(out)

           0    1     2
  match
0 0      Nov   21  2000
1 0      Sep   08  2000
2 0      Nov  NaN  1999
3 0      Nov   26  1998
4 0      Apr  NaN  2002
5 0      NaN  NaN  2001
6 0      Mar   26  1998
7 0      Oct   07  1999
8 0      Nov   20  1995
9 0      Mar   15  1996
  • Related