Home > Software design >  Masking on the entire df
Masking on the entire df

Time:10-19

I have two dataframes

df1

timestamp  ABC_d XYZ_d PQR_d ...
2018-01-01  16    nan    nan 
2018-01-02  15    nan    nan 
2018-01-03  14    nan    nan 
2018-01-04  nan    15    nan 
2018-01-05  nan    13    nan 
2018-01-06  nan    nan    17
2018-01-07  nan    nan    16
2018-01-08  nan    nan    15

df2

timestamp  ABC_d XYZ_d PQR_d ...
2018-01-01  3.0   2     5
2018-01-02  66    8 
2018-01-03  5     8     6 
2018-01-04  115   nan 
2018-01-05  nan  13     12 
2018-01-06  8    11     10
2018-01-07  7    5      10
2018-01-08  nan  nan    10

I want to create a new df such that only the non-na values/locations of df1 are picked and used for selecting data from df2. So effectively need to mask the entire df1 for criteria of non na, and then apply on df2. So the interim df would be:

timestamp  ABC_d XYZ_d PQR_d ...
2018-01-01  3.0    nan    nan 
2018-01-02  66    nan    nan 
2018-01-03  5    nan    nan 
2018-01-04  nan    15    nan 
2018-01-05  nan    13    nan 
2018-01-06  nan    nan    10
2018-01-07  nan    nan    10
2018-01-08  nan    nan    10

How can I efficiently achieve this ? Later I would like to bring all the various column values to the first column, and the second column to indicate the original column from where it came, resulting in :

timestamp  Numero Class 
2018-01-01  3.0   ABC_d
2018-01-02  66    ABC_d  
2018-01-03  5     ABC_d 
2018-01-04  15    XYZ_d 
2018-01-05  13    XYZ_d 
2018-01-06  10    PQR_d 
2018-01-07  10    PQR_d 
2018-01-08  10    PQR_d 

CodePudding user response:

You can compare values in DataFrame.where with test not missing values, last reshape values by DataFrame.stack:

df = df2.set_index('timestamp').where(df1.set_index('timestamp').notna())

print (df)
            ABC_d  XYZ_d  PQR_d 
timestamp                           
2018-01-01    3.0    NaN    NaN  
2018-01-02   66.0    NaN    NaN  
2018-01-03    5.0    NaN    NaN  
2018-01-04    NaN    NaN    NaN  
2018-01-05    NaN   13.0    NaN  
2018-01-06    NaN    NaN   10.0  
2018-01-07    NaN    NaN   10.0  
2018-01-08    NaN    NaN   10.0  


df = (df.rename_axis('Class', axis=1)
        .stack()
        .reset_index(name='Numero')[['timestamp','Numero','Class']])
print (df)
    timestamp  Numero  Class
0  2018-01-01     3.0  ABC_d
1  2018-01-02    66.0  ABC_d
2  2018-01-03     5.0  ABC_d
3  2018-01-05    13.0  XYZ_d
4  2018-01-06    10.0  PQR_d
5  2018-01-07    10.0  PQR_d
6  2018-01-08    10.0  PQR_d

CodePudding user response:

Can also use:

df = df_2.mask(df_1.isnull())

OUTPUT:

    timestamp  ABC_d  XYZ_d  PQR_d
0  2018-01-01    3.0    NaN    NaN
1  2018-01-02   66.0    NaN    NaN
2  2018-01-03    5.0    NaN    NaN
3  2018-01-04    NaN    NaN    NaN
4  2018-01-05    NaN   13.0    NaN
5  2018-01-06    NaN    NaN   10.0
6  2018-01-07    NaN    NaN   10.0
7  2018-01-08    NaN    NaN   10.0

Then use this:

df = df.melt(id_vars=['timestamp'], value_vars=['ABC_d', 'XYZ_d', 'PQR_d'], var_name='Class', value_name='Numero')
df.dropna(subset=['Numero'], inplace=True)

OUTPUT:

     timestamp  Class  Numero
0   2018-01-01  ABC_d     3.0
1   2018-01-02  ABC_d    66.0
2   2018-01-03  ABC_d     5.0
12  2018-01-05  XYZ_d    13.0
21  2018-01-06  PQR_d    10.0
22  2018-01-07  PQR_d    10.0
23  2018-01-08  PQR_d    10.0
  • Related