I have two dataframes
df1
timestamp ABC_d XYZ_d PQR_d ...
2018-01-01 16 nan nan
2018-01-02 15 nan nan
2018-01-03 14 nan nan
2018-01-04 nan 15 nan
2018-01-05 nan 13 nan
2018-01-06 nan nan 17
2018-01-07 nan nan 16
2018-01-08 nan nan 15
df2
timestamp ABC_d XYZ_d PQR_d ...
2018-01-01 3.0 2 5
2018-01-02 66 8
2018-01-03 5 8 6
2018-01-04 115 nan
2018-01-05 nan 13 12
2018-01-06 8 11 10
2018-01-07 7 5 10
2018-01-08 nan nan 10
I want to create a new df such that only the non-na values/locations of df1 are picked and used for selecting data from df2. So effectively need to mask the entire df1 for criteria of non na, and then apply on df2. So the interim df would be:
timestamp ABC_d XYZ_d PQR_d ...
2018-01-01 3.0 nan nan
2018-01-02 66 nan nan
2018-01-03 5 nan nan
2018-01-04 nan 15 nan
2018-01-05 nan 13 nan
2018-01-06 nan nan 10
2018-01-07 nan nan 10
2018-01-08 nan nan 10
How can I efficiently achieve this ? Later I would like to bring all the various column values to the first column, and the second column to indicate the original column from where it came, resulting in :
timestamp Numero Class
2018-01-01 3.0 ABC_d
2018-01-02 66 ABC_d
2018-01-03 5 ABC_d
2018-01-04 15 XYZ_d
2018-01-05 13 XYZ_d
2018-01-06 10 PQR_d
2018-01-07 10 PQR_d
2018-01-08 10 PQR_d
CodePudding user response:
You can compare values in DataFrame.where
with test not missing values, last reshape values by DataFrame.stack
:
df = df2.set_index('timestamp').where(df1.set_index('timestamp').notna())
print (df)
ABC_d XYZ_d PQR_d
timestamp
2018-01-01 3.0 NaN NaN
2018-01-02 66.0 NaN NaN
2018-01-03 5.0 NaN NaN
2018-01-04 NaN NaN NaN
2018-01-05 NaN 13.0 NaN
2018-01-06 NaN NaN 10.0
2018-01-07 NaN NaN 10.0
2018-01-08 NaN NaN 10.0
df = (df.rename_axis('Class', axis=1)
.stack()
.reset_index(name='Numero')[['timestamp','Numero','Class']])
print (df)
timestamp Numero Class
0 2018-01-01 3.0 ABC_d
1 2018-01-02 66.0 ABC_d
2 2018-01-03 5.0 ABC_d
3 2018-01-05 13.0 XYZ_d
4 2018-01-06 10.0 PQR_d
5 2018-01-07 10.0 PQR_d
6 2018-01-08 10.0 PQR_d
CodePudding user response:
Can also use:
df = df_2.mask(df_1.isnull())
OUTPUT:
timestamp ABC_d XYZ_d PQR_d
0 2018-01-01 3.0 NaN NaN
1 2018-01-02 66.0 NaN NaN
2 2018-01-03 5.0 NaN NaN
3 2018-01-04 NaN NaN NaN
4 2018-01-05 NaN 13.0 NaN
5 2018-01-06 NaN NaN 10.0
6 2018-01-07 NaN NaN 10.0
7 2018-01-08 NaN NaN 10.0
Then use this:
df = df.melt(id_vars=['timestamp'], value_vars=['ABC_d', 'XYZ_d', 'PQR_d'], var_name='Class', value_name='Numero')
df.dropna(subset=['Numero'], inplace=True)
OUTPUT:
timestamp Class Numero
0 2018-01-01 ABC_d 3.0
1 2018-01-02 ABC_d 66.0
2 2018-01-03 ABC_d 5.0
12 2018-01-05 XYZ_d 13.0
21 2018-01-06 PQR_d 10.0
22 2018-01-07 PQR_d 10.0
23 2018-01-08 PQR_d 10.0