Home > Enterprise >  How to create 0/1 DataFrame based on two other DataFrame with conditions?
How to create 0/1 DataFrame based on two other DataFrame with conditions?

Time:12-02

I try to create a DataFrame df3 based on two other DataFrames df1 and df2. If the value in df1 is bigger or equal the value in df2, then I want to assign a 1 in df3 and otherwise a 0.

This would be a sample data for illustration:

df1
    Date        01K  02K  03K   04K
0   2021-01-01  NaN  3.5  4.2   NaN
1   2021-01-02  2.3  0.1  5.2   2.6
2   2021-01-03  0.3  NaN  2.5   8.2
3   2021-01-04  0.4  NaN  3.0   4.2

df2
    Date        01K  02K  03K   04K
0   2021-01-01  NaN  NaN  NaN   NaN
1   2021-01-02  2.5  0.6  5.8   2.3
2   2021-01-03  0.2  NaN  2.5   8.1
3   2021-01-04  0.3  NaN  2.8   4.2

df3
    Date        01K  02K  03K   04K
0   2021-01-01  0    0    0     0
1   2021-01-02  0    0    0     1
2   2021-01-03  1    0    1     1
3   2021-01-04  1    0    1     1

For reproducibility:

df1 = pd.DataFrame({
    'Date':['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04'],
    '01K':['NaN', 2.3, 0.3, 0.4], 
    '02K':[3.5, 0.1, 'NaN', 'NaN'], 
    '03K':[4.2, 5.2, 2.5, 3.0], 
    '04K':['NaN', 2.6, 8.2, 4.2]}) 
df1 = df1.replace('NaN',np.nan)

df2 = pd.DataFrame({
    'Date':['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04'],
    '01K':['NaN', 2.5, 0.2, 0.3], 
    '02K':['NaN', 0.6, 'NaN', 'NaN'], 
    '03K':['NaN', 5.8, 2.5, 2.8], 
    '04K':['NaN', 2.3, 8.1, 4.2]}) 
df2 = df2.replace('NaN',np.nan)

I tried to do it with np.where(), but I get an array as output.

df3 = np.where(df1>=df2, 1, 0)

array([[1, 0, 0, 0, 0],
       [1, 0, 0, 0, 1],
       [1, 1, 0, 1, 1],
       [1, 1, 0, 1, 1]])

Is there an efficient way to create df3 with the same structure as the other DataFrames for a big dataset?

Thanks a lot!

CodePudding user response:

Convert Date to DatetimeIndex in both, so possible compare with casting ouput to integers for map True, Falses to 1, 0:

df1 = df1.set_index('Date')
df2 = df2.set_index('Date')

df3 = (df1>=df2).astype(int)

If need instead 1 using -1 use:

df3 = (df1>=df2).astype(int).mul(-1)

df3 = pd.DataFrame(np.where(df1>=df2, -1, 0), 
                  index=df1.index, 
                  columns=df1.columns)
  • Related