Home > other >  Create new data.frame based on binary (0/1) data.frame
Create new data.frame based on binary (0/1) data.frame

Time:11-02

I have two data.frames (df1 and binary df2) and I want to create a third data.frame (df3) based on df1 conditioned on the binary data.frame df2. I want to set all values to zero if there is a zero at that position in the binary data.frame 2.

import pandas as pd
df1 = pd.DataFrame({
    'Date':['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04'],
    '01K W':['1.2', '2.3', '0.3', '0.5'], 
    '02K W':['3.5', '0.1', 'nan', 'nan'], 
    '03K W':['4.2', '5.2', '2.5', '3.0'], 
    '04K W':['1.5', '2.6', '8.2', '4.2']}) 

df2 = pd.DataFrame({
    'Date':['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04'],
    '01K W':['1', '1', '1', '1'], 
    '02K W':['1', '0', '0', '0'], 
    '03K W':['0', '0', '0', '1'], 
    '04K W':['0', '1', '1', '0']}) 

To create the new data.frame 3, I was thinking about a for loop over the columns and replacing the values in df1 based on df2 (values=0). So far, I wasn't able to figure out a correct syntax.

df3 = for col in df1.columns:
    if df2[col].value[1:] == 0:
        df3[col].value[1:] = 0
    else:
        df3[col].value[1:] = df1[col].value 

I would expect a data.frame which looks like that:

df3 = pd.DataFrame({
    'Date':['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04'],
    '01K W':['1.2', '2.3', '0.3', '0.5'], 
    '02K W':['3.5', '0', '0', '0'], 
    '03K W':['0', '0', '0', '3.0'], 
    '04K W':['0', '2.6', '8.2', '0']}) 

Thank you very much for your help :)

CodePudding user response:

You can index df1 by df2. Since df2 is of string dtype, first convert to boolean.

df2_bool = df2.set_index('Date').astype(int).astype(bool)

df1.set_index('Date')[df2_bool].fillna(0).reset_index()

You get

    Date        01K W   02K W   03K W   04K W
0   2021-01-01  1.2     3.5     0       0
1   2021-01-02  2.3     0       0       2.6
2   2021-01-03  0.3     0       0       8.2
3   2021-01-04  0.5     0       3.0     0

Ofcourse you can make it a one-line (and avoid saving boolean df2)

df1.set_index('Date')[df2.set_index('Date').astype(int).astype(bool)].fillna(0).reset_index()

CodePudding user response:

You can do something like this:

bm = df2.sort_values('Date').set_index('Date').astype(int).astype(bool).values

df1.sort_values('Date').set_index('Date').where(bm, 0).reset_index()

Basically you create a binary matrix:

df2.sort_values('Date').set_index('Date').astype(int).astype(bool).values

Then you use the pandas where clause.

NOTE: This will only work if all dates are in df1 and df2.

CodePudding user response:

This code below will work.

df1 = df1.set_index('Date').astype(float)
df2 = df2.set_index('Date').astype(float)
df1.fillna(0,inplace=True)
df3 = df1 * df2

You can use fillna() if df2 also has null values.

The operation is a simple matrix multiplication.

you can also do this.

df1.mul(df2)
  • Related