Pandas conditional filling failing

Time:12-10

import pandas as pd
import numpy as np

sample_data = [
  {'Date': '13-12-2020', 'usable': 1, 'infected': 'Case3'},
  {'Date': '14-12-2020', 'usable': 1},
  {'Date': '15-12-2020', 'usable': 0},
  {'Date': '16-12-2020', 'usable': 1, 'infected': 'Case33'},
  {'Date': '17-12-2020', 'usable': 1},
  {'Date': '18-12-2020', 'usable': 1},  
  {'Date': '19-12-2020', 'usable': 0},
  {'Date': '20-12-2020', 'usable': 0},
  {'Date': '21-12-2020', 'usable': 0, 'infected': 'Case#'},
  {'Date': '22-12-2020', 'usable': 1},
  {'Date': '23-12-2020', 'usable': 1},
  {'Date': '24-12-2020', 'usable': 0},
  {'Date': '25-12-2020', 'usable': 0},
  {'Date': '26-12-2020', 'usable': 1, 'infected': 'Case46'},
  {'Date': '27-12-2020', 'usable': 0},
  {'Date': '28-12-2020', 'usable': 1},
  ]

df = pd.DataFrame(sample_data)
df['infected'] = df['infected'].ffill(limit=2).bfill(limit=2)
df['infected'] = np.where(df['usable']==0, np.nan, df['infected'])

Above is my dataframe and how I am filling it. I want to fill the infected value into up to 2 nearest rows, both forward and backward, subject to the following conditions:

  1. fill the infected value only where usable = 1
  2. fill up to 2 nearest entries on each side (again, only where usable = 1)
  3. stop filling as soon as a row with usable = 0 is reached

But I am getting the wrong output from the snippet below:

df['infected'] = df['infected'].ffill(limit=2).bfill(limit=2)
df['infected'] = np.where(df['usable']==0, np.nan, df['infected'])

Expected output:

expected = [
  {'Date': '13-12-2020', 'usable': 1, 'infected': 'Case3'},
  {'Date': '14-12-2020', 'usable': 1, 'infected': 'Case3'},
  {'Date': '15-12-2020', 'usable': 0},
  {'Date': '16-12-2020', 'usable': 1, 'infected': 'Case33'},
  {'Date': '17-12-2020', 'usable': 1, 'infected': 'Case33'},
  {'Date': '18-12-2020', 'usable': 1, 'infected': 'Case33'},  
  {'Date': '19-12-2020', 'usable': 0},
  {'Date': '20-12-2020', 'usable': 0},
  {'Date': '21-12-2020', 'usable': 0, 'infected': 'Case#'},
  {'Date': '22-12-2020', 'usable': 1},
  {'Date': '23-12-2020', 'usable': 1},
  {'Date': '24-12-2020', 'usable': 0},
  {'Date': '25-12-2020', 'usable': 0},
  {'Date': '26-12-2020', 'usable': 1, 'infected': 'Case46'},
  {'Date': '27-12-2020', 'usable': 0},
  {'Date': '28-12-2020', 'usable': 1},
  ]

df_expected = pd.DataFrame(expected)

'28-12-2020' does not get 'Case46' because '27-12-2020' has usable = 0, so the value cannot be carried forward (ffill) across it.

CodePudding user response:

A solution using bfill and ffill:

import pandas as pd
import numpy as np

sample_data = [
  {'Date': '13-12-2020', 'usable': 1, 'infected': 'Case3'},
  {'Date': '14-12-2020', 'usable': 1},
  {'Date': '15-12-2020', 'usable': 0},
  {'Date': '16-12-2020', 'usable': 1, 'infected': 'Case33'},
  {'Date': '17-12-2020', 'usable': 1},
  {'Date': '18-12-2020', 'usable': 1},  
  {'Date': '19-12-2020', 'usable': 0},
  {'Date': '20-12-2020', 'usable': 0},
  {'Date': '21-12-2020', 'usable': 0, 'infected': 'Case#'},
  {'Date': '22-12-2020', 'usable': 1},
  {'Date': '23-12-2020', 'usable': 1},
  {'Date': '24-12-2020', 'usable': 0},
  {'Date': '25-12-2020', 'usable': 0},
  {'Date': '26-12-2020', 'usable': 1, 'infected': 'Case46'},
  {'Date': '27-12-2020', 'usable': 0},
  {'Date': '28-12-2020', 'usable': 1},
  ]

df = pd.DataFrame(sample_data)

# infected1: keep only the values that are allowed to spread (usable == 1)
df['infected1'] = np.where(df['usable']==0, np.nan, df['infected'])
# infected2: extend infected1 by one row in each direction
df['infected2'] = df['infected1'].ffill(limit=1).bfill(limit=1)
# infected3: remove the values that landed on rows where usable is 0
df['infected3'] = np.where(df['usable']==0, np.nan, df['infected2'])
# infected4: extend infected3 by one more row (this is the extension up to the second row)
df['infected4'] = df['infected3'].ffill(limit=1).bfill(limit=1)
# infected5: again remove the values that landed on rows where usable is 0
df['infected5'] = np.where(df['usable']==0, np.nan, df['infected4'])

# Merge the original column with the result (rows the fill never reached keep their original value)
df['infected'] = np.where(df['infected4'].isna(), df['infected'], df['infected5'])

# Print just for understanding of the code
print(df)
# All of the above could be put in a for loop if you need a distance greater than 2

# Clean up the helper columns
df = df.drop(['infected' + str(x) for x in range(1, 6)], axis=1)
print(df)

Output of last print is:

          Date  usable infected
0   13-12-2020       1    Case3
1   14-12-2020       1    Case3
2   15-12-2020       0      NaN
3   16-12-2020       1   Case33
4   17-12-2020       1   Case33
5   18-12-2020       1   Case33
6   19-12-2020       0      NaN
7   20-12-2020       0      NaN
8   21-12-2020       0    Case#
9   22-12-2020       1      NaN
10  23-12-2020       1      NaN
11  24-12-2020       0      NaN
12  25-12-2020       0      NaN
13  26-12-2020       1   Case46
14  27-12-2020       0      NaN
15  28-12-2020       1      NaN
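
The comment about turning the steps into a for loop could look roughly like the sketch below. The function name fill_within_usable and its distance parameter are made up for illustration and are not part of pandas; the sample frame is rebuilt in compact form so the block runs on its own. One deliberate difference from the step-by-step code: rows with usable = 0 always keep whatever value they originally held.

```python
import numpy as np
import pandas as pd

# Compact rebuild of the sample data from the question
cases = {0: "Case3", 3: "Case33", 8: "Case#", 13: "Case46"}
df = pd.DataFrame({
    "Date": [f"{day}-12-2020" for day in range(13, 29)],
    "usable": [1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1],
    "infected": [cases.get(i, np.nan) for i in range(16)],
})


def fill_within_usable(frame, distance=2):
    """Hypothetical helper: spread 'infected' up to `distance` rows each way,
    one row per iteration, never across a usable == 0 row."""
    usable = frame["usable"].eq(1)
    # Only usable rows may act as a source for the fill
    current = frame["infected"].where(usable)
    for _ in range(distance):
        # One step forward and one step backward, then blank the usable == 0 rows
        # so that the next step cannot carry a value over them
        current = current.ffill(limit=1).bfill(limit=1).where(usable)
    # usable == 0 rows keep their original value, usable rows take the filled one
    return frame["infected"].where(~usable, current)


df["infected_filled"] = fill_within_usable(df, distance=2)
print(df)
```

Blanking the usable = 0 rows after every single step, rather than once at the end, is what stops a value from passing over a zero.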

CodePudding user response:

You can do this with suitable masks (the explanation is in the comments):

# Make sure you select those rows whose previous row's usable is not 0 and it itself is usable==1 (satisfies your third condition)
m1 = (df["usable"].shift().ne(0)) & (df["usable"].eq(1))

# Also select the rows which are usable and have a valid infected value (the fill values)
m2 = (df["usable"].eq(1)) & (df["infected"].notna())

# Then select with m1 | m2 to ffill and then concat with remaining rows
out = pd.concat([df[~m1 & ~m2], df[m1 | m2].ffill(limit=2)]).sort_values(
    "Date", ignore_index=True
)

print(out)

          Date  usable infected
0   13-12-2020       1    Case3
1   14-12-2020       1    Case3
2   15-12-2020       0      NaN
3   16-12-2020       1   Case33
4   17-12-2020       1   Case33
5   18-12-2020       1   Case33
6   19-12-2020       0      NaN
7   20-12-2020       0      NaN
8   21-12-2020       0    Case#
9   22-12-2020       1      NaN
10  23-12-2020       1      NaN
11  24-12-2020       0      NaN
12  25-12-2020       0      NaN
13  26-12-2020       1   Case46
14  27-12-2020       0      NaN
15  28-12-2020       1      NaN
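
Sorting on Date puts the rows back in order here only because the dates are strings that all fall in the same month and year. A possible variant, sketched below, restores the order from the original index instead. The sample frame is rebuilt in compact form so the block runs on its own, and the variable name selected is just illustrative.

```python
import numpy as np
import pandas as pd

# Compact rebuild of the sample data from the question
cases = {0: "Case3", 3: "Case33", 8: "Case#", 13: "Case46"}
df = pd.DataFrame({
    "Date": [f"{day}-12-2020" for day in range(13, 29)],
    "usable": [1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1],
    "infected": [cases.get(i, np.nan) for i in range(16)],
})

# Same masks as in the answer above
m1 = df["usable"].shift().ne(0) & df["usable"].eq(1)
m2 = df["usable"].eq(1) & df["infected"].notna()
selected = m1 | m2

# Forward fill the selected rows, put the untouched rows back,
# and restore the original row order from the index rather than from Date
out = pd.concat([df[~selected], df[selected].ffill(limit=2)]).sort_index()
print(out)
```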

CodePudding user response:

Fill in the missing values in the infected column up to two entries on each side using the ffill() and bfill() methods:

# Fill in the missing values in the infected column
df['infected'] = df['infected'].ffill(limit=2).bfill(limit=2)

Set the values of the infected column to NaN where usable is 0 using the where() method:

# Set the values of the infected column to NaN where usable is 0
df['infected'] = df['infected'].where(df['usable'] == 1)
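
Note that these two steps fill first and mask afterwards, so a value can still pass over a usable = 0 row: on the sample data, 28-12-2020 picks up Case46 and 21-12-2020 loses Case#. One way to avoid that, shown here only as a sketch, is to fill inside each unbroken run of usable rows. The names run_id, candidates and result are illustrative, and the sample frame is rebuilt in compact form so the block runs on its own.

```python
import numpy as np
import pandas as pd

# Compact rebuild of the sample data from the question
cases = {0: "Case3", 3: "Case33", 8: "Case#", 13: "Case46"}
df = pd.DataFrame({
    "Date": [f"{day}-12-2020" for day in range(13, 29)],
    "usable": [1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1],
    "infected": [cases.get(i, np.nan) for i in range(16)],
})

usable = df["usable"].eq(1)
# The counter increases at every usable == 0 row, so usable rows
# separated by a zero never share the same run id
run_id = (~usable).cumsum()

# Work on usable rows only; each run is filled independently of the others
candidates = df.loc[usable, "infected"]
filled = candidates.groupby(run_id[usable]).transform(
    lambda s: s.ffill(limit=2).bfill(limit=2)
)

# usable == 0 rows are left exactly as they were
result = df.copy()
result.loc[usable, "infected"] = filled

# For comparison: fill first, mask afterwards (the two-step version above)
naive = df["infected"].ffill(limit=2).bfill(limit=2).where(usable)

print(result)
print(naive)
```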