import pandas as pd
import numpy as np
sample_data = [
{'Date': '13-12-2020', 'usable': 1, 'infected': 'Case3'},
{'Date': '14-12-2020', 'usable': 1},
{'Date': '15-12-2020', 'usable': 0},
{'Date': '16-12-2020', 'usable': 1, 'infected': 'Case33'},
{'Date': '17-12-2020', 'usable': 1},
{'Date': '18-12-2020', 'usable': 1},
{'Date': '19-12-2020', 'usable': 0},
{'Date': '20-12-2020', 'usable': 0},
{'Date': '21-12-2020', 'usable': 0, 'infected': 'Case#'},
{'Date': '22-12-2020', 'usable': 1},
{'Date': '23-12-2020', 'usable': 1},
{'Date': '24-12-2020', 'usable': 0},
{'Date': '25-12-2020', 'usable': 0},
{'Date': '26-12-2020', 'usable': 1, 'infected': 'Case46'},
{'Date': '27-12-2020', 'usable': 0},
{'Date': '28-12-2020', 'usable': 1},
]
df = pd.DataFrame(sample_data)
df['infected'] = df['infected'].ffill(limit=2).bfill(limit=2)
df['infected'] = np.where(df['usable'] == 0, np.nan, df['infected'])
Above is my dataframe and how I am currently filling it. I want to fill up to the 2 nearest rows, both forward and backward, subject to the following conditions:
- fill the infected value only where usable = 1
- fill up to 2 nearest entries on each side (again, only where usable = 1)
- stop filling as soon as a row with usable = 0 is reached
But I am getting the wrong output from the snippet below:
df['infected'] = df['infected'].ffill(limit=2).bfill(limit=2)
df['infected'] = np.where(df['usable'] == 0, np.nan, df['infected'])
Expected output:
expected = [
{'Date': '13-12-2020', 'usable': 1, 'infected': 'Case3'},
{'Date': '14-12-2020', 'usable': 1, 'infected': 'Case3'},
{'Date': '15-12-2020', 'usable': 0},
{'Date': '16-12-2020', 'usable': 1, 'infected': 'Case33'},
{'Date': '17-12-2020', 'usable': 1, 'infected': 'Case33'},
{'Date': '18-12-2020', 'usable': 1, 'infected': 'Case33'},
{'Date': '19-12-2020', 'usable': 0},
{'Date': '20-12-2020', 'usable': 0},
{'Date': '21-12-2020', 'usable': 0, 'infected': 'Case#'},
{'Date': '22-12-2020', 'usable': 1},
{'Date': '23-12-2020', 'usable': 1},
{'Date': '24-12-2020', 'usable': 0},
{'Date': '25-12-2020', 'usable': 0},
{'Date': '26-12-2020', 'usable': 1, 'infected': 'Case46'},
{'Date': '27-12-2020', 'usable': 0},
{'Date': '28-12-2020', 'usable': 1},
]
df_expected = pd.DataFrame(expected)
'28-12-2020' does not get infected = 'Case46' because 27-12-2020 has usable = 0, so the value cannot be carried forward (ffill) past it.
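To make the mismatch concrete, here is a minimal sketch that runs the attempt above and lists the rows where it differs from the expected result. The helper `build_frame` and the compact way of rebuilding the sample data are assumptions made for illustration only; they are not part of the original code.

```python
import numpy as np
import pandas as pd

def build_frame(values):
    """Hypothetical helper: rebuild the 16-row sample from {row: infected}."""
    dates = pd.date_range("2020-12-13", periods=16).strftime("%d-%m-%Y").tolist()
    usable = [1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1]
    infected = pd.Series(values, dtype=object).reindex(range(16))
    return pd.DataFrame({"Date": dates, "usable": usable, "infected": infected})

df = build_frame({0: "Case3", 3: "Case33", 8: "Case#", 13: "Case46"})
df_expected = build_frame({0: "Case3", 1: "Case3", 3: "Case33", 4: "Case33",
                           5: "Case33", 8: "Case#", 13: "Case46"})

# The attempt from the question: fill first, mask afterwards
df["infected"] = df["infected"].ffill(limit=2).bfill(limit=2)
df["infected"] = np.where(df["usable"] == 0, np.nan, df["infected"])

# NaN-aware comparison: two NaNs count as equal
got, want = df["infected"], df_expected["infected"]
same = got.eq(want) | (got.isna() & want.isna())
mismatch_rows = df.index[~same].tolist()

print(df.loc[~same, ["Date", "usable"]].assign(got=got[~same], expected=want[~same]))
```

The rows it reports are the ones where a value either crossed a usable = 0 row or where an existing value on a usable = 0 row was wiped by the mask.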
CodePudding user response:
A solution using bfill and ffill
import pandas as pd
import numpy as np
sample_data = [
{'Date': '13-12-2020', 'usable': 1, 'infected': 'Case3'},
{'Date': '14-12-2020', 'usable': 1},
{'Date': '15-12-2020', 'usable': 0},
{'Date': '16-12-2020', 'usable': 1, 'infected': 'Case33'},
{'Date': '17-12-2020', 'usable': 1},
{'Date': '18-12-2020', 'usable': 1},
{'Date': '19-12-2020', 'usable': 0},
{'Date': '20-12-2020', 'usable': 0},
{'Date': '21-12-2020', 'usable': 0, 'infected': 'Case#'},
{'Date': '22-12-2020', 'usable': 1},
{'Date': '23-12-2020', 'usable': 1},
{'Date': '24-12-2020', 'usable': 0},
{'Date': '25-12-2020', 'usable': 0},
{'Date': '26-12-2020', 'usable': 1, 'infected': 'Case46'},
{'Date': '27-12-2020', 'usable': 0},
{'Date': '28-12-2020', 'usable': 1},
]
df = pd.DataFrame(sample_data)
# infected1: keep only the values that may be extended (usable == 1)
df['infected1'] = np.where(df['usable'] == 0, np.nan, df['infected'])
# infected2: extend infected1 by one row in each direction
df['infected2'] = df['infected1'].ffill(limit=1).bfill(limit=1)
# infected3: drop the values that landed on rows where usable is 0
df['infected3'] = np.where(df['usable'] == 0, np.nan, df['infected2'])
# infected4: extend infected3 by one more row (distance 2 from the original value)
df['infected4'] = df['infected3'].ffill(limit=1).bfill(limit=1)
# infected5: again drop the values that landed on rows where usable is 0
df['infected5'] = np.where(df['usable'] == 0, np.nan, df['infected4'])
# Merge with the original column so values on untouched rows (e.g. 'Case#') are kept
df['infected'] = np.where(df['infected4'].isna(), df['infected'], df['infected5'])
# Print the intermediate columns, just to follow the steps
print(df)
# All of the above could be done in a for loop if you need a distance greater than 2
# Clean up the helper columns
df = df.drop(['infected' + str(x) for x in range(1, 6)], axis=1)
print(df)
Output of the last print is:
Date usable infected
0 13-12-2020 1 Case3
1 14-12-2020 1 Case3
2 15-12-2020 0 NaN
3 16-12-2020 1 Case33
4 17-12-2020 1 Case33
5 18-12-2020 1 Case33
6 19-12-2020 0 NaN
7 20-12-2020 0 NaN
8 21-12-2020 0 Case#
9 22-12-2020 1 NaN
10 23-12-2020 1 NaN
11 24-12-2020 0 NaN
12 25-12-2020 0 NaN
13 26-12-2020 1 Case46
14 27-12-2020 0 NaN
15 28-12-2020 1 NaN
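The for loop mentioned in the comments could look roughly like the sketch below. It is one possible reading of the step-by-step idea, not a tested drop-in: the function name `fill_near` and the `steps` parameter are made up for illustration, and the sample data is rebuilt in a compact form.

```python
import numpy as np
import pandas as pd

def fill_near(df, steps=2):
    """Hypothetical helper: extend values one row at a time, `steps` times."""
    usable = df["usable"].eq(1)
    # Seeds: only values sitting on usable rows may spread
    filled = df["infected"].where(usable)
    for _ in range(steps):
        # Spread by one row in each direction, then drop whatever
        # landed on a usable == 0 row so it cannot spread further
        filled = filled.ffill(limit=1).bfill(limit=1).where(usable)
    # Keep original values on rows the loop left empty (e.g. 'Case#')
    return filled.where(filled.notna(), df["infected"])

infected = [np.nan] * 16
for row, case in {0: "Case3", 3: "Case33", 8: "Case#", 13: "Case46"}.items():
    infected[row] = case

df = pd.DataFrame({
    "usable": [1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1],
    "infected": infected,
})

result = fill_near(df, steps=2)
print(result)
```

Masking after every single step is what stops a value at the first usable = 0 row; masking only once at the end would let it jump over that row.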
CodePudding user response:
You can do this with suitable masks (the explanation is in the comments):
# m1: rows with usable == 1 whose previous row's usable is not 0 (this covers your third condition)
m1 = (df["usable"].shift().ne(0)) & (df["usable"].eq(1))
# m2: usable rows that already hold a valid infected value (the values to fill from)
m2 = (df["usable"].eq(1)) & (df["infected"].notna())
# ffill only the rows selected by m1 | m2, then concat with the remaining rows and restore the order
out = pd.concat([df[~m1 & ~m2], df[m1 | m2].ffill(limit=2)]).sort_values(
"Date", ignore_index=True
)
print(out)
Date usable infected
0 13-12-2020 1 Case3
1 14-12-2020 1 Case3
2 15-12-2020 0 NaN
3 16-12-2020 1 Case33
4 17-12-2020 1 Case33
5 18-12-2020 1 Case33
6 19-12-2020 0 NaN
7 20-12-2020 0 NaN
8 21-12-2020 0 Case#
9 22-12-2020 1 NaN
10 23-12-2020 1 NaN
11 24-12-2020 0 NaN
12 25-12-2020 0 NaN
13 26-12-2020 1 Case46
14 27-12-2020 0 NaN
15 28-12-2020 1 NaN
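One caveat about the final sort: Date holds dd-mm-yyyy strings, and `sort_values` compares them as text. That restores the row order here because every date falls in the same month, but it may not once the data spans several months. A small sketch with made-up dates (not taken from the question) shows the difference. Sorting by the original index with `sort_index()` would be another option, since `pd.concat` keeps the index labels.

```python
import pandas as pd

# Made-up dates spanning a year boundary, in dd-mm-yyyy text form
dates = pd.Series(["28-12-2020", "02-01-2021", "13-12-2020"])

# Sorting the strings compares them character by character
as_text = dates.sort_values().tolist()

# Parsing first gives a chronological order
parsed = pd.to_datetime(dates, format="%d-%m-%Y")
chronological = dates.loc[parsed.sort_values().index].tolist()

print(as_text)
print(chronological)
```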
CodePudding user response:
Fill in the missing values in the infected column up to two entries on each side using the ffill() and bfill() methods:
# Fill in the missing values in the infected column
df['infected'] = df['infected'].ffill(limit=2).bfill(limit=2)
Set the values of the infected column to NaN where usable is 0 using the where() method:
# Set the values of the infected column to NaN where usable is 0
df['infected'] = df['infected'].where(df['usable'] == 1)
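Note that filling first and masking afterwards can still carry a value across a usable = 0 row (in the sample data, 'Case46' would reach 28-12-2020) and blanks an existing value on a usable = 0 row such as 'Case#'. A minimal sketch of one way around this, assuming the goal is the expected output from the question, is to fill only inside runs of consecutive usable rows. The names `run_id`, `key`, `vals` and `out` are illustrative, and the sample data is rebuilt compactly.

```python
import numpy as np
import pandas as pd

infected = [np.nan] * 16
for row, case in {0: "Case3", 3: "Case33", 8: "Case#", 13: "Case46"}.items():
    infected[row] = case

df = pd.DataFrame({
    "usable": [1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1],
    "infected": infected,
})

usable = df["usable"].eq(1)
# Every usable == 0 row starts a new id, so consecutive usable rows share one
run_id = df["usable"].eq(0).cumsum()

key = run_id[usable]
vals = df.loc[usable, "infected"]
# Fill at most 2 rows forward, then 2 rows backward, never leaving a run
vals = vals.groupby(key).ffill(limit=2)
vals = vals.groupby(key).bfill(limit=2)

out = df.copy()
# Only usable rows are written back; usable == 0 rows keep what they had
out.loc[usable, "infected"] = vals
print(out)
```

Because the groups never contain a usable = 0 row, no value can pass one, and rows such as 'Case#' are left as they were.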