I have a pandas dataframe like this:
thing1 thing2 num_col1 num_col2 num_col3 num_col4
aaa abc 0.0 99.76 101.1 111.2
bbb cde 11.3 109.76 201.1 121.2
ccc def 91.3 0.0 301.1 131.2
.....
.....
aaa efg 0.0 0.0 401.1 141.2
bbb fgh 41.3 299.76 0.0 151.2
ccc ghi 201.3 199.76 601.1 161.2
and I want to convert it to this:
thing1 thing2 num_col1 num_col2 num_col3 num_col4
aaa abc nan 99.76 101.1 111.2
bbb cde 11.3 109.76 201.1 121.2
ccc def 91.3 0.0 301.1 131.2
.....
.....
aaa efg nan nan 401.1 141.2
bbb fgh 41.3 299.76 0.0 151.2
ccc ghi 201.3 199.76 601.1 161.2
Explanation: if a row's values in num_col1, num_col2, num_col3 and num_col4 start with 0, those leading 0s need to be converted to numpy.nan, and any 0s in the middle of the row (after a non-zero value) shouldn't be touched. How do I do this as efficiently as possible (using built-in functions rather than Python loops)?
CodePudding user response:
IIUC, you can create a boolean mask dataframe that marks where the num_col columns equal zero, then loop through the num_col columns and, for each one, check whether the values from num_col1 up to that column are all True.
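import numpy as np
cols = df.filter(like='num_col').columns
m = df[cols].eq(0)  # True where a value is zero
for idx, col in enumerate(cols[1:]):
    # stay True only if every column from num_col1 up to col is zero
    m[col] = m[cols[:idx + 2]].all(axis=1)
df[cols] = df[cols].mask(m, np.nan)
print(df)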
thing1 thing2 num_col1 num_col2 num_col3 num_col4
0 aaa abc NaN 99.76 101.1 111.2
1 bbb cde 11.3 109.76 201.1 121.2
2 ccc def 91.3 0.00 301.1 131.2
3 aaa efg NaN NaN 401.1 141.2
4 bbb fgh 41.3 299.76 0.0 151.2
5 ccc ghi 201.3 199.76 601.1 161.2
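If the Python loop over columns is a concern, the same "all zeros so far" check can be written as a running logical AND. The sketch below is not part of the original answer: it swaps the loop for NumPy's `np.logical_and.accumulate`, and the sample frame simply re-enters the rows shown in the question (the elided "....." rows are left out).

```python
import numpy as np
import pandas as pd

# Rows copied from the question; the elided "....." rows are omitted.
df = pd.DataFrame({
    "thing1": ["aaa", "bbb", "ccc", "aaa", "bbb", "ccc"],
    "thing2": ["abc", "cde", "def", "efg", "fgh", "ghi"],
    "num_col1": [0.0, 11.3, 91.3, 0.0, 41.3, 201.3],
    "num_col2": [99.76, 109.76, 0.0, 0.0, 299.76, 199.76],
    "num_col3": [101.1, 201.1, 301.1, 401.1, 0.0, 601.1],
    "num_col4": [111.2, 121.2, 131.2, 141.2, 151.2, 161.2],
})

cols = df.filter(like="num_col").columns
is_zero = df[cols].to_numpy() == 0

# Running AND from left to right: True only while every value so far is zero.
leading = np.logical_and.accumulate(is_zero, axis=1)

# Wrap the array so the mask aligns with df by index and column labels.
leading = pd.DataFrame(leading, index=df.index, columns=cols)
df[cols] = df[cols].mask(leading)
print(df)
```

The zeros in num_col2 (row 2) and num_col3 (row 4) survive because a non-zero value precedes them in the same row.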
If zeros at the right end of a row should be replaced too, you can try
for idx, col in enumerate(cols[1:-1]):
    m[col] = (m[cols[:idx + 2]].all(axis=1)) | (m[cols[idx + 1:]].all(axis=1))
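For the two-sided case, a rough loop-free sketch is to scan each row once from the left and once from the right and combine the two masks. This uses `np.logical_and.accumulate` rather than the loop above, and the small frame is made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with zeros at both ends and in the middle.
df = pd.DataFrame({
    "num_col1": [0.0, 5.0, 0.0],
    "num_col2": [0.0, 0.0, 0.0],
    "num_col3": [3.0, 7.0, 0.0],
    "num_col4": [0.0, 0.0, 0.0],
})

is_zero = df.to_numpy() == 0

# True while only zeros have been seen, scanning left to right.
from_left = np.logical_and.accumulate(is_zero, axis=1)

# Reverse the columns, accumulate, then reverse back to scan right to left.
from_right = np.logical_and.accumulate(is_zero[:, ::-1], axis=1)[:, ::-1]

edge_zeros = pd.DataFrame(from_left | from_right, index=df.index, columns=df.columns)
result = df.mask(edge_zeros)
print(result)
```

In the second row only the trailing zero in num_col4 is replaced; the zero in num_col2 sits between two non-zero values and is kept.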
CodePudding user response:
Finding the leading 0s usually involves a cumsum over the negated condition. Something like this:
mask = (df.filter(like='num_')  # extract numeric data, modify if needed
          .ne(0)                # True where the numbers are not 0
          .cumsum(axis=1)       # running count of non-zeros per row; stays 0 while only zeros have been seen
          .eq(0)                # True for each row's leading zeros
       )
Then you would have the mask as:
num_col1 num_col2 num_col3 num_col4
0 True False False False
1 False False False False
2 False False False False
3 False False False False
4 False False False False
5 True True False False
6 False False False False
7 False False False False
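To see why the cumsum trick isolates only the leading zeros, it can help to print each intermediate step for a single row. The row below is a made-up example with two leading zeros and one zero in the middle.

```python
import pandas as pd

# Hypothetical single row: two leading zeros, then a zero in the middle.
row = pd.DataFrame({
    "num_col1": [0.0],
    "num_col2": [0.0],
    "num_col3": [4.0],
    "num_col4": [0.0],
    "num_col5": [2.5],
})

nonzero = row.ne(0)               # False False True False True
running = nonzero.cumsum(axis=1)  # 0 0 1 1 2 (non-zeros seen so far)
leading = running.eq(0)           # True True False False False

print(nonzero, running, leading, sep="\n\n")
```

Once the first non-zero value is reached the running count is at least 1 and never drops back to 0, so the later zero in num_col4 is not flagged.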
Finally, you can use DataFrame.mask to replace the flagged values with NaN:
df[mask.columns] = df[mask.columns].mask(mask)
And you get (notice the 0 at line 5, num_col4):
thing1 thing2 num_col1 num_col2 num_col3 num_col4
0 aaa abc NaN 99.76 101.1 111.2
1 bbb cde 11.3 109.76 201.1 121.2
2 ccc def 91.3 0.00 301.1 131.2
3 ..... None NaN NaN NaN NaN
4 ..... None NaN NaN NaN NaN
5 aaa efg NaN NaN 401.1 0.0
6 bbb fgh 41.3 299.76 0.0 151.2
7 ccc ghi 201.3 199.76 601.1 161.2
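As a possible way to package this, the cumsum approach can be wrapped in a small helper that takes the columns explicitly and returns a copy. The function name `nan_leading_zeros` and the sample rows are illustrative assumptions, not part of the answer. The second row shows that a value that is already NaN counts as "not zero", so it is left alone and the zero after it is kept.

```python
import numpy as np
import pandas as pd

def nan_leading_zeros(df, cols):
    """Return a copy of df where each row's leading zeros in cols become NaN."""
    out = df.copy()
    # Running count of non-zero values; 0 means only zeros have been seen so far.
    leading = out[cols].ne(0).cumsum(axis=1).eq(0)
    out[cols] = out[cols].mask(leading)
    return out

# Hypothetical data: leading zeros, a pre-existing NaN, and a trailing zero.
df = pd.DataFrame({
    "thing1": ["aaa", "bbb", "ccc"],
    "num_col1": [0.0, np.nan, 0.0],
    "num_col2": [0.0, 0.0, 7.5],
    "num_col3": [401.1, 3.0, 0.0],
})

result = nan_leading_zeros(df, ["num_col1", "num_col2", "num_col3"])
print(result)
```

Because the helper works on a copy, the original df still holds its zeros afterwards.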