efficiently convert leading zeroes in each row to numpy.nan in pandas


I have a pandas dataframe like this:

   thing1  thing2   num_col1    num_col2   num_col3    num_col4
    aaa     abc      0.0         99.76     101.1        111.2
    bbb     cde      11.3        109.76    201.1        121.2
    ccc     def      91.3        0.0       301.1        131.2
   .....
   .....
    aaa     efg      0.0         0.0       401.1        141.2
    bbb     fgh      41.3        299.76    0.0          151.2
    ccc     ghi      201.3       199.76    601.1        161.2

and I want to convert it to this:

   thing1  thing2   num_col1    num_col2   num_col3    num_col4
    aaa     abc      nan         99.76     101.1        111.2
    bbb     cde      11.3        109.76    201.1        121.2
    ccc     def      91.3        0.0       301.1        131.2
   .....
   .....
    aaa     efg      nan         nan       401.1        141.2
    bbb     fgh      41.3        299.76    0.0          151.2
    ccc     ghi      201.3       199.76    601.1        161.2

Explanation: if the values of columns num_col1, num_col2, num_col3, and num_col4 start with 0, those leading 0s need to be converted to numpy.nan, while any 0s in the middle of a row should be left untouched. How do I do this as efficiently as possible (using built-in functions rather than Python loops)?

CodePudding user response:

IIUC, you can create a mask DataFrame marking where the num_col columns equal zero, then loop through the num_col columns, checking whether the values from num_col1 up to the current column are all True.

import numpy as np
import pandas as pd

cols = df.filter(like='num_col').columns
m = df[cols].eq(0)

# extend the mask column by column: a cell stays masked only if every
# column up to and including it is zero
for idx, col in enumerate(cols[1:]):
    m[col] = m[cols[:idx + 2]].all(axis=1)

df[cols] = df[cols].mask(m, np.nan)
print(df)

  thing1 thing2  num_col1  num_col2  num_col3  num_col4
0    aaa    abc       NaN     99.76     101.1     111.2
1    bbb    cde      11.3    109.76     201.1     121.2
2    ccc    def      91.3      0.00     301.1     131.2
3    aaa    efg       NaN       NaN     401.1     141.2
4    bbb    fgh      41.3    299.76       0.0     151.2
5    ccc    ghi     201.3    199.76     601.1     161.2
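The per-column loop above can also be replaced with a fully vectorized version: a row-wise cumprod of the zero mask stays 1 only while every value so far in the row is zero, which is exactly the leading-zero run. A minimal self-contained sketch (the sample DataFrame is reconstructed from the question's data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'thing1': ['aaa', 'bbb', 'ccc', 'aaa', 'bbb', 'ccc'],
    'thing2': ['abc', 'cde', 'def', 'efg', 'fgh', 'ghi'],
    'num_col1': [0.0, 11.3, 91.3, 0.0, 41.3, 201.3],
    'num_col2': [99.76, 109.76, 0.0, 0.0, 299.76, 199.76],
    'num_col3': [101.1, 201.1, 301.1, 401.1, 0.0, 601.1],
    'num_col4': [111.2, 121.2, 131.2, 141.2, 151.2, 161.2],
})

cols = df.filter(like='num_col').columns
# cumprod stays 1 only while every value so far in the row is zero,
# so it flags exactly the leading run of zeros
m = df[cols].eq(0).cumprod(axis=1).astype(bool)
df[cols] = df[cols].mask(m)
```

This drops the Python-level loop entirely, so the work scales with pandas' vectorized ops rather than with the number of columns.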

If you also want to allow zeros starting from the right, you can try

for idx, col in enumerate(cols[1:-1]):
    m[col] = m[cols[:idx + 2]].all(axis=1) | m[cols[idx + 1:]].all(axis=1)
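The zeros-from-both-ends variant can likewise be vectorized by applying the same row-wise cumprod trick to the reversed column order and flipping the result back. A sketch on a small made-up frame (the data here is illustrative, not from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'num_col1': [0.0, 5.0, 3.0],
    'num_col2': [1.0, 0.0, 2.0],
    'num_col3': [2.0, 7.0, 0.0],
    'num_col4': [3.0, 8.0, 0.0],
})

cols = df.filter(like='num_col').columns
z = df[cols].eq(0)
# leading zeros: cumprod left-to-right stays 1 only over the initial zero run
lead = z.cumprod(axis=1).astype(bool)
# trailing zeros: same trick on reversed column order, then flip back
trail = z.iloc[:, ::-1].cumprod(axis=1).iloc[:, ::-1].astype(bool)
df[cols] = df[cols].mask(lead | trail)
```

Zeros strictly in the middle of a row are flagged by neither direction, so they survive untouched.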

CodePudding user response:

To get the starting 0s, you usually take a cumsum over the negated condition. Something like this:

mask = (df.filter(like='num_')   # extract numeric columns, adjust if needed
          .ne(0)                 # True where the value is non-zero
          .cumsum(axis=1)        # running count of non-zeros in each row
          .eq(0)                 # still 0 => only zeros seen so far
       )

Then you would have the mask as:

   num_col1  num_col2  num_col3  num_col4
0      True     False     False     False
1     False     False     False     False
2     False     False     False     False
3     False     False     False     False
4     False     False     False     False
5      True      True     False     False
6     False     False     False     False
7     False     False     False     False

Finally, you can use the mask on your data:

df[mask.columns] = df[mask.columns].mask(mask)
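The mask construction and the masking step can also be fused into a single `where` call: keep a value only once the row has seen at least one non-zero entry. A small self-contained sketch (the sample data here is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'thing1': ['aaa', 'bbb'],
    'num_col1': [0.0, 11.3],
    'num_col2': [0.0, 0.0],
    'num_col3': [401.1, 201.1],
})

cols = df.filter(like='num_').columns
# keep a value only where the running count of non-zeros is already > 0,
# i.e. the row's leading-zero run is over; everything else becomes NaN
df[cols] = df[cols].where(df[cols].ne(0).cumsum(axis=1).gt(0))
```

`where` is the complement of `mask`, so negating the condition (`.gt(0)` instead of `.eq(0)`) gives the same result in one expression.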

And you get (notice the untouched 0 at row 5, num_col4):

  thing1 thing2  num_col1  num_col2  num_col3  num_col4
0    aaa    abc       NaN     99.76     101.1     111.2
1    bbb    cde      11.3    109.76     201.1     121.2
2    ccc    def      91.3      0.00     301.1     131.2
3  .....   None       NaN       NaN       NaN       NaN
4  .....   None       NaN       NaN       NaN       NaN
5    aaa    efg       NaN       NaN     401.1       0.0
6    bbb    fgh      41.3    299.76       0.0     151.2
7    ccc    ghi     201.3    199.76     601.1     161.2