Home > other >  Replace specific values in a dataframe by column mean in pandas
Replace specific values in a dataframe by column mean in pandas

Time:07-08

I'm a python beginner and I'm trying to do some operations with dataframes that I usually do with R language.

I Have a large dataframe with 2592 rows and 205 columns and I want to replace the 0.0 values by half the minimum value of its column.

An example with a random dataframe would be:

>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(1)
>>> df = pd.DataFrame(np.random.randint(0,10, size=(3,5)), columns = ['A', 'B', 'C', 'D', 'E'])
>>> print(df)
   A  B  C  D  E
0  5  8  9  5  0
1  0  1  7  6  9
2  2  4  5  2  4

And the result I'm looking for is:

   A  B  C  D  E
0  5  8  9  5  2
1  1  1  7  6  9
2  2  4  5  2  4

Intuitively I would do it like this:

>>> for column in df:
        for element in column:
            if element == 0:
                element = df[column].min()/2

But it doesn't work... any help?

Thank you!

CodePudding user response:

Use DataFrame.mask with replace minimum values without 0 divide by 2:

df1 = df.mask(df.eq(0), df.replace(0, np.nan).min().div(2), axis=1)
print(df1)
   A  B  C  D  E
0  5  8  9  5  2
1  1  1  7  6  9
2  2  4  5  2  4

For more efficient solution is possible use (thanks @mozway):

m = df.eq(0) 
df1 = df.mask(m, df[~m].min().div(2), axis=1)

CodePudding user response:

To work on your "intuitive" way, this is how to do it.

Use a function to perform the fancy logics you need. Pandas has .apply function is optimised, so it should be sufficiently fast anyway.

import pandas as pd
import numpy as np
np.random.seed(1)
df = pd.DataFrame(np.random.randint(0,10, size=(3,5)), columns = ['A', 'B', 'C', 'D', 'E'])

def make_half_minimum(value, dataseries):
    if value == 0:
        dataseries_ = dataseries[dataseries!=0]
        return dataseries_.min()/2
    else:
        return value

for column_name in df.columns:
    df[column_name] = df[column_name].apply(lambda x: make_half_minimum(x,df[column_name]))

print(df)
     A  B  C  D    E
0  5.0  8  9  5  2.0
1  1.0  1  7  6  9.0
2  2.0  4  5  2  4.0
[Finished in 521ms]
  • Related