Replace DataFrame rows with missing string values with the row mean

Time:03-01

I have a DataFrame as follows. Column "b" holds strings containing at most 5 comma-separated numbers. I want to treat rows with fewer than 5 numbers as having missing values. For example, the second row has only 2 numbers, so I want to fill the whole row with the mean of 4 and 6. I want the same thing for the third row.

import pandas as pd
df = pd.DataFrame()
df['a'] = [ 1,  2, 3 ]
df['b'] = [ '2, 3, 4, 5,6' , '4,6', '  6,8']

That is the DataFrame I have.

Here is the DataFrame I want: [image not available]

CodePudding user response:

Here's one approach:

Splitting on comma and expanding creates a DataFrame; then mask the rows with NaN values and fill them with the mean:

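# Split each string on commas; short rows get NaN in the trailing columns
tmp = df['b'].str.split(',', expand=True).astype(float)
cols = [f'b{i}' for i in range(1, tmp.shape[1] + 1)]
# Blank out every row that contains a NaN, then fill each row with its own mean
df[cols] = (tmp.mask(tmp.isna().any(axis=1))
               .T.fillna(tmp.mean(axis=1)).T
               .astype(int))
df = df.drop(columns='b')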

Output:

   a  b1  b2  b3  b4  b5
0  1   2   3   4   5   6
1  2   5   5   5   5   5
2  3   7   7   7   7   7
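
The same idea can be sketched without the double transpose. This is only an illustration, assuming the sample data from the question; `expand_with_row_mean` is a hypothetical helper name, not a pandas function. It keeps the values as floats, since `.astype(int)` truncates a non-integer mean (for example, a row '4,7' would give 5 rather than 5.5). The number of output columns is inferred from the longest row.

```python
import pandas as pd

def expand_with_row_mean(df, col="b", sep=","):
    # Hypothetical helper for illustration; not part of pandas.
    parts = df[col].str.split(sep, expand=True).astype(float)
    short = parts.isna().any(axis=1)   # True for rows with fewer values than the longest row
    row_mean = parts.mean(axis=1)      # NaN cells are skipped by default
    # Column by column: keep the value in complete rows, use the row mean in short rows
    filled = parts.apply(lambda c: c.where(~short, row_mean))
    filled.columns = [f"{col}{i}" for i in range(1, parts.shape[1] + 1)]
    return df.drop(columns=col).join(filled)

df = pd.DataFrame({"a": [1, 2, 3], "b": ["2, 3, 4, 5,6", "4,6", "  6,8"]})
out = expand_with_row_mean(df)
print(out)
```

If integer columns are required and all means are known to be whole numbers, `out` can be cast afterwards with `.astype(int)`.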

CodePudding user response:

We apply eval to column b, replace tuples shorter than 5 with tuples of length 5 filled with their average, expand the tuples into columns, and rename the columns:

df2 = (df['b'].apply(eval)
              .apply(lambda t: t if len(t)==5 else (sum(t)/len(t),)*5)
              .apply(pd.Series)
              .rename(columns = lambda n:f'b{n + 1}')
    )

We then join it to the original column a:

df[['a']].join(df2)

output:


    a   b1  b2  b3  b4  b5
0   1   2.0 3.0 4.0 5.0 6.0
1   2   5.0 5.0 5.0 5.0 5.0
2   3   7.0 7.0 7.0 7.0 7.0
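
Because eval executes whatever Python expression the string contains, it may be worth parsing the numbers explicitly when the data is not trusted. A minimal sketch of that variant follows; `parse_and_pad` and its `width` parameter are hypothetical names introduced here, and the sketch assumes every row holds at least one number (an empty string would raise ZeroDivisionError).

```python
import pandas as pd

def parse_and_pad(text, width=5):
    # Hypothetical helper: parse 'x, y, ...' into floats without eval.
    nums = [float(tok) for tok in text.split(",") if tok.strip()]
    if len(nums) == width:
        return nums
    # Short row: repeat its mean `width` times
    return [sum(nums) / len(nums)] * width

df = pd.DataFrame({"a": [1, 2, 3], "b": ["2, 3, 4, 5,6", "4,6", "  6,8"]})
cols = [f"b{i}" for i in range(1, 6)]
df2 = pd.DataFrame(df["b"].map(parse_and_pad).tolist(), index=df.index, columns=cols)
result = df[["a"]].join(df2)
print(result)
```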