Pandas - Apply a function on a comma-separated column of feature names and store the weights in sepa

Time:10-18

Consider the following dataframe df, in which the features column is a string of comma-separated feature names from a dataset (df can potentially be large).

index    features
1        'f1'  
2        'f1, f2'
3        'f1, f2, f3'

I also have a function get_weights that accepts a comma-separated string of feature names and returns a list-like of weights, one for each given feature. The implementation details are not important; for the sake of simplicity, assume that the function returns equal weights for each feature:

import numpy as np
def get_weights(features):
   features = features.split(', ')
   return np.ones(len(features)) / len(features)

Using pandas, how can I apply the get_weights on df and have the results in a new dataframe as below:

index   f1     f2    f3 
1       1      0      0
2       0.5    0.5    0
3       0.33   0.33   0.33

That is, in the resulting dataframe, the features in df.features are turned into columns that contain the weight for that feature per row.

CodePudding user response:

Option 1

Considering that the goal is to apply the function to the features column of the dataframe, one can use pandas.Series.apply as follows

df = df['features'].apply(lambda x: pd.Series(get_weights(x)))

[Out]:

          0         1         2
0  1.000000       NaN       NaN
1  0.500000  0.500000       NaN
2  0.333333  0.333333  0.333333

However, in order to obtain the desired output, there are still a few things one has to do.

First, adjust the previous operation to fill the NaN with 0

df = df['features'].apply(lambda x: pd.Series(get_weights(x))).fillna(0)

[Out]:

          0         1         2
0  1.000000  0.000000  0.000000
1  0.500000  0.500000  0.000000
2  0.333333  0.333333  0.333333

Second, one wants the name of the columns to be, respectively, f1, f2, and f3. For that, one can do the following

df = df['features'].apply(lambda x: pd.Series(get_weights(x))).fillna(0).rename(columns={0: 'f1', 1: 'f2', 2: 'f3'})

[Out]:

         f1        f2        f3
0  1.000000  0.000000  0.000000
1  0.500000  0.500000  0.000000
2  0.333333  0.333333  0.333333

Now, the result of the previous operation still lacks the index column starting at 1. To add it, one can simply do the following

df['index'] = df.index + 1

[Out]:

   index        f1        f2        f3
0      1  1.000000  0.000000  0.000000
1      2  0.500000  0.500000  0.000000
2      3  0.333333  0.333333  0.333333

Finally, if the goal is to make the index column the index of the dataframe, one can use pandas.DataFrame.set_index as follows

df = df.set_index('index')

[Out]:

             f1        f2        f3
index                              
1      1.000000  0.000000  0.000000
2      0.500000  0.500000  0.000000
3      0.333333  0.333333  0.333333
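The steps of Option 1 can also be chained into a single runnable sketch that leaves the original df untouched. The name weights and the generated column labels (f1, f2, ...) are assumptions made for illustration. As in the steps above, the labels are assigned by position, so this only matches the desired output when every row lists its features in the same order starting from f1.

```python
import numpy as np
import pandas as pd


def get_weights(features):
    # Equal weights, as in the question
    names = features.split(', ')
    return np.ones(len(names)) / len(names)


df = pd.DataFrame({'features': ['f1', 'f1, f2', 'f1, f2, f3']})

# 'weights' is a hypothetical name; df itself is not overwritten
weights = (
    df['features']
    .apply(lambda s: pd.Series(get_weights(s)))  # one column per position
    .fillna(0)                                   # shorter rows get 0
    .rename(columns=lambda i: f'f{i + 1}')       # 0 -> 'f1', 1 -> 'f2', ...
)

# 1-based index named 'index', as in the desired output
weights.index = pd.RangeIndex(1, len(weights) + 1, name='index')

print(weights)
```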

Option 2

If one doesn't want to use .apply() (as per the first Note below), another option, and a one-liner that satisfies all the requirements, would be to create a new dataframe as follows

df_new = pd.DataFrame([get_weights(x) for x in df['features']]).fillna(0).rename(columns={0: 'f1', 1: 'f2', 2: 'f3'}).set_index(pd.Series(range(1, len(df) + 1), name='index'))

[Out]:

             f1        f2        f3
index                              
1      1.000000  0.000000  0.000000
2      0.500000  0.500000  0.000000
3      0.333333  0.333333  0.333333

Notes:

CodePudding user response:

You can use:

df2 = (pd.DataFrame([get_weights(s) for s in df['features']], index=df.index)
         .fillna(0).rename(columns=lambda x: f'f{x + 1}')
       )
out = df.drop(columns='features').join(df2)

output:

   index        f1        f2        f3
0      1  1.000000  0.000000  0.000000
1      2  0.500000  0.500000  0.000000
2      3  0.333333  0.333333  0.333333
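Note that the approaches so far label the weights by position, which works for the example because each row is a prefix of f1, f2, f3. If a row can contain an arbitrary subset such as 'f1, f4', one possible sketch is to pair each weight with its feature name before building the frame. The helper weights_by_name is hypothetical, and the extra 'f1, f4' row is added here only to show the alignment.

```python
import numpy as np
import pandas as pd


def get_weights(features):
    # Equal weights, as in the question
    names = features.split(', ')
    return np.ones(len(names)) / len(names)


def weights_by_name(features):
    # Hypothetical helper: map each feature name to its weight
    names = features.split(', ')
    return dict(zip(names, get_weights(features)))


df = pd.DataFrame({'features': ['f1', 'f1, f2', 'f1, f2, f3', 'f1, f4']})

# A list of dicts is aligned on the keys; missing features become NaN, then 0
out = (
    pd.DataFrame([weights_by_name(s) for s in df['features']], index=df.index)
    .fillna(0)
    .sort_index(axis=1)
)

print(out)
```

Because the columns come from the dictionary keys, get_weights can return any weights, as long as they follow the order of the names in the input string.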

CodePudding user response:

Using the function get_dummies from pandas you can do:

# 0- Let's define an example pandas DataFrame:

df = pd.DataFrame(
    {
        "features": ["f1", "f1, f2", "f1, f2, f3", "f1, f4"]
    }
)

# 1- Convert column of strings into Series of lists:

aux_series = df["features"].str.split(", ")

# 2- Use get_dummies function, transpose the result and fill NaN's

aux_df = pd.concat([pd.get_dummies(aux_series[i]).sum() for i in df.index], axis=1).T.fillna(0)

# 3- Get the 'weight' of each value by dividing it by its row sum

output_df = aux_df.div(aux_df.sum(axis=1), axis=0)

# 4- Print the result:

print(output_df)

[Out]:

         f1        f2        f3   f4
0  1.000000  0.000000  0.000000  0.0
1  0.500000  0.500000  0.000000  0.0
2  0.333333  0.333333  0.333333  0.0
3  0.500000  0.000000  0.000000  0.5
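Steps 1 and 2 above can likely be shortened with pandas.Series.str.get_dummies, which splits each string on a separator and returns one indicator column per distinct value. The sketch below assumes the separator is always a comma followed by a single space. Like the answer above, it only reproduces equal weights and does not call get_weights.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "features": ["f1", "f1, f2", "f1, f2, f3", "f1, f4"]
    }
)

# One 0/1 column per distinct feature name
indicators = df["features"].str.get_dummies(sep=", ")

# Divide each row by its number of features to get equal weights
output_df = indicators.div(indicators.sum(axis=1), axis=0)

print(output_df)
```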