Consider the following dataframe df, in which the features column holds a string of comma-separated feature names in a dataset (df can potentially be large).
index features
1 'f1'
2 'f1, f2'
3 'f1, f2, f3'
I also have a function get_weights that accepts a comma-separated string of feature names and returns a list containing a weight for each given feature. The implementation details are not important; for the sake of simplicity, let's assume the function returns equal weights for each feature:
import numpy as np
def get_weights(features):
    features = features.split(', ')
    return np.ones(len(features)) / len(features)
Using pandas, how can I apply get_weights to df and get the results in a new dataframe like the one below?
index f1 f2 f3
1 1 0 0
2 0.5 0.5 0
3 0.33 0.33 0.33
That is, in the resulting dataframe, the features in df.features are turned into columns that contain the weight for that feature per row.
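For anyone who wants to reproduce the setup, the example frame can be built roughly as follows. This is only a minimal sketch: the column name features comes from the table above, while the default 0-based index is an assumption, since the question does not say whether index is a regular column or the dataframe's index.

```python
import numpy as np
import pandas as pd

def get_weights(features):
    # Equal weight for every feature named in the comma-separated string
    features = features.split(', ')
    return np.ones(len(features)) / len(features)

# Example data from the question (default RangeIndex is an assumption)
df = pd.DataFrame({'features': ['f1', 'f1, f2', 'f1, f2, f3']})

# Show the weights computed for each row
for s in df['features']:
    print(s, '->', get_weights(s))
```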
CodePudding user response:
Option 1
Considering that the goal is to apply the function to the features column, one can use pandas.Series.apply as follows:
import pandas as pd
df = df['features'].apply(lambda x: pd.Series(get_weights(x)))
[Out]:
0 1 2
0 1.000000 NaN NaN
1 0.500000 0.500000 NaN
2 0.333333 0.333333 0.333333
However, in order to obtain the desired output, there are still a few things one has to do.
First, adjust the previous operation to fill the NaN values with 0:
df = df['features'].apply(lambda x: pd.Series(get_weights(x))).fillna(0)
[Out]:
0 1 2
0 1.000000 0.000000 0.000000
1 0.500000 0.500000 0.000000
2 0.333333 0.333333 0.333333
Second, one wants the columns to be named f1, f2, and f3, respectively. For that, one can do the following:
df = df['features'].apply(lambda x: pd.Series(get_weights(x))).fillna(0).rename(columns={0: 'f1', 1: 'f2', 2: 'f3'})
[Out]:
f1 f2 f3
0 1.000000 0.000000 0.000000
1 0.500000 0.500000 0.000000
2 0.333333 0.333333 0.333333
Now, since the result of the previous operation is still missing the index column starting at 1, one can simply add it as follows:
df['index'] = df.index + 1
[Out]:
f1 f2 f3 index
0 1.000000 0.000000 0.000000 1
1 0.500000 0.500000 0.000000 2
2 0.333333 0.333333 0.333333 3
Finally, if the goal is to make the index column the index of the dataframe, one can use pandas.DataFrame.set_index as follows:
df = df.set_index('index')
[Out]:
f1 f2 f3
index
1 1.000000 0.000000 0.000000
2 0.500000 0.500000 0.000000
3 0.333333 0.333333 0.333333
Option 2
If one doesn't want to use .apply() (as per the note below), another option, a one-liner that satisfies all the requirements, is to create a new dataframe as follows:
df_new = pd.DataFrame([get_weights(x) for x in df['features']]).fillna(0).rename(columns={0: 'f1', 1: 'f2', 2: 'f3'}).set_index(pd.Series(range(1, len(df) + 1), name='index'))
[Out]:
f1 f2 f3
index
1 1.000000 0.000000 0.000000
2 0.500000 0.500000 0.000000
3 0.333333 0.333333 0.333333
Notes:
- There are strong opinions on using .apply(). I would recommend reading this: When should I (not) want to use pandas apply() in my code?
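Both options above rename the columns by position, which only works when every row lists a prefix of f1, f2, f3. If rows can name arbitrary features, one possibility is to pair each name with its weight before building the frame. The following is a sketch under the assumption that get_weights returns the weights in the same order as the names appear in the string; the extra row 'f1, f4' and the names rows and weights are added here for illustration only.

```python
import numpy as np
import pandas as pd

def get_weights(features):
    # Equal weights, as in the question
    features = features.split(', ')
    return np.ones(len(features)) / len(features)

# 'f1, f4' is a hypothetical extra row that skips f2 and f3
df = pd.DataFrame({'features': ['f1', 'f1, f2', 'f1, f2, f3', 'f1, f4']})

# One dict per row, mapping each feature name to its weight
# (assumes the weights come back in the same order as the names)
rows = [dict(zip(s.split(', '), get_weights(s))) for s in df['features']]

# Features absent from a row come out as NaN, so fill them with 0
weights = pd.DataFrame(rows, index=df.index).fillna(0)
print(weights)
```

With this approach the column labels come from the data itself, so no rename step is needed.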
CodePudding user response:
You can use:
df2 = (pd.DataFrame([get_weights(s) for s in df['features']], index=df.index)
         .fillna(0).rename(columns=lambda x: f'f{x + 1}')
      )
out = df.drop(columns='features').join(df2)
output:
index f1 f2 f3
0 1 1.000000 0.000000 0.000000
1 2 0.500000 0.500000 0.000000
2 3 0.333333 0.333333 0.333333
CodePudding user response:
Using the function get_dummies from pandas you can do:
import pandas as pd
# 0- Let's define an example pandas DataFrame:
df = pd.DataFrame(
    {
        "features": ["f1", "f1, f2", "f1, f2, f3", "f1, f4"]
    }
)
# 1- Convert column of strings into Series of lists:
aux_series = df["features"].str.split(", ")
# 2- Use get_dummies function, transpose the result and fill NaN's
aux_df = pd.concat([pd.get_dummies(aux_series[i]).sum() for i in df.index], axis=1).T.fillna(0)
# 3- Get the 'weight' of each value by dividing by its row sum
output_df = aux_df.div(aux_df.sum(axis=1), axis=0)
# 4- Print the result:
print(output_df)
[Out]:
f1 f2 f3 f4
0 1.000000 0.000000 0.000000 0.0
1 0.500000 0.500000 0.000000 0.0
2 0.333333 0.333333 0.333333 0.0
3 0.500000 0.000000 0.000000 0.5
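A more compact variant of the same idea might use pandas.Series.str.get_dummies, which splits on a separator and builds the indicator columns in one call. This is a sketch only, and like the answer above it reproduces the equal-weights case without calling get_weights, so it would not carry over to a different weighting function. The names indicators and weights are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"features": ["f1", "f1, f2", "f1, f2, f3", "f1, f4"]})

# One 0/1 indicator column per feature name found in the strings
indicators = df["features"].str.get_dummies(sep=", ")

# Divide each row by its number of features to get equal weights
weights = indicators.div(indicators.sum(axis=1), axis=0)
print(weights)
```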