I have a data frame with four columns whose values lie between 0 and 100. In a new column I want to assign a value dependent on the values in the first four columns.
The values from the first four columns will be assigned a number 0, 1 or 2 and then summed together as follows:
0 - 30 = 0
31 - 70 = 1
71 - 100 = 2
So the maximum number in the fifth column will be 8 and the minimum 0.
In the example data frame below, the fifth column should come out as 3 and 6. (Just in case I haven't described this clearly.)
I'm still very new to Python, and at this stage the only string to my bow is a very long and cumbersome set of nested if statements, followed by df['E'] = df.apply().
My question is: what is the best and most efficient function/method for populating the fifth column?
import pandas as pd

data = {
    'A': [50, 90],
    'B': [2, 4],
    'C': [20, 80],
    'D': [75, 72],
}
df = pd.DataFrame(data)
CodePudding user response:
Edit: a more comprehensive method with np.select:
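import numpy as np

# One boolean mask per range, each with the same shape as df
condlist = [(0 <= df) & (df <= 30),
            (31 <= df) & (df <= 70),
            (71 <= df) & (df <= 100)]
# Score given to the cells matched by the corresponding mask
choicelist = [0, 1, 2]
# Cells that match no mask fall back to np.select's default of 0
df['E'] = np.select(condlist, choicelist).sum(axis=1)
print(df)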
# Output
    A  B   C   D  E
0  50  2  20  75  3
1  90  4  80  72  6
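Because the three ranges are contiguous, np.digitize can stand in for the explicit conditions. This is only a sketch under my own assumptions: the `cols` list is a name I introduced, and the bin edges [30, 70] with right=True are chosen so that 30 scores 0 and 31 scores 1. Note that a non-integer value such as 30.5 would score 1 here, whereas the np.select conditions would leave it at the default of 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [50, 90], 'B': [2, 4], 'C': [20, 80], 'D': [75, 72]})

# Hypothetical helper name: the four source columns to score
cols = ['A', 'B', 'C', 'D']

# With right=True each value gets the index i where bins[i-1] < x <= bins[i],
# so x <= 30 -> 0, 30 < x <= 70 -> 1, x > 70 -> 2
codes = np.digitize(df[cols].to_numpy(), bins=[30, 70], right=True)

# Sum the per-cell scores across each row
df['E'] = codes.sum(axis=1)
print(df)
```

Selecting the columns explicitly also means the snippet can be rerun after E already exists without E being scored as well.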
Alternatively, use pd.cut after flattening your dataframe into one column with melt:
# include_lowest=True so that a value of exactly 0 falls in the first bin
df['E'] = pd.cut(pd.melt(df, ignore_index=False)['value'],
                 bins=[0, 30, 70, 100], labels=[0, 1, 2],
                 include_lowest=True) \
            .cat.codes.groupby(level=0).sum()
print(df)
# Output
    A  B   C   D  E
0  50  2  20  75  3
1  90  4  80  72  6
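One thing to watch: pd.melt(df) flattens every column, so if E already exists (say, from an earlier attempt) it gets binned and summed along with the rest. A possible guard, sketched here with a `cols` list that is my own naming, is to pass value_vars so only the four source columns are melted:

```python
import pandas as pd

df = pd.DataFrame({'A': [50, 90], 'B': [2, 4], 'C': [20, 80], 'D': [75, 72]})
df['E'] = 0  # stand-in for a column left over from an earlier attempt

# Hypothetical helper name: the four source columns to score
cols = ['A', 'B', 'C', 'D']

# Melt only the source columns; ignore_index=False keeps the row labels
long_values = pd.melt(df, value_vars=cols, ignore_index=False)['value']

# include_lowest=True so a value of exactly 0 lands in the first bin
codes = pd.cut(long_values, bins=[0, 30, 70, 100], labels=[0, 1, 2],
               include_lowest=True).cat.codes

# Add the codes back up per original row
df['E'] = codes.groupby(level=0).sum()
print(df)
```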
Details:
>>> pd.melt(df, ignore_index=False)
  variable  value
0        A     50
1        A     90
0        B      2
1        B      4
0        C     20
1        C     80
0        D     75
1        D     72
>>> pd.cut(pd.melt(df, ignore_index=False)['value'],
           bins=[0, 30, 70, 100], labels=[0, 1, 2])
0    1
1    2
0    0
1    0
0    0
1    2
0    2
1    2
Name: value, dtype: category
Categories (3, int64): [0 < 1 < 2]
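A caveat on the bin edges: with pd.cut's defaults the intervals are (0, 30], (30, 70] and (70, 100], so a value of exactly 0 falls outside the first bin, becomes NaN and gets the code -1. Passing include_lowest=True closes the first interval on the left. The small check below uses made-up boundary values (not the data from the question) to show the difference:

```python
import pandas as pd

# Made-up boundary values chosen to probe each bin edge
values = pd.Series([0, 30, 31, 70, 71, 100])
bins = [0, 30, 70, 100]

# Default behaviour: the first bin is (0, 30], so 0 is left uncategorised
default_codes = pd.cut(values, bins=bins, labels=[0, 1, 2]).cat.codes

# include_lowest=True makes the first bin include its left edge
inclusive_codes = pd.cut(values, bins=bins, labels=[0, 1, 2],
                         include_lowest=True).cat.codes

print(default_codes.tolist())    # [-1, 0, 1, 1, 2, 2]
print(inclusive_codes.tolist())  # [0, 0, 1, 1, 2, 2]
```

The example data contains no zeros, so both variants give the same result there, but a -1 code would silently lower the row sum on data that does.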