I have a dataframe with multiple columns with numerical values. I wanted to new columns which compare the values of other columns and assign its column name as label. I already understood its logic in r, but wondering how should I do this easily in python. Can anyone point me out how this can be done in python when we try to add new column where need to compare value of multiple columns and assign column name which has max value? Any idea?
reproducible example
this is 100% working reproducible example in R:
library(data.table)
df <- data.frame(a = sample(seq(1:10), size=10), b = sample(LETTERS[1:10], size=10), cnt=sample(seq(1:100), size=5),
RECENT_MOV= sample(seq(1:1000), size = 10),
RETIRED= sample(seq(1:200), size = 10),
SERV_EMPL= sample(seq(1:500), size = 10),
SUB_BUS=sample(seq(1:2000), size = 10),
WORK_HOME=sample(seq(1:1200), size = 10)
)
dt <- as.data.table(df)
write.csv(dt, "sample.csv")
label = c("RECENT_MOV", "RETIRED", "SERV_EMPL", "SUB_BUS","WORK_HOME")
df$category <- NA_character_
df[, row_ind:= 1:nrow(df)]
df[cnt > 2, category := names(which.max(.SD[, label, with = FALSE])), by = row_ind]
current output is:
> dput(dt)
structure(list(a = c(5L, 10L, 1L, 6L, 7L, 3L, 2L, 8L, 4L, 9L),
b = c("E", "A", "D", "H", "J", "F", "G", "I", "C", "B"),
cnt = c(13L, 88L, 45L, 92L, 70L, 13L, 88L, 45L, 92L, 70L),
RECENT_MOV = c(70L, 195L, 620L, 572L, 354L, 648L, 798L, 657L,
233L, 672L), RETIRED = c(189L, 195L, 191L, 88L, 148L, 186L,
39L, 78L, 158L, 55L), SERV_EMPL = c(65L, 151L, 415L, 383L,
255L, 207L, 210L, 470L, 181L, 188L), SUB_BUS = c(894L, 829L,
1798L, 502L, 897L, 1461L, 744L, 1991L, 260L, 1697L), WORK_HOME = c(553L,
739L, 454L, 137L, 435L, 1042L, 316L, 697L, 517L, 1158L),
category = c("SUB_BUS", "SUB_BUS", "SUB_BUS", "RECENT_MOV",
"SUB_BUS", "SUB_BUS", "RECENT_MOV", "SUB_BUS", "WORK_HOME",
"SUB_BUS"), row_ind = 1:10), row.names = c(NA, -10L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x0000015a64b61ef0>)
my current python attempt
import pandas as pd
df=pd.read_csv("sample.csv", index_col=None, header=0)
label = ["RECENT_MOV", "RETIRED", "SERV_EMPL", "SUB_BUS","WORK_HOME"]
df['category'] = pd.NA
df['row_ind'] = range(1,len(df))
however, I have trouble to make this line in pythonic way:
df[cnt > 2, category := names(which.max(.SD[, label, with = FALSE])), by = row_ind]
basically, this lines said create new column variable called category where comparing columns in label where whichever column has max value, its column name will be assigned as value in category column. How should I do it this easily in python?
logic translation:
df[cnt > 2, category := names(which.max(.SD[, label, with = FALSE])), by = row_ind]
this line telling us that first do filter by cnt column where cnt > 2
, then compare columns values of df[["RECENT_MOV", "RETIRED", "SERV_EMPL", "SUB_BUS","WORK_HOME"]]
and pick the column with highest value by row-wise and assign that name of that column as value to df['category']=col_name_with_highest_value_in_each_row
.
desirable output
this is desirable output that I want to produce in python:
a b cnt RECENT_MOV RETIRED SERV_EMPL SUB_BUS WORK_HOME category row_ind
1 5 E 13 70 189 65 894 553 SUB_BUS 1
2 10 A 88 195 195 151 829 739 SUB_BUS 2
3 1 D 45 620 191 415 1798 454 SUB_BUS 3
4 6 H 92 572 88 383 502 137 RECENT_MOV 4
5 7 J 70 354 148 255 897 435 SUB_BUS 5
6 3 F 13 648 186 207 1461 1042 SUB_BUS 6
7 2 G 88 798 39 210 744 316 RECENT_MOV 7
8 8 I 45 657 78 470 1991 697 SUB_BUS 8
9 4 C 92 233 158 181 260 517 WORK_HOME 9
10 9 B 70 672 55 188 1697 1158 SUB_BUS 10
CodePudding user response:
This is actually really simple with pandas. Have a list of the columns to search in, and then use idxmax
with axis=1
:
# Filter out rows where `cnt` is less than or equal to 2
df = df[df['cnt'] > 2]
# Determine category for each row
search_cols = ['RECENT_MOV', 'RETIRED', 'SERV_EMPL', 'SUB_BUS', 'WORK_HOME']
df['category'] = df[search_cols].idxmax(axis=1)
# Assign row indexes
df['row_ind'] = df.index
Output:
>>> df
a b cnt RECENT_MOV RETIRED SERV_EMPL SUB_BUS WORK_HOME category row_ind
1 1 C 76 452 62 55 115 247 RECENT_MOV 1
2 7 E 14 50 165 337 1165 810 SUB_BUS 2
3 2 A 46 523 167 423 784 707 SUB_BUS 3
4 3 H 3 38 144 473 745 437 SUB_BUS 4
5 5 I 59 743 127 261 351 190 RECENT_MOV 5
6 8 J 76 143 49 470 1612 935 SUB_BUS 6
7 4 D 14 818 101 418 1919 314 SUB_BUS 7
8 6 F 46 714 9 446 1432 938 SUB_BUS 8
9 10 B 3 585 160 14 107 489 RECENT_MOV 9
10 9 G 59 814 73 449 937 287 SUB_BUS 10