How to simplify data.table logic and make it doable in pandas?-CodePudding

I have a dataframe with multiple columns with numerical values. I wanted to new columns which compare the values of other columns and assign its column name as label. I already understood its logic in r, but wondering how should I do this easily in python. Can anyone point me out how this can be done in python when we try to add new column where need to compare value of multiple columns and assign column name which has max value? Any idea?

reproducible example

this is 100% working reproducible example in R:

library(data.table)

df <- data.frame(a = sample(seq(1:10), size=10), b = sample(LETTERS[1:10], size=10), cnt=sample(seq(1:100), size=5),
                 RECENT_MOV= sample(seq(1:1000), size = 10),
                 RETIRED= sample(seq(1:200), size = 10),
                 SERV_EMPL= sample(seq(1:500), size = 10),
                 SUB_BUS=sample(seq(1:2000), size = 10),
                 WORK_HOME=sample(seq(1:1200), size = 10)
                 )

dt <- as.data.table(df)
write.csv(dt, "sample.csv")
label = c("RECENT_MOV", "RETIRED", "SERV_EMPL", "SUB_BUS","WORK_HOME")
df$category <- NA_character_
df[, row_ind:= 1:nrow(df)]
df[cnt > 2, category := names(which.max(.SD[, label, with = FALSE])), by = row_ind]

current output is:

> dput(dt)
structure(list(a = c(5L, 10L, 1L, 6L, 7L, 3L, 2L, 8L, 4L, 9L), 
    b = c("E", "A", "D", "H", "J", "F", "G", "I", "C", "B"), 
    cnt = c(13L, 88L, 45L, 92L, 70L, 13L, 88L, 45L, 92L, 70L), 
    RECENT_MOV = c(70L, 195L, 620L, 572L, 354L, 648L, 798L, 657L, 
    233L, 672L), RETIRED = c(189L, 195L, 191L, 88L, 148L, 186L, 
    39L, 78L, 158L, 55L), SERV_EMPL = c(65L, 151L, 415L, 383L, 
    255L, 207L, 210L, 470L, 181L, 188L), SUB_BUS = c(894L, 829L, 
    1798L, 502L, 897L, 1461L, 744L, 1991L, 260L, 1697L), WORK_HOME = c(553L, 
    739L, 454L, 137L, 435L, 1042L, 316L, 697L, 517L, 1158L), 
    category = c("SUB_BUS", "SUB_BUS", "SUB_BUS", "RECENT_MOV", 
    "SUB_BUS", "SUB_BUS", "RECENT_MOV", "SUB_BUS", "WORK_HOME", 
    "SUB_BUS"), row_ind = 1:10), row.names = c(NA, -10L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x0000015a64b61ef0>)

my current python attempt

import pandas as pd

df=pd.read_csv("sample.csv", index_col=None, header=0)

label = ["RECENT_MOV", "RETIRED", "SERV_EMPL", "SUB_BUS","WORK_HOME"]
df['category'] = pd.NA
df['row_ind'] = range(1,len(df))

however, I have trouble to make this line in pythonic way:

df[cnt > 2, category := names(which.max(.SD[, label, with = FALSE])), by = row_ind]

basically, this lines said create new column variable called category where comparing columns in label where whichever column has max value, its column name will be assigned as value in category column. How should I do it this easily in python?

logic translation:

df[cnt > 2, category := names(which.max(.SD[, label, with = FALSE])), by = row_ind]

this line telling us that first do filter by cnt column where cnt > 2, then compare columns values of df[["RECENT_MOV", "RETIRED", "SERV_EMPL", "SUB_BUS","WORK_HOME"]] and pick the column with highest value by row-wise and assign that name of that column as value to df['category']=col_name_with_highest_value_in_each_row.

desirable output

this is desirable output that I want to produce in python:

    a b cnt RECENT_MOV RETIRED SERV_EMPL SUB_BUS WORK_HOME   category row_ind
1   5 E  13         70     189        65     894       553    SUB_BUS       1
2  10 A  88        195     195       151     829       739    SUB_BUS       2
3   1 D  45        620     191       415    1798       454    SUB_BUS       3
4   6 H  92        572      88       383     502       137 RECENT_MOV       4
5   7 J  70        354     148       255     897       435    SUB_BUS       5
6   3 F  13        648     186       207    1461      1042    SUB_BUS       6
7   2 G  88        798      39       210     744       316 RECENT_MOV       7
8   8 I  45        657      78       470    1991       697    SUB_BUS       8
9   4 C  92        233     158       181     260       517  WORK_HOME       9
10  9 B  70        672      55       188    1697      1158    SUB_BUS      10

CodePudding user response：

This is actually really simple with pandas. Have a list of the columns to search in, and then use idxmax with axis=1:

# Filter out rows where `cnt` is less than or equal to 2
df = df[df['cnt'] > 2]

# Determine category for each row
search_cols = ['RECENT_MOV', 'RETIRED', 'SERV_EMPL', 'SUB_BUS', 'WORK_HOME']
df['category'] = df[search_cols].idxmax(axis=1)

# Assign row indexes
df['row_ind'] = df.index

Output:

>>> df
     a  b  cnt  RECENT_MOV  RETIRED  SERV_EMPL  SUB_BUS  WORK_HOME    category  row_ind
1    1  C   76         452       62         55      115        247  RECENT_MOV        1
2    7  E   14          50      165        337     1165        810     SUB_BUS        2
3    2  A   46         523      167        423      784        707     SUB_BUS        3
4    3  H    3          38      144        473      745        437     SUB_BUS        4
5    5  I   59         743      127        261      351        190  RECENT_MOV        5
6    8  J   76         143       49        470     1612        935     SUB_BUS        6
7    4  D   14         818      101        418     1919        314     SUB_BUS        7
8    6  F   46         714        9        446     1432        938     SUB_BUS        8
9   10  B    3         585      160         14      107        489  RECENT_MOV        9
10   9  G   59         814       73        449      937        287     SUB_BUS       10