How to select top N columns in a dataframe with a criteria-CodePudding

Here is my dataframe, it has high dimensionality (big number of columns) more than 10000 columns

The columns in my data are split into 3 categories

columns start with "Basic" columns end with "_T" and everything else

a sample of my dataframe is this

RowID   Basic1011 Basic2837 Lemon836 Car92_T Manf3953 Brat82 Basic383_T Jot112 ...
1       2         8         4        3       1        5      6          7
2       8         3         5        0       9        7      0          5

I want to have in my dataframe all "Basic" & "_T" columns and only TOP N (variable could be 3, 5, 10, 100, etc) of other columns

I have this code to give me top N for all columns. but what I am looking for just the top N for columns are not "Basic" or "_T"

and by Top I mean the greatest values

Top = 20
df = df.where(df.apply(lambda x: x.eq(x.nlargest(Top)), axis=1), 0)

How can I achieve that?

CodePudding user response：

Try something like this, you may have to play around with column selection on outset to be sure you're filtering correctly.

# this gives you column names with Basic or _T anywhere in the column name.
unwanted = df.filter(regex='Basic|_T').columns.tolist()

# the tilda takes the opposite of the criteria, so no Basic or _T
dfn = df[df.columns[~df.columns.isin(unwanted)]]

#apply your filter
Top = 2
df_ranked = dfn.where(dfn.apply(lambda x: x.eq(x.nlargest(Top)), axis=1), 0)

#then merge dfn with df_ranked

CodePudding user response：

Step 1: You can use .filter() with regex to filter the columns with the following 2 conditions:

start with "Basic", or
end with "_T"

The regex used is r'(?:^Basic)|(?:_T$)' where:

(?: ) is a non-capturing group of regex. It is for a temporary grouping.

^ is the start of text anchor to indicate start position of text

Basic matches with the text Basic (together with ^, this Basic must be at the beginning of column label)

| is the regex meta-character for or

_T matches the text _T

$ is the end of text anchor to indicate end of text position (together with _T, _T$ indicate _T at the end of column name.

We name these columns as cols_Basic_T

Step 2: Then, use Index.difference() to find others columns. We name these other columns as cols_others.

Step 3: Then, we apply the similar code you used to give you top N for all columns on these selected columns col_others.

Full set of codes:

## Step 1
cols_Basic_T = df.filter(regex=r'(?:^Basic)|(?:_T$)').columns

## Step 2
cols_others = df.columns.difference(cols_Basic_T)

## Step 3
#Top = 20 
Top = 3     # use fewer columns here for smaller sample data here
df_others = df[cols_others].where(df[cols_others].apply(lambda x: x.eq(x.nlargest(Top)), axis=1), 0)