Here is my dataframe, it has high dimensionality (big number of columns) more than 10000 columns
The columns in my data are split into 3 categories
columns start with "Basic" columns end with "_T" and everything else
a sample of my dataframe is this
RowID Basic1011 Basic2837 Lemon836 Car92_T Manf3953 Brat82 Basic383_T Jot112 ...
1 2 8 4 3 1 5 6 7
2 8 3 5 0 9 7 0 5
I want to have in my dataframe all "Basic" & "_T" columns and only TOP N (variable could be 3, 5, 10, 100, etc) of other columns
I have this code to give me top N for all columns. but what I am looking for just the top N for columns are not "Basic" or "_T"
and by Top I mean the greatest values
Top = 20
df = df.where(df.apply(lambda x: x.eq(x.nlargest(Top)), axis=1), 0)
How can I achieve that?
CodePudding user response:
Try something like this, you may have to play around with column selection on outset to be sure you're filtering correctly.
# this gives you column names with Basic or _T anywhere in the column name.
unwanted = df.filter(regex='Basic|_T').columns.tolist()
# the tilda takes the opposite of the criteria, so no Basic or _T
dfn = df[df.columns[~df.columns.isin(unwanted)]]
#apply your filter
Top = 2
df_ranked = dfn.where(dfn.apply(lambda x: x.eq(x.nlargest(Top)), axis=1), 0)
#then merge dfn with df_ranked
CodePudding user response:
Step 1: You can use .filter()
with regex to filter the columns with the following 2 conditions:
- start with "Basic", or
- end with "_T"
The regex used is r'(?:^Basic)|(?:_T$)'
where:
(?: )
is a non-capturing group of regex. It is for a temporary grouping.
^
is the start of text anchor to indicate start position of text
Basic
matches with the text Basic
(together with ^
, this Basic
must be at the beginning of column label)
|
is the regex meta-character for or
_T
matches the text _T
$
is the end of text anchor to indicate end of text position (together with _T
, _T$
indicate _T
at the end of column name.
We name these columns as cols_Basic_T
Step 2: Then, use Index.difference()
to find others columns. We name these other columns as cols_others
.
Step 3: Then, we apply the similar code you used to give you top N for all columns on these selected columns col_others
.
Full set of codes:
## Step 1
cols_Basic_T = df.filter(regex=r'(?:^Basic)|(?:_T$)').columns
## Step 2
cols_others = df.columns.difference(cols_Basic_T)
## Step 3
#Top = 20
Top = 3 # use fewer columns here for smaller sample data here
df_others = df[cols_others].where(df[cols_others].apply(lambda x: x.eq(x.nlargest(Top)), axis=1), 0)