I'm working on an NLP problem and ended up with a large feature DataFrame:
dfMethod
Out[2]:
c0000167 c0000294 c0000545 ... c4721555 c4759703 c4759772
0 0 0 0 ... 0 0 0
1 0 0 0 ... 0 0 0
2 0 0 0 ... 0 0 0
3 0 0 0 ... 0 0 0
4 0 0 0 ... 0 0 0
... ... ... ... ... ... ...
3995 0 0 0 ... 0 0 0
3996 0 0 0 ... 0 0 0
3997 0 0 0 ... 0 0 0
3998 0 0 0 ... 0 0 0
3999 0 0 0 ... 0 0 0
[4000 rows x 14317 columns]
I want to remove the columns with the fewest occurrences (i.e. the columns with the smallest sum over all records).
So if my column sums looked like this:
Sum of c0000167 = 7523
Sum of c0000294 = 8330
Sum of c0000545 = 502
Sum of c4721555 = 51
Sum of c4759703 = 9628
In the end, I only want to keep the top 5000 columns based on each column's sum.
How can I do that?
CodePudding user response:
Let's say you have a big DataFrame big_df.
You can get the top columns with the following:
N = 5000
big_df[big_df.sum().sort_values(ascending=False).index[:N]]
Breaking this down:
big_df.sum() # Gives the sums you mentioned
.sort_values(ascending=False) # Sort the sums in descending order
.index # because .sum() defaults to axis=0, the index is your columns
[:N] # grab first N items
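For example, here is a minimal sketch on a toy DataFrame (the column names and N below are made up for illustration):
import pandas as pd

# Toy stand-in for big_df with 0/1 feature counts (made-up column names)
big_df = pd.DataFrame({
    "c0000167": [1, 1, 0, 1],  # sum = 3
    "c0000294": [0, 1, 1, 0],  # sum = 2
    "c0000545": [0, 0, 0, 1],  # sum = 1
    "c4721555": [0, 0, 0, 0],  # sum = 0
})

N = 2  # keep only the 2 columns with the largest sums
top_cols = big_df.sum().sort_values(ascending=False).index[:N]
reduced = big_df[top_cols]
print(list(reduced.columns))  # ['c0000167', 'c0000294']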
CodePudding user response:
Edited after the author's comment. Let's consider df a pandas DataFrame. Prepare the filter by selecting the 5000 columns with the largest sums:
df_sum = df.sum()  # compute the column sums once instead of repeating df.sum()
# sort (column, sum) pairs by the sum in descending order and keep the top 5000
# (the top 5000 columns, not the columns whose sum exceeds 5000)
co = sorted(zip(df_sum.index, df_sum.values), key=lambda row: row[1], reverse=True)[:5000]
# keep only the column names
co = [row[0] for row in co]
Then filter the DataFrame to the columns in co:
df = df.filter(items = co)
df
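As a side note, if you prefer a one-liner, pandas' Series.nlargest should give an equivalent result for numeric columns like these (a sketch, not tested on the author's data):
df = df[df.sum().nlargest(5000).index]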