How to reduce the size of my dataframe in Python?


Working on an NLP problem, I ended up with a large feature dataframe:

dfMethod
Out[2]: 
      c0000167  c0000294  c0000545  ...  c4721555  c4759703  c4759772
0            0         0         0  ...         0         0         0
1            0         0         0  ...         0         0         0
2            0         0         0  ...         0         0         0
3            0         0         0  ...         0         0         0
4            0         0         0  ...         0         0         0
       ...       ...       ...  ...       ...       ...       ...
3995         0         0         0  ...         0         0         0
3996         0         0         0  ...         0         0         0
3997         0         0         0  ...         0         0         0
3998         0         0         0  ...         0         0         0
3999         0         0         0  ...         0         0         0

[4000 rows x 14317 columns]

I want to remove the columns with the fewest occurrences (i.e. the columns with the smallest sum across all records).

So if my column sums looked like this:

Sum of c0000167 = 7523
Sum of c0000294 = 8330
Sum of c0000545 = 502
Sum of c4721555 = 51
Sum of c4759703 = 9628

In the end, I want to keep only the top 5000 columns, ranked by column sum.

How can I do that?

CodePudding user response:

Let's say you have a big dataframe big_df. You can get the top columns with the following:

N = 5000
big_df[big_df.sum().sort_values(ascending=False).index[:N]]

Breaking this down:

big_df.sum()  # Gives the sums you mentioned
.sort_values(ascending=False)  # Sort the sums in descending order
.index  # because .sum() defaults to axis=0, the index is your columns
[:N]  # grab first N items
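To see the steps above in action, here is a runnable sketch on a tiny synthetic frame (the column names and N = 2 are made up for illustration):

```python
import pandas as pd

# Toy frame with three columns; we keep the top N = 2 by column sum
big_df = pd.DataFrame({
    "c0000167": [1, 1, 1],  # sum = 3
    "c0000294": [0, 1, 0],  # sum = 1
    "c0000545": [1, 1, 0],  # sum = 2
})

N = 2
# Sum each column, sort descending, take the first N column labels
top = big_df[big_df.sum().sort_values(ascending=False).index[:N]]
print(list(top.columns))  # ['c0000167', 'c0000545']
```

The same one-liner scales directly to the 4000 x 14317 frame in the question with N = 5000.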

CodePudding user response:

Edited after the author's comment. Let df be a pandas DataFrame. First, prepare the filter by selecting the 5000 columns with the largest sums:

df_sum = df.sum()  # compute once instead of calling df.sum() twice
# sort (column, sum) pairs by sum, descending, and keep the top 5000;
# note this keeps the 5000 largest columns, not columns whose sum exceeds 5000
co = sorted(zip(df_sum.keys(), df_sum.values), key=lambda row: row[1], reverse=True)[:5000]
# keep only the column names
co = [row[0] for row in co]

Then filter the dataframe down to the columns in co:

df = df.filter(items = co)
df
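As an aside, pandas Series also provide nlargest, which combines the sort and the slice into one call; a minimal sketch of the same filter, assuming a toy frame and keeping the top 2 columns:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 0, 1],  # sum = 2
    "b": [1, 1, 1],  # sum = 3
    "c": [0, 0, 0],  # sum = 0
})

# Keep the 2 columns with the largest sums (nlargest sorts descending)
keep = df.sum().nlargest(2).index
df = df[keep]
print(list(df.columns))  # ['b', 'a']
```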