I have a pandas DataFrame (df) that I want to index so that only the columns whose sum is not zero are displayed. I am using .to_numpy().nonzero() to get a tuple of the non-zero indexes. I checked the pandas.DataFrame.iloc documentation and found that only arrays / lists of int are available for indexing, so I convert this tuple of non-zero indexes to a list:
import pandas as pd
#...
df = pd.read_table(f)
df_sum = df.sum(axis = 0)
df_no_0_tuple = df_sum.to_numpy().nonzero() #--> prints (array([ 7, 8, 9, 10, 11, 25, 26, 27, 28, 29, 31, 32, 36], dtype=int64),)
print(type(df_no_0_tuple)) #--> prints "<class 'tuple'>"
df_no_0 = list(df_no_0_tuple)
print(type(df_no_0)) #--> prints "<class 'list'>"
print(df_no_0) #--> prints [array([ 7, 8, 9, 10, 11, 25, 26, 27, 28, 29, 31, 32, 36], dtype=int64)]
df_final = df.iloc[:, df_no_0]
print(df_final)
As mentioned in the title: if I use df_no_0 as the iloc input, I get the error "ValueError: Buffer has wrong number of dimensions (expected 1, got 2)". My question is: does the df_no_0 list also include the "dtype=int64" part, which then causes the dimension error (expected 1, got 2)? If so, is there a way to remove the type information, or avoid creating it, when converting to a list?
If I use the tuple directly from to_numpy().nonzero(), I get the error "pandas.core.indexing.IndexingError: Too many indexers". I think the different error might occur because the tuple, unlike the list, now contains three indexers separated by commas. The question remains: how can I index the dataframe correctly using the output of to_numpy().nonzero(), or how can I transform that output into a valid input for .iloc indexing?
BTW: If I manually type the output of to_numpy().nonzero() as a plain list, the indexing works as expected. That would be tedious, though, when indexing multiple files whose non-zero columns are not known in advance.
Any help is greatly appreciated!
Thank you in advance!
CodePudding user response:
Could something like this work for you?
Example of dataframe where some columns sum up to zero:
df = pd.DataFrame([[1, 3, 0, 2, 0],
                   [3, 4, 0, 4, 0],
                   [5, 1, 0, 3, 0]],
                  columns=['a', 'b', 'c', 'd', 'e'])
To remove the columns that only add up to zero:
df.loc[:, (df.sum() != 0)]
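If you would rather keep the to_numpy().nonzero() approach from the question, here is a minimal sketch of how it might be made to work. As far as I can tell, the "dtype=int64" text is only part of how the array is printed and is not stored in the data. The likely cause of the dimension error is that nonzero() returns a tuple holding a single 1-D array, so list(...) produces a list containing that array (a nested, 2-D structure) instead of a flat list of ints. Taking element [0] of the tuple gives the 1-D array that iloc accepts. The names col_sums, positions, by_position and by_mask are just illustrative:

```python
import pandas as pd

df = pd.DataFrame([[1, 3, 0, 2, 0],
                   [3, 4, 0, 4, 0],
                   [5, 1, 0, 3, 0]],
                  columns=['a', 'b', 'c', 'd', 'e'])

col_sums = df.sum(axis=0)

# nonzero() returns a tuple with one array per dimension;
# a 1-D input therefore yields a 1-element tuple.
nonzero_result = col_sums.to_numpy().nonzero()

# Unpack the single 1-D array of integer positions.
positions = nonzero_result[0]

# Positional indexing with the flat array of ints.
by_position = df.iloc[:, positions]

# Equivalent selection using a boolean mask, as in the answer above.
by_mask = df.loc[:, col_sums != 0]

print(by_position)
```

Both selections should keep columns a, b and d here. Note that either way a column whose positive and negative values cancel out to zero would also be dropped, since only the column sum is checked.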