Python Pandas: How to print all columns in the data frame where all the values are 0

Python newbie! I have a data frame (loaded from a CSV file) with around 30 columns. I am trying to get a list of the columns in which every value is 0. I have gone through a few examples of how to iterate over the columns of a data frame here: https://sparkbyexamples.com/pandas/pandas-iterate-over-columns-of-dataframe-to-run-regression/ but I am unable to figure out how to print the names of the columns in which all values are 0. I want to identify these columns so that I can take the next steps appropriately.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


%matplotlib inline
sns.set(color_codes = True)
pd.set_option('display.max_columns', None)

# Load loan.csv
data = pd.read_csv('loan.csv',keep_default_na=False)

CodePudding user response：

all_zero_cols = df.columns[df.eq(0).all()].tolist()

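df.eq(0) checks, element by element, whether each value equals 0
.all() collapses that result per column: True only if every value in the column is True
the result is a True/False Series whose index is the column names
that Series is used as a mask to select from df.columns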

(To check against several values rather than only 0, there is .isin: for example, replace .eq(0) with .isin([0, 1]).)
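
To make the .isin variant concrete, here is a small sketch on a made-up frame (the column names and values are illustrative, not taken from loan.csv). Note that .isin([0, 1]).all() flags columns in which each value is either 0 or 1, so it also matches columns that mix the two:

```python
import pandas as pd

# Hypothetical toy frame, for illustration only.
df = pd.DataFrame({
    "a": [0, 0, 0],   # all zeros
    "b": [0, 1, 1],   # mix of 0 and 1
    "c": [1, 2, 0],   # contains a value outside {0, 1}
})

# Columns where every value equals 0.
all_zero_cols = df.columns[df.eq(0).all()].tolist()

# Columns where every value is either 0 or 1 (mixed columns included).
zero_or_one_cols = df.columns[df.isin([0, 1]).all()].tolist()

print(all_zero_cols)
print(zero_or_one_cols)
```

Here only "a" is all zeros, while both "a" and "b" pass the 0-or-1 check.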

Sample run:

In [135]: df
Out[135]:
   item  month  sales
0     0      1      0
1     0      2      0
2     0      3      0
3     0      2      0
4     0      0      0
5     0      3      0
6     0      4      0
7     0      0      0

In [136]: df.eq(0)
Out[136]:
   item  month  sales
0  True  False   True
1  True  False   True
2  True  False   True
3  True  False   True
4  True   True   True
5  True  False   True
6  True  False   True
7  True   True   True

In [137]: df.eq(0).all()
Out[137]:
item      True
month    False
sales     True
dtype: bool

In [138]: is_all_zero = df.eq(0).all()

In [139]: df.columns[is_all_zero].tolist()
Out[139]: ['item', 'sales']
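
One thing to watch for with the read_csv call in the question: with keep_default_na=False, blank cells are kept as empty strings, so a column containing blanks is typically read as text, and .eq(0) will not match the string '0'. A possible workaround, sketched on an in-memory CSV (the data, and the choice to turn unparseable cells into NaN, are assumptions), is to coerce to numbers first:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for loan.csv.
csv_text = "x,y,z\n0,0,1\n0,,2\n0,0,3\n"
data = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# Coerce every column to numeric; cells that cannot be parsed (such as
# the blank in column y) become NaN.
numeric = data.apply(pd.to_numeric, errors="coerce")

# NaN is not equal to 0, so a column with blanks is not reported as all-zero.
all_zero_cols = numeric.columns[numeric.eq(0).all()].tolist()
print(all_zero_cols)
```

If blanks should count as zeros instead, numeric.fillna(0) before the .eq(0) check would treat them that way.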

(I changed the code somewhat, inspired by @Kelvin (thanks), but the logic remains the same.) On an 800_000 x 75 frame, this code (and equally the previous version) seems to be ~80 times faster than the apply-based solution:

In [169]: df
Out[169]:
        0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  ...  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74
0        0   1   0   0   1   0   0   1   0   0   1   0   0   1   0   0   1   0  ...   0   1   0   0   1   0   0   1   0   0   1   0   0   1   0   0   1   0
1        0   2   0   0   2   0   0   2   0   0   2   0   0   2   0   0   2   0  ...   0   2   0   0   2   0   0   2   0   0   2   0   0   2   0   0   2   0
2        0   3   0   0   3   0   0   3   0   0   3   0   0   3   0   0   3   0  ...   0   3   0   0   3   0   0   3   0   0   3   0   0   3   0   0   3   0
3        0   2   0   0   2   0   0   2   0   0   2   0   0   2   0   0   2   0  ...   0   2   0   0   2   0   0   2   0   0   2   0   0   2   0   0   2   0
4        0   1   0   0   1   0   0   1   0   0   1   0   0   1   0   0   1   0  ...   0   1   0   0   1   0   0   1   0   0   1   0   0   1   0   0   1   0
...     ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ...  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..
799995   0   2   0   0   2   0   0   2   0   0   2   0   0   2   0   0   2   0  ...   0   2   0   0   2   0   0   2   0   0   2   0   0   2   0   0   2   0
799996   0   1   0   0   1   0   0   1   0   0   1   0   0   1   0   0   1   0  ...   0   1   0   0   1   0   0   1   0   0   1   0   0   1   0   0   1   0
799997   0   3   0   0   3   0   0   3   0   0   3   0   0   3   0   0   3   0  ...   0   3   0   0   3   0   0   3   0   0   3   0   0   3   0   0   3   0
799998   0   4   0   0   4   0   0   4   0   0   4   0   0   4   0   0   4   0  ...   0   4   0   0   4   0   0   4   0   0   4   0   0   4   0   0   4   0
799999   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  ...   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

[800000 rows x 75 columns]

In [170]: %timeit df.columns[df.apply(lambda x: sum(x) == 0).values].tolist()
4.7 s ± 477 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [171]: %timeit df.columns[df.eq(0).all()]
57.8 ms ± 1.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [172]: %timeit df.eq(0).all().loc[lambda m: m].index.tolist()
54.2 ms ± 941 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [173]: 4.7 * 1000 / 57.8
Out[173]: 81.31487889273357
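
To rerun this comparison outside IPython, something like the following sketch could be used. It relies on the standard timeit module and a much smaller frame than the one above (the size and the random fill values are assumptions, chosen so the script finishes quickly); the absolute timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller stand-in for the 800_000 x 75 frame (size is an assumption).
# Columns 1, 4, 7, ... hold values from 1 to 4; the rest are all zeros.
n_rows, n_cols = 20_000, 75
values = np.zeros((n_rows, n_cols), dtype=np.int64)
values[:, 1::3] = np.random.default_rng(0).integers(1, 5, size=(n_rows, 1))
df = pd.DataFrame(values)


def with_apply():
    # apply-based check: Python-level sum over each column
    return df.columns[df.apply(lambda x: sum(x) == 0).values].tolist()


def with_eq_all():
    # vectorized check: elementwise comparison, then per-column all()
    return df.columns[df.eq(0).all()].tolist()


# Both approaches pick the same columns on this non-negative data.
assert with_apply() == with_eq_all()

for fn in (with_apply, with_eq_all):
    seconds = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms")
```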

CodePudding user response：

As a one-liner (this returns the matching columns themselves rather than just their names):

df[df.columns[df.apply(lambda x: sum(x) == 0).values]]

The performance seems to be better with this solution (disclaimer: tested on a tiny dataset):

%%timeit
df[df.columns[df.apply(lambda x: sum(x) == 0).values]]
1.5 ms ± 63.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
df.eq(0).all().loc[lambda m: m].index.tolist()
2.34 ms ± 85.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
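
A caveat on the sum-based test: a column can sum to 0 without being all zeros when it holds negative values that cancel out. A small sketch on a made-up frame (the values are illustrative) shows where the two checks differ:

```python
import pandas as pd

# Hypothetical frame: column "net" sums to 0 but is not all zeros.
df = pd.DataFrame({
    "zeros": [0, 0, 0, 0],
    "net": [5, -5, 2, -2],
    "ones": [1, 1, 1, 1],
})

# Sum-based check: also flags "net", because its values cancel out.
by_sum = df.columns[df.apply(lambda x: sum(x) == 0).values].tolist()

# Elementwise check: flags only columns where every value is 0.
by_eq = df.columns[df.eq(0).all()].tolist()

print(by_sum)
print(by_eq)
```

If the data is known to be non-negative, both checks select the same columns.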

CodePudding user response：

Try this:

for col in data.columns:
    if len(data) == len(data[data[col]==0]):
        print(col)

CodePudding user response：

Not a performant solution, but it should be fine for your use case. It iterates over each column, builds a set of the values in that column, and prints the column header if that set equals a set containing just zero:

df = pd.DataFrame([
    [1,1,1,1,0,1],
    [0,1,1,1,0,0],
    [1,1,0,0,0,1],
    [1,1,0,1,0,1],
    [1,1,0,1,0,1]
], columns=['A', 'B', 'C', 'D', 'E', 'F'])


for col in df.columns:
    if set(df[col].values) == {0}:
        print(col)

Can this be extended to find columns whose values are 0 or 1?

This modification finds the columns made up entirely of the integer 0 or entirely of the integer 1.

for col in df.columns:
    if (set(df[col].values) == {0}) or (set(df[col].values) == {1}):
        print(col)
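
The same idea can be generalized so that the set comparison does not have to be repeated for each value. The helper below is only a sketch; its name constant_columns and the allowed parameter are made up for illustration:

```python
import pandas as pd


def constant_columns(df, allowed):
    """Return names of columns that hold a single repeated value,
    where that value is one of `allowed` (hypothetical helper)."""
    allowed = set(allowed)
    found = []
    for col in df.columns:
        unique_values = set(df[col].values)
        # Exactly one distinct value, and it is among the allowed ones.
        if len(unique_values) == 1 and unique_values <= allowed:
            found.append(col)
    return found


df = pd.DataFrame([
    [1, 1, 1, 1, 0, 1],
    [0, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 1],
    [1, 1, 0, 1, 0, 1],
    [1, 1, 0, 1, 0, 1],
], columns=['A', 'B', 'C', 'D', 'E', 'F'])

print(constant_columns(df, [0]))
print(constant_columns(df, [0, 1]))
```

On this frame, E is the only all-zero column and B the only all-one column.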