Home > Blockchain >  Python Pandas: How to print all columns in the data frame where all the values are 0
Python Pandas: How to print all columns in the data frame where all the values are 0

Time:09-04

Python newbie! I have a data frame (csv file) with around 30 columns. I am trying to get the list of the columns in the data frame which have all the values as 0. I have gone through few examples of how to iterate over all the columns in the data frame from here: https://sparkbyexamples.com/pandas/pandas-iterate-over-columns-of-dataframe-to-run-regression/ but unable to figure out a way to print all the column names which have "All" values as 0. I want to get the idea of the columns so that i can take the next steps appropriately.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


%matplotlib inline
sns.set(color_codes = True)
pd.set_option('display.max_columns', None)

# Load loan.csv
data = pd.read_csv('loan.csv',keep_default_na=False)

CodePudding user response:

all_zero_cols = df.columns[df.eq(0).all()].tolist()
  • query if equal to 0 or not
  • collapse per column with all semantic
  • gives a True/False Series where index is the column names
  • mask the columns with that

(if one wants to check against multiple values than only 0, there's .isin, e.g., replacing .eq(0) with .isin([0, 1]).)

Sample run:

In [135]: df
Out[135]:
   item  month  sales
0     0      1      0
1     0      2      0
2     0      3      0
3     0      2      0
4     0      0      0
5     0      3      0
6     0      4      0
7     0      0      0

In [136]: df.eq(0)
Out[136]:
   item  month  sales
0  True  False   True
1  True  False   True
2  True  False   True
3  True  False   True
4  True   True   True
5  True  False   True
6  True  False   True
7  True   True   True

In [137]: df.eq(0).all()
Out[137]:
item      True
month    False
sales     True
dtype: bool

In [138]: is_all_zero = df.eq(0).all()

In [139]: df.columns[is_all_zero].tolist()
Out[139]: ["item", "sales"]

(somewhat changed the code inspired from @Kelvin (thanks) but the logic remains the same. On a 800_000 x 75 frame, this code (or the previos version equally) seems to be ~80 times faster than the apply-based solution...

In [169]: df
Out[169]:
        0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  ...  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74
0        0   1   0   0   1   0   0   1   0   0   1   0   0   1   0   0   1   0  ...   0   1   0   0   1   0   0   1   0   0   1   0   0   1   0   0   1   0
1        0   2   0   0   2   0   0   2   0   0   2   0   0   2   0   0   2   0  ...   0   2   0   0   2   0   0   2   0   0   2   0   0   2   0   0   2   0
2        0   3   0   0   3   0   0   3   0   0   3   0   0   3   0   0   3   0  ...   0   3   0   0   3   0   0   3   0   0   3   0   0   3   0   0   3   0
3        0   2   0   0   2   0   0   2   0   0   2   0   0   2   0   0   2   0  ...   0   2   0   0   2   0   0   2   0   0   2   0   0   2   0   0   2   0
4        0   1   0   0   1   0   0   1   0   0   1   0   0   1   0   0   1   0  ...   0   1   0   0   1   0   0   1   0   0   1   0   0   1   0   0   1   0
...     ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ...  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..
799995   0   2   0   0   2   0   0   2   0   0   2   0   0   2   0   0   2   0  ...   0   2   0   0   2   0   0   2   0   0   2   0   0   2   0   0   2   0
799996   0   1   0   0   1   0   0   1   0   0   1   0   0   1   0   0   1   0  ...   0   1   0   0   1   0   0   1   0   0   1   0   0   1   0   0   1   0
799997   0   3   0   0   3   0   0   3   0   0   3   0   0   3   0   0   3   0  ...   0   3   0   0   3   0   0   3   0   0   3   0   0   3   0   0   3   0
799998   0   4   0   0   4   0   0   4   0   0   4   0   0   4   0   0   4   0  ...   0   4   0   0   4   0   0   4   0   0   4   0   0   4   0   0   4   0
799999   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  ...   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

[800000 rows x 75 columns]

In [170]: %timeit df.columns[df.apply(lambda x: sum(x) == 0).values].tolist()
4.7 s ± 477 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [171]: %timeit df.columns[df.eq(0).all()]
57.8 ms ± 1.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [172]: %timeit df.eq(0).all().loc[lambda m: m].index.tolist()
54.2 ms ± 941 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [173]: 4.7 * 1000 / 57.8
Out[173]: 81.31487889273357

CodePudding user response:

As a one liner:

df[df.columns[df.apply(lambda x: sum(x) == 0).values]]

The performance seems to be better on this solution (disclaimer: tested on tiny dataset):

%%timeit
df[df.columns[df.apply(lambda x: sum(x) == 0).values]]
1.5 ms ± 63.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
df.eq(0).all().loc[lambda m: m].index.tolist()
2.34 ms ± 85.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

CodePudding user response:

Try this:

for col in data.columns:
    if len(data) == len(data[data[col]==0]):
        print(col)

CodePudding user response:

Not a performant solution but should be fine for your use-case. Iterates over each column, creates a set of the values in that column, prints out the column header if the value set matches a set with just zero in it:-

df = pd.DataFrame([
    [1,1,1,1,0,1],
    [0,1,1,1,0,0],
    [1,1,0,0,0,1],
    [1,1,0,1,0,1],
    [1,1,0,1,0,1]
], columns=['A', 'B', 'C', 'D', 'E', 'F'])


for col in df.columns:
    if set(df[col].values) == {0}:
        print(col)

Can this be extrapolated to find columns having values as 0 or 1?

This modification works to find the columns that are entirely made of the integer 0 or columns made entirely of the interger 1.

for col in df.columns:
    if (set(df[col].values) == {0}) or (set(df[col].values) == {1}):
        print(col)
  • Related