Python newbie! I have a data frame (csv file) with around 30 columns. I am trying to get the list of the columns in the data frame which have all the values as 0. I have gone through few examples of how to iterate over all the columns in the data frame from here: https://sparkbyexamples.com/pandas/pandas-iterate-over-columns-of-dataframe-to-run-regression/ but unable to figure out a way to print all the column names which have "All" values as 0. I want to get the idea of the columns so that i can take the next steps appropriately.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(color_codes = True)
pd.set_option('display.max_columns', None)
# Load loan.csv
data = pd.read_csv('loan.csv',keep_default_na=False)
CodePudding user response:
all_zero_cols = df.columns[df.eq(0).all()].tolist()
- query if equal to 0 or not
- collapse per column with
all
semantic - gives a True/False Series where index is the column names
- mask the columns with that
(if one wants to check against multiple values than only 0, there's .isin
, e.g., replacing .eq(0)
with .isin([0, 1])
.)
Sample run:
In [135]: df
Out[135]:
item month sales
0 0 1 0
1 0 2 0
2 0 3 0
3 0 2 0
4 0 0 0
5 0 3 0
6 0 4 0
7 0 0 0
In [136]: df.eq(0)
Out[136]:
item month sales
0 True False True
1 True False True
2 True False True
3 True False True
4 True True True
5 True False True
6 True False True
7 True True True
In [137]: df.eq(0).all()
Out[137]:
item True
month False
sales True
dtype: bool
In [138]: is_all_zero = df.eq(0).all()
In [139]: df.columns[is_all_zero].tolist()
Out[139]: ["item", "sales"]
(somewhat changed the code inspired from @Kelvin (thanks) but the logic remains the same. On a 800_000 x 75 frame, this code (or the previos version equally) seems to be ~80 times faster than the apply-based solution...
In [169]: df
Out[169]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 ... 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 ... 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0
1 0 2 0 0 2 0 0 2 0 0 2 0 0 2 0 0 2 0 ... 0 2 0 0 2 0 0 2 0 0 2 0 0 2 0 0 2 0
2 0 3 0 0 3 0 0 3 0 0 3 0 0 3 0 0 3 0 ... 0 3 0 0 3 0 0 3 0 0 3 0 0 3 0 0 3 0
3 0 2 0 0 2 0 0 2 0 0 2 0 0 2 0 0 2 0 ... 0 2 0 0 2 0 0 2 0 0 2 0 0 2 0 0 2 0
4 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 ... 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0
... .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ... .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
799995 0 2 0 0 2 0 0 2 0 0 2 0 0 2 0 0 2 0 ... 0 2 0 0 2 0 0 2 0 0 2 0 0 2 0 0 2 0
799996 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 ... 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0
799997 0 3 0 0 3 0 0 3 0 0 3 0 0 3 0 0 3 0 ... 0 3 0 0 3 0 0 3 0 0 3 0 0 3 0 0 3 0
799998 0 4 0 0 4 0 0 4 0 0 4 0 0 4 0 0 4 0 ... 0 4 0 0 4 0 0 4 0 0 4 0 0 4 0 0 4 0
799999 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[800000 rows x 75 columns]
In [170]: %timeit df.columns[df.apply(lambda x: sum(x) == 0).values].tolist()
4.7 s ± 477 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [171]: %timeit df.columns[df.eq(0).all()]
57.8 ms ± 1.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [172]: %timeit df.eq(0).all().loc[lambda m: m].index.tolist()
54.2 ms ± 941 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [173]: 4.7 * 1000 / 57.8
Out[173]: 81.31487889273357
CodePudding user response:
As a one liner:
df[df.columns[df.apply(lambda x: sum(x) == 0).values]]
The performance seems to be better on this solution (disclaimer: tested on tiny dataset):
%%timeit
df[df.columns[df.apply(lambda x: sum(x) == 0).values]]
1.5 ms ± 63.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
df.eq(0).all().loc[lambda m: m].index.tolist()
2.34 ms ± 85.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
CodePudding user response:
Try this:
for col in data.columns:
if len(data) == len(data[data[col]==0]):
print(col)
CodePudding user response:
Not a performant solution but should be fine for your use-case. Iterates over each column, creates a set of the values in that column, prints out the column header if the value set matches a set with just zero in it:-
df = pd.DataFrame([
[1,1,1,1,0,1],
[0,1,1,1,0,0],
[1,1,0,0,0,1],
[1,1,0,1,0,1],
[1,1,0,1,0,1]
], columns=['A', 'B', 'C', 'D', 'E', 'F'])
for col in df.columns:
if set(df[col].values) == {0}:
print(col)
Can this be extrapolated to find columns having values as 0 or 1?
This modification works to find the columns that are entirely made of the integer 0 or columns made entirely of the interger 1.
for col in df.columns:
if (set(df[col].values) == {0}) or (set(df[col].values) == {1}):
print(col)