I have the DataFrame below and am trying to get the names of the columns that contain empty strings, as efficiently as possible. The DataFrame looks like this after calling df.head():
     id type check company                                      test
    123    A  Soft          [[1649106820, 100029907158392,,,, 123]]
    456    B  Hard     GMC  [[1649106812, 100029907158312,,,, 456]]
I am trying to do this without loops, or otherwise in an efficient way. Any help is appreciated.
Expected output: {'company', 'test'}
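For reproducibility, the frame above can be built like this (a sketch that treats the blank company cell and the consecutive commas inside the nested test lists as empty strings):

    import pandas as pd

    # Sketch of the sample data: the blank company cell and the
    # consecutive commas inside the nested lists are assumed to be
    # empty strings.
    df = pd.DataFrame({
        "id": [123, 456],
        "type": ["A", "B"],
        "check": ["Soft", "Hard"],
        "company": ["", "GMC"],
        "test": [
            [[1649106820, 100029907158392, "", "", "", 123]],
            [[1649106812, 100029907158312, "", "", "", 456]],
        ],
    })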
CodePudding user response:
Use apply() to run the test on every value in a column, then use .any() to check whether the test is true for any row.
    def empty_val(val):
        # Recurse into lists so nested values (e.g. the list-of-lists
        # in the test column) are checked as well.
        if isinstance(val, list):
            return any(empty_val(item) for item in val)
        return str(val).strip() == ""

    empty_cols = [col for col in df if df[col].apply(empty_val).any()]
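Against the sample frame sketched in the question, this returns a list rather than a set, but it contains the expected columns:

    >>> empty_cols
    ['company', 'test']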
CodePudding user response:
Here's a way to find columns containing empty strings without testing individual value types:
    empty_cols = {col for col in df.columns
                  if df[col].explode().explode().astype(str).str.strip().eq('').any()}
Output:

    {'company', 'test'}
Explanation:
- Use explode() twice to ensure we've unpacked values that are lists of lists, such as those in column test in the question.
- Use astype(str) to convert numerical types to string, such as those in column id.
- Use str.strip().eq('') to get a boolean Series for a given column indicating whether a value (possibly unpacked) is an empty string (after stripping spaces).
- Use Series.any() to reduce the boolean Series to a single boolean indicating whether the column has any empty strings.
- Use a set comprehension to get a set containing the labels of all columns that contain empty strings (including after unpacking list-of-list values).
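To make the chain concrete, here's what the intermediate steps look like on the test column, assuming the sample frame sketched in the question:

    # assumes the sample df sketched in the question
    s = df['test']     # each value is a list of lists
    s = s.explode()    # first explode: one inner list per row
    s = s.explode()    # second explode: one scalar (or '') per row
    print(s.astype(str).str.strip().eq('').any())  # True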