I need to clean up a dataframe whose columns come from different sources and have different types. This means that I can have, for example, string columns that contain "nan", "none", or "NULL" (as a string instead of a None value).
My goal is to find all empty values and replace them with None. This works fine:
for column in df.columns:
    for idx, row in df.iterrows():
        if (str(row[column]).lower() == "none") or (str(row[column]).lower() == "nan") or (str(row[column]).lower() == "null"):
            df.at[row.name, column] = None
But it is obviously not the best or fastest way to do it. How can I take advantage of Pandas operations or list comprehensions to do this substitution? Thanks!
CodePudding user response:
Simple approach: use isin and mask:
import pandas as pd

df = pd.DataFrame([[1, 2, 'nan'],
                   ['none', 3, 'NULL']])
df_clean = df.mask(df.isin(["nan", "none", "NULL"]))
Or, if you want to update in place:
df[df.isin(["nan", "none", "NULL"])] = float('nan')
Output:
     0  1    2
0    1  2  NaN
1  NaN  3  NaN
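Note that isin matches exact strings, so the list has to include every casing you expect. If you want the case-insensitive matching of the original loop, one possible sketch (the intermediate `lowered` frame is my own construction, not part of the answer above) is:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 'NaN'],
                   ['None', 3, 'null']])

# Lower-case a stringified copy of every cell, then mask wherever it matches.
# df.astype(str) makes .str.lower() safe on numeric columns as well.
lowered = df.astype(str).apply(lambda s: s.str.lower())
df_clean = df.mask(lowered.isin(["nan", "none", "null"]))
```

The mask is computed on the lower-cased copy but applied to the original frame, so the surviving values keep their original types and casing.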
CodePudding user response:
If you want to use NumPy, you could do this as well (provided the values in the fields really are strings):
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'name': ['one', 'two', 'one', 'two'],
    'A': ['null', 'none', 'empty', 'Keep']
})

df['A'] = np.where(df['A'].isin(['null', 'none', 'empty']), '', df['A'])
df
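The snippet above inserts empty strings, whereas the question asks for missing values. As a variation of my own (not from the answer), the same np.where pattern works with np.nan in place of '':

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['one', 'two', 'one', 'two'],
    'A': ['null', 'none', 'empty', 'Keep']
})

# Same np.where pattern, but insert NaN instead of an empty string.
# Because df['A'] is an object column, the NaN survives as a real
# missing value rather than being coerced to the string 'nan'.
df['A'] = np.where(df['A'].isin(['null', 'none', 'empty']), np.nan, df['A'])
```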
CodePudding user response:
A quick and easy optimization:
for column in df.columns:
    for idx, row in df.iterrows():
        value = str(row[column]).lower()
        if value == "none" or value == "nan" or value == "null":
            df.at[row.name, column] = None
No need to convert row[column] to a str and lower-case it three times per cell.
Shorter code:
its_none = ['none', 'nan', 'null']
for column in df.columns:
    for idx, row in df.iterrows():
        if str(row[column]).lower() in its_none:
            df.at[row.name, column] = None
Even shorter (assuming the values you want to keep are numbers) and faster:
for column in df.columns:
    for idx, row in df.iterrows():
        if str(row[column]).lower().startswith('n'):
            df.at[row.name, column] = None
CodePudding user response:
This seems to be a somewhat controversial topic (see e.g. this thread) but it's often said that list comprehensions are more computationally efficient than for loops, especially when iterating over pandas dataframes.
I also prefer using list comprehensions stylistically as it leads to fewer levels of indentation from nested loops/if statements.
Here's what it looks like for your use case:
for column in df.columns:
    vals_list = df[column].to_list()
    replaced = [None if str(x).lower() in ['nan', 'none', 'null'] else x for x in vals_list]
    df[column] = replaced
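As a self-contained check (the sample frame is my own), the list comprehension behaves like the original nested loop, including on non-string cells:

```python
import pandas as pd

df = pd.DataFrame({'a': ['NULL', 'keep'], 'b': [1, 'nan']})

for column in df.columns:
    vals_list = df[column].to_list()
    # str(x) guards against non-string cells such as the integer 1.
    replaced = [None if str(x).lower() in ['nan', 'none', 'null'] else x
                for x in vals_list]
    df[column] = replaced
```

After the loop, 'NULL' and 'nan' have become missing values while 'keep' and 1 are untouched.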