I need to clean up a dataframe whose columns come from different sources and have different types. This means that I can have, for example, string columns that contain "nan", "none", or "NULL" (as a string instead of a None value).
My goal is to find all empty values and replace them with None. This works fine:
for column in df.columns:
    for idx, row in df.iterrows():
        if (str(row[column]).lower() == "none") or (str(row[column]).lower() == "nan") or (str(row[column]).lower() == "null"):
            df.at[row.name, column] = None
But it is obviously not the best or fastest way to do it. How can I take advantage of Pandas operations or list comprehensions to do this substitution? Thanks!
CodePudding user response:
Simple approach: use isin and mask:
import pandas as pd

df = pd.DataFrame([[1, 2, 'nan'],
                   ['none', 3, 'NULL']])
df_clean = df.mask(df.isin(["nan", "none", "NULL"]))
Or, if you want to update in place:
df[df.isin(["nan", "none", "NULL"])] = float('nan')
Output:
     0  1    2
0    1  2  NaN
1  NaN  3  NaN
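Note that isin matches exact strings, so the list has to include every casing you expect. If you want the case-insensitive matching of the original loop, one possible sketch (the intermediate `lowered` frame is my own construction, not part of the answer above) is:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 'NaN'],
                   ['None', 3, 'null']])

# Lower-case a stringified copy of every cell, then mask wherever it matches.
# df.astype(str) makes .str.lower() safe on numeric columns as well.
lowered = df.astype(str).apply(lambda s: s.str.lower())
df_clean = df.mask(lowered.isin(["nan", "none", "null"]))
```

The mask is computed on the lower-cased copy but applied to the original frame, so the surviving values keep their original types and casing.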
CodePudding user response:
If you want to use NumPy, you could do this as well (provided the values in the fields really are strings):
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'name': ['one', 'two', 'one', 'two'],
    'A': ['null', 'none', 'empty', 'Keep']
})

df['A'] = np.where(df['A'].isin(['null', 'none', 'empty']), '', df['A'])
df
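The snippet above inserts empty strings, whereas the question asks for missing values. As a variation of my own (not from the answer), the same np.where pattern works with np.nan in place of '':

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['one', 'two', 'one', 'two'],
    'A': ['null', 'none', 'empty', 'Keep']
})

# Same np.where pattern, but insert NaN instead of an empty string.
# Because df['A'] is an object column, the NaN survives as a real
# missing value rather than being coerced to the string 'nan'.
df['A'] = np.where(df['A'].isin(['null', 'none', 'empty']), np.nan, df['A'])
```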
CodePudding user response:
A quick and easy optimization:
for column in df.columns:
    for idx, row in df.iterrows():
        value = str(row[column]).lower()
        if value == "none" or value == "nan" or value == "null":
            df.at[row.name, column] = None
No need to convert row[column] to a str and lower-case it three times per cell.
Shorter code:
its_none = ['none', 'nan', 'null']
for column in df.columns:
    for idx, row in df.iterrows():
        if str(row[column]).lower() in its_none:
            df.at[row.name, column] = None
Even shorter (assuming the values you want to keep are numbers) and faster:
for column in df.columns:
    for idx, row in df.iterrows():
        if str(row[column]).lower().startswith('n'):
            df.at[row.name, column] = None
CodePudding user response:
This seems to be a somewhat controversial topic (see e.g. this thread) but it's often said that list comprehensions are more computationally efficient than for loops, especially when iterating over pandas dataframes.
I also prefer using list comprehensions stylistically as it leads to fewer levels of indentation from nested loops/if statements.
Here's what it looks like for your use case:
for column in df.columns:
    vals_list = df[column].to_list()
    replaced = [None if str(x).lower() in ['nan', 'none', 'null'] else x for x in vals_list]
    df[column] = replaced
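As a self-contained check (the sample frame is my own), the list comprehension behaves like the original nested loop, including on non-string cells:

```python
import pandas as pd

df = pd.DataFrame({'a': ['NULL', 'keep'], 'b': [1, 'nan']})

for column in df.columns:
    vals_list = df[column].to_list()
    # str(x) guards against non-string cells such as the integer 1.
    replaced = [None if str(x).lower() in ['nan', 'none', 'null'] else x
                for x in vals_list]
    df[column] = replaced
```

After the loop, 'NULL' and 'nan' have become missing values while 'keep' and 1 are untouched.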