Replace data in the whole Dataframe with a condition


I want to replace every element in a pandas DataFrame (all columns and all rows) with an empty string if it contains a question mark. I'm curious what the best solution for this is.

What I thought of is to write a loop like this:

def modify_dataframe_line_by_line(df) -> None:
    for index, record in df.iterrows():
        for colname in df.columns:
            if "?" in record[colname]:
                record[colname] = ""

It works, but I assume this will be very slow on larger datasets.

I also tried this one, but it does not work:

def df_loc_replace(df) -> None:
    for colname in df.columns:
        df.loc["?" in df[colname], colname] = ""

I also tried df.replace(), but I did not find an option to add conditions to it (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html).

What is the best solution to this?

CodePudding user response:

The accepted solution won't work for DataFrames containing numeric/datetime/categorical columns: it raises AttributeError: Can only use .str accessor with string values!.

Here is a more generic, vectorized solution which should also work for DataFrames containing numeric/datetime/categorical columns.

demo:

import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict(
    {'c1': {0: 1, 1: 2, 2: 3},
     'c2': {0: 'aaa', 1: 'what?', 2: 'another string'},
     'c3': {0: 10, 1: 11, 2: 12},
     'c4': {0: 'str', 1: np.nan, 2: '??'}}
)
df

output:

In [77]: df
Out[77]:
   c1              c2  c3   c4
0   1             aaa  10  str
1   2           what?  11  NaN
2   3  another string  12   ??

solution:

# select only the object-dtype (string-like) columns
str_cols = df.columns[df.dtypes.eq("object")]
# blank out every cell whose string contains a question mark
df.loc[:, str_cols] = df[str_cols].replace(r".*\?.*", "", regex=True)

result:

In [80]: df
Out[80]:
   c1              c2  c3   c4
0   1             aaa  10  str
1   2                  11  NaN
2   3  another string  12

NOTE: this solution won't work properly if your DataFrame contains non-scalar values in cells (which the pandas developers recommend against), because such columns also have object dtype.
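As a self-contained sketch of the same approach (the sample data mirrors the demo above; `select_dtypes` is just an equivalent way to pick the object-dtype columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "c1": [1, 2, 3],
    "c2": ["aaa", "what?", "another string"],
    "c4": ["str", np.nan, "??"],
})

# pick the string-like (object-dtype) columns only
str_cols = df.select_dtypes(include="object").columns
# regex replace leaves NaN and the numeric column untouched
df[str_cols] = df[str_cols].replace(r".*\?.*", "", regex=True)
```

After this, `df.loc[1, "c2"]` and `df.loc[2, "c4"]` are empty strings, while `c1` and the NaN in `c4` are unchanged.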

CodePudding user response:

Try this:

import numpy as np

for colname in df.columns:
    # na=False treats missing values as "no match" instead of NaN
    df[colname] = np.where(df[colname].str.contains(r'\?', na=False), '', df[colname])
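A caveat with this loop (a hedged sketch, with illustrative column names): `.str` raises an AttributeError on non-string columns, so a guarded variant only touches object-dtype columns:

```python
import pandas as pd

df = pd.DataFrame({
    "num": [1, 2],
    "txt": ["ok?", None],
})

for colname in df.columns:
    if df[colname].dtype == object:  # only string-like columns support .str
        # na=False keeps missing values out of the match mask
        mask = df[colname].str.contains(r"\?", na=False)
        df[colname] = df[colname].mask(mask, "")
```

Here `"ok?"` becomes an empty string, while the numeric column and the missing value are left alone.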