I want to replace every element in a pandas dataframe with an empty string (all columns and all records) if they contain a question mark. I am curious what is a best solution for this.
What I thought of is to write a loop like this:
def modify_dataframe_line_by_line(df) -> None:
for index, record in df.iterrows():
for colname in df.columns.tolist():
if "?" in record[colname]:
record[colname] = ""
It works, but I assume this will be slow as hell with larger datasets.
I also tried this one but it does not work:
def df_loc_replace(df) -> None:
for colname in df.columns.tolist():
df.loc["?" in df[colname], colname] = ""
I also tried df.replace()
but I did not find an option to add conditions to that (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html)
What is the best solution to this?
CodePudding user response:
The accepted solution won't work for DataFrames containing numeric/datetime/categorical data types. It will cause the AttributeError: Can only use .str accessor with string values!
exception.
Here is a more generic and vectorized solution which should work also for DataFrames containing numeric/datetime/categorical data types.
demo:
df = pd.DataFrame.from_dict(
{'c1': {0: 1, 1: 2, 2: 3},
'c2': {0: 'aaa', 1: 'what?', 2: 'another string'},
'c3': {0: 10, 1: 11, 2: 12},
'c4': {0: 'str', 1: np.nan, 2: '??'}
})
df
output:
In [77]: df
Out[77]:
c1 c2 c3 c4
0 1 aaa 10 str
1 2 what? 11 NaN
2 3 another string 12 ??
solution:
str_cols = df.columns[df.dtypes.eq("object")]
df.loc[:, str_cols] = df[str_cols].replace(r".*\?.*", "", regex=True)
result:
In [80]: df
Out[80]:
c1 c2 c3 c4
0 1 aaa 10 str
1 2 11 NaN
2 3 another string 12
NOTE: this solution won't work properly in case your DataFrame contains non-scalar values in cells (which is not recommended by the Pandas developers), because such columns would also have object
dtype.
CodePudding user response:
Try this:
import numpy as np
for colname in df.columns.tolist():
df[colname] = np.where(df[colname].str.contains('\?'), '', df[colname])