I have a decent-sized dataset (37509 rows, 166 columns). I am currently trying to replace 0 in several columns based on a set of conditions. I kept getting a memory error until I changed that value, and now my kernel keeps crashing. My question is: is there a better way to write this code that avoids memory problems?
import numpy as np
import pandas as pd

df = pd.read_csv(".csv")
cols = list(df.select_dtypes(include=[np.number]).columns)
mask = (df["column1"] <= 0) & (df["column2"] == 0)
df.loc[mask, df[cols]] = np.nan
The two columns used for the mask are not included in the cols list, and I've also tried one column at a time. I run into a MemoryError every time. I've tried running it through Terality with the same issue.
The error is:
MemoryError: Unable to allocate 10.5 GiB for an array with shape (37509, 37509) and data type float64.
The following code does not work either (I understand why — it's the copy-vs-view problem), whether for the list of columns or an individual column:
df[mask][cols].replace(0, np.nan, inplace=True)
If anyone would be willing to help explain a solution or even just explain the problem, I would greatly appreciate it.
CodePudding user response:
DataFrame.loc accepts either booleans or labels:

    Access a group of rows and columns by label(s) or a boolean array.

Currently the column indexer is an entire dataframe, df[cols]:
df.loc[mask, df[cols]] = np.nan
#            ^^^^^^^^
Instead of df[cols], use just the cols list:
df.loc[mask, cols] = np.nan
#            ^^^^
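As a minimal runnable sketch of the fix, here is the same pattern on a tiny made-up frame (the column names column1, column2, a, b are placeholders; the real DataFrame and cols list come from the question). With a list of labels as the column indexer, pandas assigns in place without building a giant alignment array:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "column1": [-1, 2, 0],
    "column2": [0, 0, 5],
    "a": [0.0, 1.0, 2.0],
    "b": [3.0, 0.0, 4.0],
})
cols = ["a", "b"]  # numeric columns to blank out; excludes the mask columns

mask = (df["column1"] <= 0) & (df["column2"] == 0)
df.loc[mask, cols] = np.nan  # list of labels, not df[cols]

print(df)
```

Only the first row matches the mask here, so only its a and b values become NaN; the mask columns themselves are untouched.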