I'm new to Dask, and the way columns are dropped confuses me. I've read a CSV file into a Dask dataframe. Then suppose I have this:
print(len(columns_to_drop)) # There are 66
print(len(list(df.columns))) # The Dask columns before the drop
df.drop(columns_to_drop, axis=1).compute() # Drop the columns
pd_df = df.compute() # Create a Pandas dataframe
print(pd_df.shape[1]) # Pandas dataframe columns
print(len(list(df.columns))) # The Dask columns after the drop
What I get from the print statements:
- 66 columns to drop
- 207 Dask df columns before the drop
- 207 Pandas column count
- 207 Dask columns after the drop
CodePudding user response:
By default, drop() returns a copy of the original dataframe with the specified columns removed; it does not modify the original. In pandas you could pass inplace=True, but Dask's drop() has no inplace parameter, so assign the result back to df instead:
df = df.drop(columns_to_drop, axis=1)
CodePudding user response:
Assuming that the dataframe fits into memory, this should do the trick:
df = df.drop(columns_to_drop, axis=1) # Drop the columns
pd_df = df.compute() # Create a Pandas dataframe