I have a large data frame with many columns (130 of them) and a datetime index with millisecond resolution. I want to make some columns' values empty for now. I don't want to delete those columns, as I may use them in the future.
I tried two methods.
Trial 1: using "" - but this converts the columns to strings
# Make unused columns empty (dummy)
def make_not_used_columns_nan(df):
    dummy_cols = [0, 2, 3, 4, 5, 8, 9, 14, 28, 29, 31, 32, 33, 34, 35, 36, 37, 38, 39,
                  40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 51, 52, 53, 54, 57, 58, 59, 63,
                  64, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
                  85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 99, 107,
                  108, 111, 112, 113, 114, 115, 116, 117, 118, 124, 125, 126, 127, 128, 129]
    df[dummy_cols] = ""
    return df

df = make_not_used_columns_nan(df)
Trial 2: using np.nan
def make_not_used_columns_nan(df):
    dummy_cols = [0, 2, 3, 4, 5, 8, 9, 14, 28, 29, 31, 32, 33, 34, 35, 36, 37, 38, 39,
                  40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 51, 52, 53, 54, 57, 58, 59, 63,
                  64, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
                  85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 99, 107,
                  108, 111, 112, 113, 114, 115, 116, 117, 118, 124, 125, 126, 127, 128, 129]
    df[dummy_cols] = np.nan  # np.NaN was removed in NumPy 2.0; use np.nan
    df[dummy_cols] = df[dummy_cols].astype('Int32')  # nullable integer dtype keeps NaN support
    return df

df = make_not_used_columns_nan(df)
Initial df
DatetimeIndex: 4515 entries, 2022-07-20 09:02:31.120000 to 2022-07-20 11:02:20.817000
Columns: 130 entries, 0 to 129
dtypes: int16(17), int8(113)
memory usage: 683.4 KB
Trial 1 df - using ""
DatetimeIndex: 4515 entries, 2022-07-20 09:02:31.120000 to 2022-07-20 11:02:20.817000
Columns: 130 entries, 0 to 129
dtypes: int16(17), int8(29), object(84)
memory usage: 3.2 MB
Trial 2 df - using np.nan
DatetimeIndex: 4515 entries, 2022-07-20 09:02:31.120000 to 2022-07-20 11:02:20.817000
Columns: 130 entries, 0 to 129
dtypes: Int32(84), int16(17), int8(29)
memory usage: 2.1 MB
What is the best way to empty a column while keeping memory usage low?
CodePudding user response:
The lowest memory usage I could find is using categorical data:
df[dummy_cols] = np.nan
df[dummy_cols] = df[dummy_cols].astype('category')
Example based on your data:
import pandas as pd
import numpy as np

# Rebuild a frame shaped like yours: 4515 rows, 130 int8 columns,
# millisecond-resolution DatetimeIndex
df = pd.DataFrame(np.full((4515, 130), 1, dtype=np.int8),
                  index=pd.date_range('2022-07-20 09:02:31.120',
                                      '2022-07-20 11:02:20.817', periods=4515))
df.iloc[:, -17:] = df.iloc[:, -17:].astype(np.int16)
df.info()
# dtypes: int16(17), int8(113)
# memory usage: 683.4 KB

df.iloc[:, :84] = np.nan
df.iloc[:, :84] = df.iloc[:, :84].astype('category')
df.info()
# dtypes: category(84), int16(17), int8(29)
# memory usage: 692.3 KB
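For a rough sense of why category wins here, a minimal sketch comparing the per-column cost of an all-NaN column under the three dtypes (exact byte counts vary by pandas version):

import pandas as pd
import numpy as np

s = pd.Series(np.nan, index=range(4515))
# category: int8 codes only (all -1, no categories) -> ~1 byte per row
print(s.astype('category').memory_usage(deep=True))
# nullable Int8: int8 values + boolean mask -> ~2 bytes per row
print(s.astype('Int8').memory_usage(deep=True))
# nullable Int32: int32 values + boolean mask -> ~5 bytes per row
print(s.astype('Int32').memory_usage(deep=True))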
CodePudding user response:
As stated in the NumPy documentation, an ndarray is a one-dimensional data buffer (taking up space proportional to the amount of data) combined with a fixed-size indexing scheme, so every element has to occupy storage. The best you can do is therefore shrink each unused column to a 1-byte dtype, like bool or byte.
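Along those lines, a hedged sketch (assuming a sentinel value like 0 can stand in for "empty" in your data): keep the original int8 dtype and simply overwrite the unused columns, so memory stays at the 1-byte-per-entry floor and no dtype conversion happens.

import numpy as np

# Reuses df and dummy_cols from the question.
# Assumption: 0 is not a meaningful value in these columns.
df[dummy_cols] = np.zeros((len(df), len(dummy_cols)), dtype=np.int8)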
As a side note, keep in mind that changing a column's dtype can be pretty slow, since every entry has to be copied into a new buffer. I would suggest checking whether this memory saving actually helps you, since you will need to repopulate the columns later.
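To make that concrete, a minimal sketch of the round trip for one emptied column (the column name and refill values are made up):

import pandas as pd
import numpy as np

df = pd.DataFrame({'col': pd.Series(np.nan, index=range(4515)).astype('category')})
# Refilling later rebuilds the whole column in a new numeric buffer
df['col'] = np.ones(4515, dtype=np.int8)
df.info()  # back to int8, at the cost of a full copy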