I have a large data frame with many columns (130 of them) and a datetime index with millisecond resolution. I want to make some columns' values empty for now. I don't want to delete those columns, as I may use them in the future.
I tried two methods.
Trial 1: using "" - but this converts the columns to strings
# Make unused columns empty (dummy)
def make_not_used_columns_nan(df):
    dummy_cols = [0, 2, 3, 4, 5, 8, 9, 14, 28, 29, 31, 32, 33, 34, 35, 36, 37, 38, 39,
                  40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 51, 52, 53, 54, 57, 58, 59, 63,
                  64, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
                  85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 99, 107,
                  108, 111, 112, 113, 114, 115, 116, 117, 118, 124, 125, 126, 127, 128, 129]
    df[dummy_cols] = ""
    return df

df = make_not_used_columns_nan(df)
Trial 2: using np.nan
def make_not_used_columns_nan(df):
    dummy_cols = [0, 2, 3, 4, 5, 8, 9, 14, 28, 29, 31, 32, 33, 34, 35, 36, 37, 38, 39,
                  40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 51, 52, 53, 54, 57, 58, 59, 63,
                  64, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
                  85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 99, 107,
                  108, 111, 112, 113, 114, 115, 116, 117, 118, 124, 125, 126, 127, 128, 129]
    df[dummy_cols] = np.nan  # np.NaN was removed in NumPy 2.0; use np.nan
    df[dummy_cols] = df[dummy_cols].astype('Int32')  # nullable integer dtype keeps NaN support
    return df

df = make_not_used_columns_nan(df)
Initial df
DatetimeIndex: 4515 entries, 2022-07-20 09:02:31.120000 to 2022-07-20 11:02:20.817000
Columns: 130 entries, 0 to 129
dtypes: int16(17), int8(113)
memory usage: 683.4 KB
Trial 1 df - using ""
DatetimeIndex: 4515 entries, 2022-07-20 09:02:31.120000 to 2022-07-20 11:02:20.817000
Columns: 130 entries, 0 to 129
dtypes: int16(17), int8(29), object(84)
memory usage: 3.2 MB
Trial 2 df - using np.nan
DatetimeIndex: 4515 entries, 2022-07-20 09:02:31.120000 to 2022-07-20 11:02:20.817000
Columns: 130 entries, 0 to 129
dtypes: Int32(84), int16(17), int8(29)
memory usage: 2.1 MB
What is the best way to empty a column while keeping memory usage low?
CodePudding user response:
The lowest memory usage I could find is using categorical data:
df[dummy_cols] = np.nan
df[dummy_cols] = df[dummy_cols].astype('category')
Example based on your data:
import pandas as pd
import numpy as np

# Rebuild a frame shaped like yours: 4515 rows, 130 int8 columns,
# millisecond-resolution DatetimeIndex
df = pd.DataFrame(np.full((4515, 130), 1, dtype=np.int8),
                  index=pd.date_range('2022-07-20 09:02:31.120',
                                      '2022-07-20 11:02:20.817', periods=4515))
df.iloc[:, -17:] = df.iloc[:, -17:].astype(np.int16)
df.info()
# dtypes: int16(17), int8(113)
# memory usage: 683.4 KB

df.iloc[:, :84] = np.nan
df.iloc[:, :84] = df.iloc[:, :84].astype('category')
df.info()
# dtypes: category(84), int16(17), int8(29)
# memory usage: 692.3 KB
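For a rough sense of why category wins here, a minimal sketch comparing the per-column cost of an all-NaN column under the three dtypes (exact byte counts vary by pandas version):

import pandas as pd
import numpy as np

s = pd.Series(np.nan, index=range(4515))
# category: int8 codes only (all -1, no categories) -> ~1 byte per row
print(s.astype('category').memory_usage(deep=True))
# nullable Int8: int8 values + boolean mask -> ~2 bytes per row
print(s.astype('Int8').memory_usage(deep=True))
# nullable Int32: int32 values + boolean mask -> ~5 bytes per row
print(s.astype('Int32').memory_usage(deep=True))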
CodePudding user response:
As stated in the NumPy documentation, an ndarray is a one-dimensional data buffer (taking up space proportional to the amount of data) combined with a fixed-size indexing scheme, so every element has to occupy storage. The best you can do is therefore shrink each unused column to a 1-byte dtype, like bool or byte.
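Along those lines, a hedged sketch (assuming a sentinel value like 0 can stand in for "empty" in your data): keep the original int8 dtype and simply overwrite the unused columns, so memory stays at the 1-byte-per-entry floor and no dtype conversion happens.

import numpy as np

# Reuses df and dummy_cols from the question.
# Assumption: 0 is not a meaningful value in these columns.
df[dummy_cols] = np.zeros((len(df), len(dummy_cols)), dtype=np.int8)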
As a side note, keep in mind that changing a column's dtype can be pretty slow, since every entry has to be copied into a new buffer. I would suggest checking whether this memory saving actually helps you, since you will need to repopulate the columns later.
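To make that concrete, a minimal sketch of the round trip for one emptied column (the column name and refill values are made up):

import pandas as pd
import numpy as np

df = pd.DataFrame({'col': pd.Series(np.nan, index=range(4515)).astype('category')})
# Refilling later rebuilds the whole column in a new numeric buffer
df['col'] = np.ones(4515, dtype=np.int8)
df.info()  # back to int8, at the cost of a full copy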