Best Practice for Adding Lots of Columns to Pandas DataFrame


I am trying to add many columns to a pandas dataframe as follows:

def create_sum_rounds(df, col_name_base):
    '''
    Create a summed column in df from base columns. For example,
    df['sum_foo'] = df['foo_1'] + df['foo_2'] + df['foo_3'] + \
                    df['foo_4'] + df['foo_5']
    '''
    out_name = 'sum_' + col_name_base
    df[out_name] = 0.0
    for i in range(1, 6):
        col_name = col_name_base + str(i)
        if col_name in df:
            df[out_name] += df[col_name]
        else:
            logger.error('Col %s not in df' % col_name)

for col in sum_cols_list:
    create_sum_rounds(df, col)

Where sum_cols_list is a list of ~200 base column names (e.g. "foo"), and df is a pandas dataframe which includes the base columns extended with 1 through 5 (e.g. "foo_1", "foo_2", ..., "foo_5").

I'm getting a performance warning when I run this snippet:

PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`

I believe this is because creating a new column is actually calling an insert operation behind the scenes. What's the right way to use pd.concat in this case?

CodePudding user response:

You can use your same approach, but instead of operating directly on the DataFrame, you'll need to store each output as its own pd.Series. Then when all of the computations are done, use pd.concat to glue everything back to your original DataFrame.

(untested, but should work)

import logging
import pandas as pd

logger = logging.getLogger(__name__)

def create_sum_rounds(df, col_name_base):
    '''
    Create a summed column in df from base columns. For example,
    df['sum_foo'] = df['foo_1'] + df['foo_2'] + df['foo_3'] + \
                    df['foo_4'] + df['foo_5']
    '''
    out = pd.Series(0, name='sum_' + col_name_base, index=df.index)
    for i in range(1, 6):
        col_name = col_name_base + str(i)
        if col_name in df:
            out += df[col_name]
        else:
            logger.error('Col %s not in df' % col_name)
    return out

col_sums = []
for col in sum_cols_list:
    col_sums.append(create_sum_rounds(df, col))
new_df = pd.concat([df, *col_sums], axis=1)
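A quick check of this approach on a toy frame (the column names and values here are invented for illustration):

```python
import logging
import pandas as pd

logger = logging.getLogger(__name__)

def create_sum_rounds(df, col_name_base):
    # Accumulate into a standalone Series instead of inserting into df,
    # so the DataFrame is only extended once, via pd.concat.
    out = pd.Series(0.0, name='sum_' + col_name_base, index=df.index)
    for i in range(1, 6):
        col_name = col_name_base + str(i)
        if col_name in df:
            out += df[col_name]
        else:
            logger.error('Col %s not in df' % col_name)
    return out

# One row, columns foo1..foo5 holding 1..5, so sum_foo should be 15.
df = pd.DataFrame({f'foo{i}': [i] for i in range(1, 6)})
new_df = pd.concat([df, create_sum_rounds(df, 'foo')], axis=1)
print(new_df['sum_foo'].tolist())  # [15.0]
```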

Additionally, you can simplify your existing code (if you're willing to forgo the logging):

import pandas as pd

def create_sum_rounds(df, col_name_base):
    '''
    Create a summed column in df from base columns. For example,
    df['sum_foo'] = df['foo_1'] + df['foo_2'] + df['foo_3'] + \
                    df['foo_4'] + df['foo_5'] + ...
    '''
    # Name the result so pd.concat labels the new column correctly.
    return (df.filter(regex=rf'{col_name_base}_\d+')
              .sum(axis=1)
              .rename('sum_' + col_name_base))

col_sums = []
for col in sum_cols_list:
    col_sums.append(create_sum_rounds(df, col))
new_df = pd.concat([df, *col_sums], axis=1)
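As a quick check of the filter-based version (toy data, invented for the example; note the summed Series needs a name, via rename, for pd.concat to label the new column):

```python
import pandas as pd

def create_sum_rounds(df, col_name_base):
    # Select every column matching e.g. 'foo_1', 'foo_2', ... and sum row-wise.
    return (df.filter(regex=rf'{col_name_base}_\d+')
              .sum(axis=1)
              .rename('sum_' + col_name_base))

df = pd.DataFrame({'foo_1': [1, 2], 'foo_2': [10, 20],
                   'bar_1': [3, 4], 'bar_2': [30, 40]})
sum_cols_list = ['foo', 'bar']
col_sums = [create_sum_rounds(df, col) for col in sum_cols_list]
new_df = pd.concat([df, *col_sums], axis=1)
print(new_df[['sum_foo', 'sum_bar']])
```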

CodePudding user response:

Simplify :-)

def create_sum_rounds(df, col_name_base):
    '''
    Create a summed column in df from base columns. For example,
    df['sum_foo'] = df['foo_1'] + df['foo_2'] + df['foo_3'] + \
                    df['foo_4'] + df['foo_5']
    '''
    out_name = 'sum_' + col_name_base
    df[out_name] = df.loc[:, [x for x in df.columns if x.startswith(col_name_base)]].sum(axis=1)
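A quick check of this variant on invented toy data. One caveat: it still assigns each sum column directly onto df, one at a time, so with ~200 base columns the fragmentation warning can still appear:

```python
import pandas as pd

def create_sum_rounds(df, col_name_base):
    out_name = 'sum_' + col_name_base
    # Select every column whose name starts with the base, then sum row-wise.
    df[out_name] = df.loc[:, [x for x in df.columns if x.startswith(col_name_base)]].sum(axis=1)

df = pd.DataFrame({'foo_1': [1, 2], 'foo_2': [10, 20], 'other': [0, 0]})
create_sum_rounds(df, 'foo')
print(df['sum_foo'].tolist())  # [11, 22]
```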

CodePudding user response:

Would this get you the results you are expecting?

df = pd.DataFrame({
    'Foo_1' : [1, 2, 3, 4, 5],
    'Foo_2' : [10, 20, 30, 40, 50],
    'Something' : ['A', 'B', 'C', 'D', 'E']
})

df['Foo_Sum'] = df.filter(like = 'Foo_').sum(axis = 1)
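Running that snippet does give the row-wise sum of the Foo_ columns (the non-matching 'Something' column is ignored); a quick check:

```python
import pandas as pd

df = pd.DataFrame({
    'Foo_1' : [1, 2, 3, 4, 5],
    'Foo_2' : [10, 20, 30, 40, 50],
    'Something' : ['A', 'B', 'C', 'D', 'E']
})
df['Foo_Sum'] = df.filter(like = 'Foo_').sum(axis = 1)
print(df['Foo_Sum'].tolist())  # [11, 22, 33, 44, 55]
```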