I am trying to add many columns to a pandas dataframe as follows:
import logging

logger = logging.getLogger(__name__)

def create_sum_rounds(df, col_name_base):
    '''
    Create a summed column in df from base columns. For example,
    df['sum_foo'] = df['foo_1'] + df['foo_2'] + df['foo_3'] + \
                    df['foo_4'] + df['foo_5']
    '''
    out_name = 'sum_' + col_name_base
    df[out_name] = 0.0
    for i in range(1, 6):
        col_name = col_name_base + '_' + str(i)  # e.g. 'foo_1'
        if col_name in df:
            df[out_name] += df[col_name]
        else:
            logger.error('Col %s not in df' % col_name)

for col in sum_cols_list:
    create_sum_rounds(df, col)
Here sum_cols_list is a list of ~200 base column names (e.g. "foo"), and df is a pandas DataFrame that contains each base column suffixed with 1 through 5 (e.g. "foo_1", "foo_2", ..., "foo_5").
I'm getting a performance warning when I run this snippet:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
I believe this is because creating a new column is actually calling an insert operation behind the scenes. What's the right way to use pd.concat in this case?
CodePudding user response:
You can use the same approach, but instead of operating directly on the DataFrame, store each output as its own pd.Series. Then, when all of the computations are done, use pd.concat to glue everything back onto your original DataFrame.
(untested, but should work)
import logging

import pandas as pd

logger = logging.getLogger(__name__)

def create_sum_rounds(df, col_name_base):
    '''
    Create a summed column from base columns in df. For example,
    df['sum_foo'] = df['foo_1'] + df['foo_2'] + df['foo_3'] + \
                    df['foo_4'] + df['foo_5']
    '''
    # Accumulate into a standalone Series so that df itself is never modified.
    out = pd.Series(0.0, name='sum_' + col_name_base, index=df.index)
    for i in range(1, 6):
        col_name = col_name_base + '_' + str(i)  # e.g. 'foo_1'
        if col_name in df:
            out += df[col_name]
        else:
            logger.error('Col %s not in df' % col_name)
    return out

col_sums = []
for col in sum_cols_list:
    col_sums.append(create_sum_rounds(df, col))

# A single concat adds all of the new columns at once.
new_df = pd.concat([df, *col_sums], axis=1)
Additionally, you can simplify your existing code (if you're willing to forgo your logging):
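import re

import pandas as pd

def create_sum_rounds(df, col_name_base):
    '''
    Create a summed column from base columns in df. For example,
    df['sum_foo'] = df['foo_1'] + df['foo_2'] + df['foo_3'] + \
                    df['foo_4'] + df['foo_5'] ...
    '''
    # Anchor the pattern so that only '<base>_<digits>' labels match.
    pattern = rf'^{re.escape(col_name_base)}_\d+$'
    # Name the result so that the concatenated column is not just labelled 0.
    return df.filter(regex=pattern).sum(axis=1).rename('sum_' + col_name_base)

col_sums = []
for col in sum_cols_list:
    col_sums.append(create_sum_rounds(df, col))

new_df = pd.concat([df, *col_sums], axis=1)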
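One detail worth checking with the regex approach: DataFrame.filter(regex=...) keeps every label for which re.search finds a match, so an unanchored pattern also picks up labels that merely contain the base name. The sketch below illustrates the difference on a made-up frame; the column names (including the decoys 'sum_foo_1' and 'foo_10x') are assumptions for illustration only.

```python
import re

import pandas as pd

# Hypothetical frame: 'sum_foo_1' and 'foo_10x' are decoys that only
# contain the text 'foo_<digit>' somewhere in the label.
df = pd.DataFrame({
    'foo_1': [1, 2],
    'foo_2': [10, 20],
    'sum_foo_1': [100, 200],
    'foo_10x': [1000, 2000],
})

base = 'foo'

# Unanchored: matches any label containing 'foo_' followed by a digit.
loose = list(df.filter(regex=rf'{base}_\d+').columns)

# Anchored and escaped: matches only labels of the exact form '<base>_<digits>'.
strict = list(df.filter(regex=rf'^{re.escape(base)}_\d+$').columns)

print(loose)
print(strict)
```

Whether the stricter pattern is needed depends on how the real column names are laid out; with ~200 bases, overlapping names seem plausible enough to guard against.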
CodePudding user response:
Simplify :-)
def create_sum_rounds(df, col_name_base):
    '''
    Create a summed column in df from base columns. For example,
    df['sum_foo'] = df['foo_1'] + df['foo_2'] + df['foo_3'] + \
                    df['foo_4'] + df['foo_5']
    '''
    out_name = 'sum_' + col_name_base
    # The trailing underscore keeps a base like 'foo' from matching 'foobar_1'.
    df[out_name] = df.loc[:, [x for x in df.columns if x.startswith(col_name_base + '_')]].sum(axis=1)
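Note that assigning df[out_name] once per base still inserts one column at a time. If all of the round columns are numeric, one possible alternative is to compute every sum in a single pass by grouping the column labels on their base name, then attaching the result with one pd.concat. This is only a sketch under those assumptions: the frame, the 'label' column, and the '<base>_<digits>' naming rule are made up for illustration, and the transpose copies the data, so it may not be the fastest option on a large frame.

```python
import re

import pandas as pd

# Hypothetical frame with two bases ('foo', 'bar') and one unrelated column.
df = pd.DataFrame({
    'foo_1': [1, 2],
    'foo_2': [10, 20],
    'bar_1': [100, 200],
    'bar_2': [1000, 2000],
    'label': ['a', 'b'],
})

# Keep only columns shaped like '<base>_<digits>'.
round_cols = [c for c in df.columns if re.fullmatch(r'.+_\d+', c)]

# Transpose so the column labels become the index, group the rows by base
# name (the part before the last underscore), sum, and transpose back.
sums = (
    df[round_cols]
    .T
    .groupby(lambda c: c.rsplit('_', 1)[0])
    .sum()
    .T
    .add_prefix('sum_')
)

# One concat adds every summed column at once.
new_df = pd.concat([df, sums], axis=1)
print(new_df)
```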
CodePudding user response:
Would this get you the results you are expecting?
import pandas as pd

df = pd.DataFrame({
    'Foo_1': [1, 2, 3, 4, 5],
    'Foo_2': [10, 20, 30, 40, 50],
    'Something': ['A', 'B', 'C', 'D', 'E'],
})

# Sum every column whose label contains 'Foo_'.
df['Foo_Sum'] = df.filter(like='Foo_').sum(axis=1)
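For the ~200 bases in the question, repeating that assignment per base would again add one column at a time. A minimal sketch of the same filter call combined with a single pd.concat follows; the prefixes list and the 'Bar_1' column are hypothetical stand-ins for the real base names.

```python
import pandas as pd

df = pd.DataFrame({
    'Foo_1': [1, 2, 3],
    'Foo_2': [10, 20, 30],
    'Bar_1': [100, 200, 300],
    'Something': ['A', 'B', 'C'],
})

prefixes = ['Foo', 'Bar']  # hypothetical stand-in for the ~200 base names

# Build every summed column first, without touching df.
sums = {
    f'{p}_Sum': df.filter(like=f'{p}_').sum(axis=1)
    for p in prefixes
}

# Attach all of them with one concat.
new_df = pd.concat([df, pd.DataFrame(sums)], axis=1)
print(new_df)
```

Because the sums are computed before anything is added to df, a label such as 'Foo_Sum' cannot be picked up by a later filter(like='Foo_') call.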