Home > Back-end >  Pandas DataFrame: Same operation on multiple sets of columns
Pandas DataFrame: Same operation on multiple sets of columns

Time:03-23

I want to do the same operation on multiple sets of columns of a DataFrame.

Since "for-loops" are frowned upon I'm searching for a decent alternative.

An example:

df = pd.DataFrame({
    'a': [1, 11, 111],
    'b': [222, 22, 2],
    'a_x': [10, 80, 30],
    'b_x': [20, 20, 60],
})

This is a simple for-loop approach. It's short and well readable.

cols = ['a', 'b']
for col in cols:
    df[f'{col}_res'] = df[[col, f'{col}_x']].min(axis=1)
    a   b   a_x b_x a_res b_res
0   1   222 10  20  1     20
1   11  22  80  20  11    20
2   111 2   30  60  30    2

This is an alternative (w/o for-loop), but I feel that the additional complexity is not really for the better.

cols = ['a', 'b']
def res_df(df, col, name):
    res = pd.Series(
        df[[col, f'{col}_x']].min(axis=1), index=df.index, name=name)
    return res

res = [res_df(df, col, f'{col}_res') for col in cols]
df = pd.concat([df, pd.concat(res, axis=1)], axis=1)

Does anyone have a better/more pythonic solution?
Thanks!


UPDATE 1
Inspired by the proposal from mozway I find the following solution quite appealing.
Imho it's short, readable and generic, since the particular operation can be swapped into a function and the list comprehension applies the function to the given sets of columns.

def operation(s1, s2):
    # fill in any operation on pandas series'
    # e.g. res = s1 * s2 / (s1   s2)
    res = np.minimum(s1, s2)
    return res
df = df.join(
    [operation(df[f'{col}'], df[f'{col}_x']).rename(f'{col}_res') for col in cols]
)

CodePudding user response:

You can use numpy.minimum after setting the arrays to identical column names:

cols = ['a', 'b']
cols2 = [f'{x}_x' for x in cols]
df = df.join(np.minimum(df[cols],
                        df[cols2].set_axis(cols, axis=1))
               .add_suffix('_res'))

output:

     a    b  a_x  b_x  a_res  b_res
0    1  222   10   20      1     20
1   11   22   80   20     11     20
2  111    2   30   60     30      2

or, using rename as suggested in the other answer:

cols = ['a', 'b']
cols2 = {f'{x}_x': x for x in cols}
df = df.join(np.minimum(df[cols],
                        df[list(cols2)].rename(columns=cols2))
               .add_suffix('_res'))

CodePudding user response:

One idea is rename columns names by dictionary, select columns by list cols and then group by columns names with aggregate min, sum, max or use custom function:

cols = ['a', 'b']
suffix = '_x'

d = {f'{x}{suffix}':x for x in cols}
print (d)
{'a_x': 'a', 'b_x': 'b'}

print (df.rename(columns=d)[cols])
     a   a    b   b
0    1  10  222  20
1   11  80   22  20
2  111  30    2  60

df1 = df.rename(columns=d)[cols].groupby(axis=1,level=0).min().add_suffix('_res')
print (df1)
   a_res  b_res
0      1     20
1     11     20
2     30      2

Last add to original DataFrame:

df = df.join(df1)
print (df)
     a    b  a_x  b_x  a_res  b_res
0    1  222   10   20      1     20
1   11   22   80   20     11     20
2  111    2   30   60     30      2
  • Related