Apply function that operates on Pandas dataframes on a subset of the rows

Time:11-09

I have a function that receives a dataframe and returns a new dataframe, which is the same but with some added columns. Just as an example:

def arbitrary_function_that_adds_columns(df):
    # In this trivial example I am adding only 1 column, but this function may add an arbitrary number of columns.
    df['new column'] = df['A'] + df['B'] / 8 + df['A']**3
    return df

Applying this function to a whole dataframe is easy:

import pandas

df = pandas.DataFrame({'A': [1,2,3,4], 'B': [2,3,4,5]})

df = arbitrary_function_that_adds_columns(df)
print(df)

How do I apply the arbitrary_function_that_adds_columns function to a subset of the rows? I tried this:

import pandas

df = pandas.DataFrame({'A': [1,2,3,4], 'B': [2,3,4,5]})

rows = df['A'].isin({1,3})
df.loc[rows] = arbitrary_function_that_adds_columns(df.loc[rows])

print(df)

but I get the original dataframe back unchanged. The result I expect is

   A  B  new column
0  1  2         NaN
1  2  3      10.375
2  3  4         NaN
3  4  5      68.625

CodePudding user response:

With the example you've given:

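df['A+B'] = df.loc[df['A'].isin({1,3})].sum(axis=1)

or

import numpy as np

df['A+B'] = np.nan
# sum_AB is not defined in this thread; it stands for your own function that returns the new column
df.loc[df['A'].isin({1,3}),['A+B']] = sum_AB(df)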

More generally:

df.loc[ [row mask], [column mask] ] = [returned df of same shape]

#optionally, use fillna/bfill/ffill as appropriate

For more complicated stuff, take a look at DataFrame.transform and DataFrame.apply; combining those with df.loc and an appropriate boolean mask will accomplish what you need.
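
As a rough sketch of the mask-plus-function idea, assuming the function returns a copy that keeps the index of its input: apply it to the masked rows, then join only the newly created columns back onto the original frame by index. The operators in the formula are assumed to be `+` (which reproduces the numbers in the question), the mask on values {2, 4} is chosen to match the expected output, and `new_cols` is a name introduced here purely for illustration.

```python
import pandas as pd

def arbitrary_function_that_adds_columns(df):
    # Work on a copy so the caller's frame (or slice) is not mutated
    df = df.copy()
    # Operators assumed to be '+'; this reproduces 10.375 and 68.625
    df['new column'] = df['A'] + df['B'] / 8 + df['A']**3
    return df

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, 3, 4, 5]})

# Boolean mask selecting the rows to process
mask = df['A'].isin({2, 4})

# Run the function on the selected rows only
result = arbitrary_function_that_adds_columns(df.loc[mask])

# Keep just the columns the function added (hypothetical helper name)
new_cols = result.columns.difference(df.columns)

# Left join on the index: rows outside the mask get NaN in the new columns
df = df.join(result[new_cols])
print(df)
```

Because the join aligns on the index, this works for any number of added columns without naming them in advance.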

CodePudding user response:

Use DataFrame.combine_first

Note that, according to the expected output, you want rows=[1,3], not rows = df['A'].isin({1,3}). The latter selects all the rows whose 'A' value is either 1 or 3.
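
A quick way to see the difference, as a small sketch using the question's data (`by_label` and `by_value` are illustrative names): selecting by index label and selecting by the value of 'A' pick different rows.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, 3, 4, 5]})

# Label-based selection: rows whose index label is 1 or 3
by_label = df.loc[[1, 3]]
print(by_label.index.tolist())   # [1, 3]

# Value-based selection: rows whose 'A' value is 1 or 3
by_value = df.loc[df['A'].isin({1, 3})]
print(by_value.index.tolist())   # [0, 2]
```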

import pandas as pd 

def arbitrary_function_that_adds_columns(df):
    # make sure that the function doesn't mutate the original DataFrame
    # Otherwise, you will get a SettingWithCopyWarning 
    df = df.copy()

    df['new column'] = df['A'] + df['B'] / 8 + df['A']**3
    return df

df = pd.DataFrame({'A': [1,2,3,4], 'B': [2,3,4,5]})

rows = [1, 3]
# the function is applied to a copy of a DataFrame slice 
>>> sub_df = arbitrary_function_that_adds_columns(df.loc[rows])
>>> sub_df

   A  B  new column
1  2  3      10.375
3  4  5      68.625

# Add the new information to the original df 
>>> df = df.combine_first(sub_df)
>>> df

   A  B  new column
0  1  2         NaN
1  2  3      10.375
2  3  4         NaN
3  4  5      68.625