Apply a function on two pandas tables

I have the following two tables:

>>> df1 = pd.DataFrame(data={'1': ['john', '10', 'john'],
...                         '2': ['mike', '30', 'ana'],
...                         '3': ['ana', '20', 'mike'],
...                         '4': ['eve', 'eve', 'eve'],
...                         '5': ['10', np.nan, '10'],
...                         '6': [np.nan, np.nan, '20']},
...                   index=pd.Series(['ind1', 'ind2', 'ind3'], name='index'))
>>> df1
        1     2     3    4    5    6
index
ind1   john  mike   ana  eve   10  NaN
ind2     10    30    20  eve  NaN  NaN
ind3   john   ana  mike  eve   10   20


>>> df2 = pd.DataFrame(data={'first_n': [4, 4, 3]},
...                    index=pd.Series(['ind1', 'ind2', 'ind3'], name='index'))
>>> df2
    first_n
index
ind1         4
ind2         4
ind3         3

I also have the following function that reverses a list and gets the first n non-NA elements:

def get_rev_first_n(row, top_n):
    rev_row = [x for x in row[::-1] if x == x]
    return rev_row[:top_n]

>>> get_rev_first_n(['john', 'mike', 'ana', 'eve', '10', np.nan], 4)
['10', 'eve', 'ana', 'mike']

How would I apply this function to the two tables so that it takes in both df1 and df2 and outputs either a list or columns?

CodePudding user response:

You can use apply with a lambda on each row of the DataFrame. Concatenate the two DataFrames column-wise with pd.concat, then call apply with axis=1 so that each row of the combined DataFrame is passed to your function together with that row's first_n value.

Full Code:

import pandas as pd
import numpy as np

def get_rev_first_n(row, top_n):
    # Reverse the row, drop NaN values (NaN != NaN), then keep the first top_n
    rev_row = [x for x in row[::-1] if x == x]
    return rev_row[:top_n]

df1 = pd.DataFrame(data={'1': ['john', '10', 'john'],
                         '2': ['mike', '30', 'ana'],
                         '3': ['ana', '20', 'mike'],
                         '4': ['eve', 'eve', 'eve'],
                         '5': ['10', np.nan, '10'],
                         '6': [np.nan, np.nan, '20']},
                   index=pd.Series(['ind1', 'ind2', 'ind3'], name='index'))


df2 = pd.DataFrame(data={'first_n': [4, 4, 3]},
                   index=pd.Series(['ind1', 'ind2', 'ind3'], name='index'))
# Align df2 to df1's index and append first_n as an extra column
df3 = pd.concat([df1, df2.reindex(df1.index)], axis=1)
# Drop first_n from the row before passing it, so it is not treated as data
df = df3.apply(lambda row: get_rev_first_n(row.drop('first_n'), row['first_n']), axis=1)
print(df)

Output:

index
ind1    [10, eve, ana, mike]
ind2       [eve, 20, 30, 10]
ind3           [20, 10, eve]
dtype: object
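
Since the question also asks about getting columns instead of a list, here is a minimal sketch of one way to do that. It assumes the original get_rev_first_n from the question (with [:top_n]) and pairs each row of df1 with its first_n from df2 rather than concatenating. The names first_n, lists and cols are only illustrative, and the resulting column labels are just the default integers.

```python
import numpy as np
import pandas as pd

def get_rev_first_n(row, top_n):
    # Reverse the row, drop NaN values (NaN != NaN), keep the first top_n
    rev_row = [x for x in row[::-1] if x == x]
    return rev_row[:top_n]

df1 = pd.DataFrame(data={'1': ['john', '10', 'john'],
                         '2': ['mike', '30', 'ana'],
                         '3': ['ana', '20', 'mike'],
                         '4': ['eve', 'eve', 'eve'],
                         '5': ['10', np.nan, '10'],
                         '6': [np.nan, np.nan, '20']},
                   index=pd.Series(['ind1', 'ind2', 'ind3'], name='index'))
df2 = pd.DataFrame(data={'first_n': [4, 4, 3]},
                   index=pd.Series(['ind1', 'ind2', 'ind3'], name='index'))

# Align first_n to df1's index so rows and counts pair up by label
first_n = df2['first_n'].reindex(df1.index)

# Build one list per row; row.tolist() gives a plain Python list to slice
lists = [get_rev_first_n(row.tolist(), n)
         for (_, row), n in zip(df1.iterrows(), first_n)]

# Expand the lists into columns; shorter lists are padded with missing values
cols = pd.DataFrame(lists, index=df1.index)
print(cols)
```

Rows with a smaller first_n end up with missing values in the trailing columns, so whether this layout is preferable to a Series of lists depends on what you do with the result afterwards.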

CodePudding user response:

df = pd.concat([df1, df2], axis=1)
df.apply(get_rev_first_n, args=(4,))  # pass 4 as top_n

With axis=0, which is the default and does not need to be specified, apply runs the function on each column. To run it on each row instead, pass axis=1.

args=(4,) is passed as the second argument (top_n) of get_rev_first_n.
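
To make the axis and args behaviour concrete, here is a small sketch on the question's data. It assumes the original get_rev_first_n logic (wrapped in list() so it accepts a Series or a list). The variable names seen_cols, seen_rows, fixed and per_row are only illustrative, and looking up first_n through row.name is one possible way to get a per-row value, not the only one.

```python
import numpy as np
import pandas as pd

def get_rev_first_n(row, top_n):
    # Reverse the values, drop NaN (NaN != NaN), keep the first top_n
    rev_row = [x for x in list(row)[::-1] if x == x]
    return rev_row[:top_n]

df1 = pd.DataFrame(data={'1': ['john', '10', 'john'],
                         '2': ['mike', '30', 'ana'],
                         '3': ['ana', '20', 'mike'],
                         '4': ['eve', 'eve', 'eve'],
                         '5': ['10', np.nan, '10'],
                         '6': [np.nan, np.nan, '20']},
                   index=pd.Series(['ind1', 'ind2', 'ind3'], name='index'))
df2 = pd.DataFrame(data={'first_n': [4, 4, 3]},
                   index=pd.Series(['ind1', 'ind2', 'ind3'], name='index'))

# axis=0 (default): the function receives each column, so .name is a column label
seen_cols = df1.apply(lambda s: s.name).tolist()

# axis=1: the function receives each row, so .name is an index label
seen_rows = df1.apply(lambda s: s.name, axis=1).tolist()

# Row-wise with one fixed top_n for every row, supplied through args
fixed = df1.apply(get_rev_first_n, axis=1, args=(2,))

# Row-wise with a per-row top_n, looked up in df2 by the row's index label
per_row = df1.apply(
    lambda row: get_rev_first_n(row, df2.loc[row.name, 'first_n']), axis=1)
print(per_row)
```

Because args holds fixed values, it cannot on its own supply a different top_n per row, which is why the last call uses a lambda instead.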