I have the following two data frames, loaded via pandas `read_csv`.
EDIT: I've added the DataFrame constructors for easier reproduction.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(columns=['var_id', 'functions', 'num_functions', 'influence'])
df2 = pd.DataFrame(columns=['var_id', 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24], index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
influence_factor = 10
# df1 construct
df1['functions'] = [[1, 2, 5, 11, 12, 16, 17],[10, 11],[11, 19],[19],[11],[2],[11, 19],[19],[19],[11, 19],[11, 19]]
df1['var_id'] = ['AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC005','AA_ABC005','AA_ABC005','AA_ABC005','AA_ABC004','AA_ABC004','AA_ABC003']
df1['num_functions'] = df1.functions.map(len)
df1['influence'] = (influence_factor / df1['num_functions']).round(decimals=2)
# df2 construct
df2['var_id'] = ['AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC005','AA_ABC005','AA_ABC005','AA_ABC005','AA_ABC004','AA_ABC004','AA_ABC003']
df2 = df2.fillna(1.0)
# Split the function lists into separate columns for easier handling
function_split = df1.functions.apply(pd.Series)
function_split = function_split.dropna(how='all')
df1 = df1.join(function_split)
df1_col = ['var_id', 'functions', 'num_functions', 'influence', 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
df1 = df1.reindex(columns=df1_col)
## Remove NaN
df1 = df1.fillna('')
df1
var_id functions num_functions influence 0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 AA_ABC006 [1, 2, 5, 11, 12, 16, 17] 7 1.429 1 2 5 7 9
1 AA_ABC006 [10, 11] 2 5.000 4 8
2 AA_ABC006 [11, 19] 2 5.000 1 2
3 AA_ABC006 [19] 1 10.00 9
4 AA_ABC005 [11] 1 10.00 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1964 XX_ABC004 [2, 11, 20] 3 3.333 2 11 20
1965 XX_ABC003 [19] 1 10.000 19
1966 XX_ABC003 [2, 11, 20] 3 3.333 2 11 20
1967 XX_ABC004 [2, 11, 20] 3 3.333 2 11 20
1968 XX_ABC003 [2, 11, 20] 3 3.333 2 11 20
df2 (where header numbers equal function numbers from df1)
0 1 2 3 4 5 6 7 8 9 ... 15 16 17 18 19 20 21 22 23 24
AA_ABC006 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
AA_ABC005 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
XX_ABC004 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
XX_ABC003 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
What I would like to achieve is a function that iterates through both DFs and checks for function matches. E.g., for df1 var_id AA_ABC006 with functions 1, 2, 5, etc., it should subtract 1.429 percent from the 1.0 in df2 columns 1, 2, 5, etc., and apply this over all rows/columns.
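To make the per-cell arithmetic concrete (this interprets influence as a percentage, which is an assumption the expected output below seems to confirm): for AA_ABC006, the rows with functions [11, 19] and [19] both contain function 19, so their influences stack in that column.

```python
# hand-check of one cell: var_id AA_ABC006, function column 19
# influences 5.00 (from row [11, 19]) and 10.00 (from row [19]) both apply
influences = [5.00, 10.00]
cell = 1.0 - sum(influences) / 100  # subtract the summed percentage from 1.0
print(round(cell, 2))  # 0.85
```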
desired result for this example would be
DF_result
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | … | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AA_ABC006 | 1,00 | 0,99 | 0,99 | 1,00 | 1,00 | 0,99 | 1,00 | 1,00 | 1,00 | 1,00 | … | 1,00 | 0,99 | 0,99 | 1,00 | 0,85 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 |
| AA_ABC005 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | … | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| XX_ABC004 | 1,00 | 1,00 | 0,93 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | … | 1,00 | 1,00 | 1,00 | 1,00 | 0,95 | 0,93 | 1,00 | 1,00 | 1,00 | 1,00 |
| XX_ABC003 | 1,00 | 1,00 | 0,93 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | … | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 0,93 | 1,00 | 1,00 | 1,00 | 1,00 |
CodePudding user response:
`df2` values for `var_id` should be unique, IIUC. I changed the constructor for this df (`var_id` ends up being the index).
You can build a dataframe of the values to be subtracted out of the `df1` rows using `concat`:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(columns=['var_id', 'functions', 'num_functions', 'influence'])
influence_factor = 10
# df1 construct
df1['functions'] = [[1, 2, 5, 11, 12, 16, 17],[10, 11],[11, 19],[19],[11],[2],[11, 19],[19],[19],[11, 19],[11, 19]]
df1['var_id'] = ['AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC005','AA_ABC005','AA_ABC005','AA_ABC005','AA_ABC004','AA_ABC004','AA_ABC003']
df1['num_functions'] = df1.functions.map(len)
df1['influence'] = (influence_factor / df1['num_functions']).round(decimals=2)
# df2 construct
ids = df1['var_id'].unique()
df2 = pd.DataFrame(np.ones((len(ids), 25)), index=ids)
# one single-row frame per df1 row: the row's influence repeated under its function columns
df_tmp = pd.concat([
    pd.DataFrame([row['influence'] * np.ones(row['num_functions'])],
                 columns=row['functions'], index=[row['var_id']])
    for _, row in df1.iterrows()
])
# sum the influences per var_id, convert percent to fraction, subtract from df2;
# fillna(1.0) restores the columns that no function list touched
print((df2 - df_tmp.groupby(level=0).sum() / 100).fillna(1.0).round(decimals=2))
Output:
0 1 2 3 4 5 6 7 8 9 ... 15 16 17 18 19 20 21 22 23 24
AA_ABC003 1.0 1.00 1.00 1.0 1.0 1.00 1.0 1.0 1.0 1.0 ... 1.0 1.00 1.00 1.0 0.95 1.0 1.0 1.0 1.0 1.0
AA_ABC004 1.0 1.00 1.00 1.0 1.0 1.00 1.0 1.0 1.0 1.0 ... 1.0 1.00 1.00 1.0 0.85 1.0 1.0 1.0 1.0 1.0
AA_ABC005 1.0 1.00 0.90 1.0 1.0 1.00 1.0 1.0 1.0 1.0 ... 1.0 1.00 1.00 1.0 0.85 1.0 1.0 1.0 1.0 1.0
AA_ABC006 1.0 0.99 0.99 1.0 1.0 0.99 1.0 1.0 1.0 1.0 ... 1.0 0.99 0.99 1.0 0.85 1.0 1.0 1.0 1.0 1.0
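An alternative sketch that avoids `iterrows` entirely (under the same assumptions: influence is a percentage, `var_id` is unique in `df2`) is to `explode` the function lists into long form and `pivot_table` them back into a `var_id` × function grid of summed influences:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    'var_id': ['AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC005','AA_ABC005',
               'AA_ABC005','AA_ABC005','AA_ABC004','AA_ABC004','AA_ABC003'],
    'functions': [[1, 2, 5, 11, 12, 16, 17],[10, 11],[11, 19],[19],[11],[2],
                  [11, 19],[19],[19],[11, 19],[11, 19]],
})
df1['num_functions'] = df1['functions'].map(len)
df1['influence'] = (10 / df1['num_functions']).round(2)

ids = df1['var_id'].unique()
df2 = pd.DataFrame(np.ones((len(ids), 25)), index=ids)

# one function per row, then sum the influences per (var_id, function) cell
long = df1.explode('functions')
long['functions'] = long['functions'].astype(int)
sub = long.pivot_table(index='var_id', columns='functions',
                       values='influence', aggfunc='sum')

# align to df2's full shape, treat missing combinations as zero influence
sub = sub.reindex(index=df2.index, columns=df2.columns, fill_value=0).fillna(0)
result = (df2 - sub / 100).round(2)
print(result)
```

This produces the same grid as the `concat` approach; whether it is faster will depend on the data, but it keeps the whole computation in aligned DataFrame operations.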