I have the following two data frames, loaded via pandas `read_csv`.
EDIT: I've added the DataFrame constructors for easier reproduction.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(columns=['var_id', 'functions', 'num_functions', 'influence'])
df2 = pd.DataFrame(columns=['var_id', 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24], index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
influence_factor = 10
# df1 construct
df1['functions'] = [[1, 2, 5, 11, 12, 16, 17],[10, 11],[11, 19],[19],[11],[2],[11, 19],[19],[19],[11, 19],[11, 19]]
df1['var_id'] = ['AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC005','AA_ABC005','AA_ABC005','AA_ABC005','AA_ABC004','AA_ABC004','AA_ABC003']
df1['num_functions'] = df1.functions.map(len)
df1['influence'] = (influence_factor / df1['num_functions']).round(decimals=2)
# df2 construct
df2['var_id'] = ['AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC005','AA_ABC005','AA_ABC005','AA_ABC005','AA_ABC004','AA_ABC004','AA_ABC003']
df2 = df2.fillna(1.0)
# Split the function lists into separate columns for easier handling
function_split = df1.functions.apply(pd.Series)
function_split = function_split.dropna(how='all')
df1 = df1.join(function_split)
df1_col = ['var_id', 'functions', 'num_functions', 'influence', 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
df1 = df1.reindex(columns=df1_col)
## Remove NaN
df1 = df1.fillna('')
df1
var_id functions num_functions influence 0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 AA_ABC006 [1, 2, 5, 11, 12, 16, 17] 7 1.429 1 2 5 7 9
1 AA_ABC006 [10, 11] 2 5.000 4 8
2 AA_ABC006 [11, 19] 2 5.000 1 2
3 AA_ABC006 [19] 1 10.00 9
4 AA_ABC005 [11] 1 10.00 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1964 XX_ABC004 [2, 11, 20] 3 3.333 2 11 20
1965 XX_ABC003 [19] 1 10.000 19
1966 XX_ABC003 [2, 11, 20] 3 3.333 2 11 20
1967 XX_ABC004 [2, 11, 20] 3 3.333 2 11 20
1968 XX_ABC003 [2, 11, 20] 3 3.333 2 11 20
df2 (where header numbers equal function numbers from df1)
0 1 2 3 4 5 6 7 8 9 ... 15 16 17 18 19 20 21 22 23 24
AA_ABC006 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
AA_ABC005 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
XX_ABC004 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
XX_ABC003 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
What I would like to achieve is a function that iterates through both DFs and checks for function matches. E.g., for df1 var_id AA_ABC006 with functions 1, 2, 5, etc., it should subtract 1.429 percent from the 1.0 in df2 columns 1, 2, 5, etc., and apply this over all rows/columns.
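To make the per-cell arithmetic concrete (this interprets influence as a percentage, which is an assumption the expected output below seems to confirm): for AA_ABC006, the rows with functions [11, 19] and [19] both contain function 19, so their influences stack in that column.

```python
# hand-check of one cell: var_id AA_ABC006, function column 19
# influences 5.00 (from row [11, 19]) and 10.00 (from row [19]) both apply
influences = [5.00, 10.00]
cell = 1.0 - sum(influences) / 100  # subtract the summed percentage from 1.0
print(round(cell, 2))  # 0.85
```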
desired result for this example would be
DF_result
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | … | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AA_ABC006 | 1,00 | 0,99 | 0,99 | 1,00 | 1,00 | 0,99 | 1,00 | 1,00 | 1,00 | 1,00 | … | 1,00 | 0,99 | 0,99 | 1,00 | 0,85 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 |
| AA_ABC005 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | … | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| XX_ABC004 | 1,00 | 1,00 | 0,93 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | … | 1,00 | 1,00 | 1,00 | 1,00 | 0,95 | 0,93 | 1,00 | 1,00 | 1,00 | 1,00 |
| XX_ABC003 | 1,00 | 1,00 | 0,93 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | … | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 0,93 | 1,00 | 1,00 | 1,00 | 1,00 |
CodePudding user response:
`df2` values for `var_id` should be unique, IIUC. I changed the constructor for this df (`var_id` ends up being the index).
You can build a dataframe of the values to be subtracted out of the `df1` rows using `concat`:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(columns=['var_id', 'functions', 'num_functions', 'influence'])
influence_factor = 10
# df1 construct
df1['functions'] = [[1, 2, 5, 11, 12, 16, 17],[10, 11],[11, 19],[19],[11],[2],[11, 19],[19],[19],[11, 19],[11, 19]]
df1['var_id'] = ['AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC005','AA_ABC005','AA_ABC005','AA_ABC005','AA_ABC004','AA_ABC004','AA_ABC003']
df1['num_functions'] = df1.functions.map(len)
df1['influence'] = (influence_factor / df1['num_functions']).round(decimals=2)
# df2 construct
ids = df1['var_id'].unique()
df2 = pd.DataFrame(np.ones((len(ids), 25)), index=ids)
# one single-row frame per df1 row: the row's influence repeated under its function columns
df_tmp = pd.concat([
    pd.DataFrame([row['influence'] * np.ones(row['num_functions'])],
                 columns=row['functions'], index=[row['var_id']])
    for _, row in df1.iterrows()
])
# sum the influences per var_id, convert percent to fraction, subtract from df2;
# fillna(1.0) restores the columns that no function list touched
print((df2 - df_tmp.groupby(level=0).sum() / 100).fillna(1.0).round(decimals=2))
Output:
0 1 2 3 4 5 6 7 8 9 ... 15 16 17 18 19 20 21 22 23 24
AA_ABC003 1.0 1.00 1.00 1.0 1.0 1.00 1.0 1.0 1.0 1.0 ... 1.0 1.00 1.00 1.0 0.95 1.0 1.0 1.0 1.0 1.0
AA_ABC004 1.0 1.00 1.00 1.0 1.0 1.00 1.0 1.0 1.0 1.0 ... 1.0 1.00 1.00 1.0 0.85 1.0 1.0 1.0 1.0 1.0
AA_ABC005 1.0 1.00 0.90 1.0 1.0 1.00 1.0 1.0 1.0 1.0 ... 1.0 1.00 1.00 1.0 0.85 1.0 1.0 1.0 1.0 1.0
AA_ABC006 1.0 0.99 0.99 1.0 1.0 0.99 1.0 1.0 1.0 1.0 ... 1.0 0.99 0.99 1.0 0.85 1.0 1.0 1.0 1.0 1.0
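An alternative sketch that avoids `iterrows` entirely (under the same assumptions: influence is a percentage, `var_id` is unique in `df2`) is to `explode` the function lists into long form and `pivot_table` them back into a `var_id` × function grid of summed influences:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    'var_id': ['AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC005','AA_ABC005',
               'AA_ABC005','AA_ABC005','AA_ABC004','AA_ABC004','AA_ABC003'],
    'functions': [[1, 2, 5, 11, 12, 16, 17],[10, 11],[11, 19],[19],[11],[2],
                  [11, 19],[19],[19],[11, 19],[11, 19]],
})
df1['num_functions'] = df1['functions'].map(len)
df1['influence'] = (10 / df1['num_functions']).round(2)

ids = df1['var_id'].unique()
df2 = pd.DataFrame(np.ones((len(ids), 25)), index=ids)

# one function per row, then sum the influences per (var_id, function) cell
long = df1.explode('functions')
long['functions'] = long['functions'].astype(int)
sub = long.pivot_table(index='var_id', columns='functions',
                       values='influence', aggfunc='sum')

# align to df2's full shape, treat missing combinations as zero influence
sub = sub.reindex(index=df2.index, columns=df2.columns, fill_value=0).fillna(0)
result = (df2 - sub / 100).round(2)
print(result)
```

This produces the same grid as the `concat` approach; whether it is faster will depend on the data, but it keeps the whole computation in aligned DataFrame operations.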