I have a table (df1) with a list of values (neig_list, which is a python list) in each row
ID | neig_list |
---|---|
1 | a, b, d |
2 | b, e, f, g, h |
3 | b, a, j, k |
And a table (df2) with entries for those values
neig | samples | samples_indicator |
---|---|---|
'a' | 3 | 0.5 |
'a' | 5 | 0.1 |
'b' | 1 | 0.2 |
'c' | 15 | 0.5 |
'd' | 12 | 0.3 |
'a' | 2 | 1 |
'e' | 5 | 0.6 |
'f' | 6 | 0 |
'h' | 6 | 0.5 |
I need to add a column to df1 getting, for each row, the result for the sum of samples x samples_indicator for all neigs that are contained in the neig_list for that row.
For example, for the first row, we would have:
3*0.5 5*0.1 1*0.2 12*0.3 2*1 = 7.8
ID | neig_list | new_column |
---|---|---|
1 | a, b, d | 7.8 |
2 | b, e, f, g, h | value |
3 | b, a, j, k | value |
Actually, the function is more complicated than that (involves more columns), so ideally I'd like to have a separate function and then apply it to df1, based on df2.
CodePudding user response:
Calculate first your math in df2
:
map_ = df2.assign(neig = df2['neig'].str.strip("'"),
calculated = lambda df: df['samples'] * df['samples_indicator'])\
.groupby('neig')['calculated'].sum()
Then, explode
your first df, and map the values above for 'a'
, 'b'
etc with the calculated formula. Finally, groupby
and sum
:
df['new_column'] = df['neig_list'].str.split(', ').explode()\
.map(map_)\
.groupby(level=0)\
.sum()
ID neig_list new_column
0 1 a, b, d 7.8
1 2 b, e, f, g, h 6.2
2 3 b, a, j, k 4.2
CodePudding user response:
You can just define a function that performs the calculations for a given list of neig
s using df2
and then just apply it to neig_list
in df1
:
def result(row):
return sum([df2['samples'][item]*df2['samples_indicator'][item] for item in row])
df1['new_column'] = df1['neig_list'].apply(result)
Note that this requires neig
to be the index in df2. If it's not, you can do df2.set_index('neig', inplace=True)
or, if you don't want to modify d2:
def result(row):
return sum([df2.set_index('neig')['samples'][item]*df2.set_index('neig')['samples_indicator'][item] for item in row])
and apply it the same way as before.
CodePudding user response:
Here's a way to do what your question asks:
def foo(df1, df2):
return (df1
.join(df1.assign(neig=df1.neig_list).explode('neig')
.join(
df2.assign(new_column=df2.samples * df2.samples_indicator)[['neig','new_column']].groupby('neig').sum(),
on='neig')
.drop(columns=['neig','neig_list']).groupby('ID').sum(),
on='ID')
)
print(foo(df1, df2))
Output:
ID neig_list new_column
0 1 [a, b, d] 7.8
1 2 [b, e, f, g, h] 6.2
2 3 [b, a, j, k] 4.2
Explanation:
- use
assign()
to addnew_column
as a column todf2
which, usinggroupby()
andsum()
, gets populated with the dot-product ofsamples
andsamples_indicator
for the rows in eachneig
group - use
assign()
to clone theneig_list
column ofdf1
asneig
andexplode()
to expand each row to one row per item in theneig
column - use
join()
on the above two DataFrame objects to put sample results fromnew_column
into each row based on itsneig
value - use
join()
again with the above DataFrame object (after dropping theneig
andneig_list
columns) to add the desired column to the originaldf1
.