So I have a dataframe (D1) that looks like this:
nID | name |
---|---|
n1 | Sarah |
n2 | John |
and I have another dataframe (D2) which looks like this:
tID | writers | directors |
---|---|---|
t1 | n1,n4 | n2,n3 |
t2 | n4 | n3 |
t3 | n1 | n2 |
An nID from D1 can appear in either the writers or the directors column of D2. For each nID, I want to get all the tIDs where that nID appears in either column and store the list of tIDs in a new column in D1.
So the final D1 would look like this:
nID | name | tIDs |
---|---|---|
n1 | Sarah | t1,t3 |
n2 | John | t1,t3 |
What's the most efficient way to do this using Pandas, as I have about 30k rows in D1 and 800k rows in D2? Is it better to split the column into multiple columns and do a merge?
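For reference, a minimal construction of the two example frames above (values copied from the tables) could look like this:

import pandas as pd

D1 = pd.DataFrame({'nID': ['n1', 'n2'], 'name': ['Sarah', 'John']})
D2 = pd.DataFrame({
    'tID': ['t1', 't2', 't3'],
    'writers': ['n1,n4', 'n4', 'n1'],
    'directors': ['n2,n3', 'n3', 'n2'],
})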
CodePudding user response:
We can split the writers and directors strings into lists, melt them into a single column, explode the lists, then groupby the nID values and collect the matching tIDs:
# split the comma-separated strings into lists
df2['writers'] = df2['writers'].str.split(',')
df2['directors'] = df2['directors'].str.split(',')
# melt writers/directors into one column, explode the lists,
# then collect all matching tIDs per nID
s = df2.melt('tID').explode('value').groupby('value')['tID'].agg(','.join)
df1['new'] = df1['nID'].map(s)
df1
Out[221]:
nID name new
0 n1 Sarah t1,t3
1 n2 John t1,t3
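If you'd rather store an actual list of tIDs per person instead of a comma-joined string (for instance if tID is not string-typed in your real data), the same pipeline works with agg(list) instead of ','.join — a small variation, with the column name tIDs chosen here to match the desired output:

s = df2.melt('tID').explode('value').groupby('value')['tID'].agg(list)
df1['tIDs'] = df1['nID'].map(s)  # one list of tIDs per nID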
CodePudding user response:
Well, here is a way to do this:
# combine writers and directors per title, explode to one nID per row,
# collect the tIDs per nID as a set, then merge back onto d1
d1.merge(
    d2.assign(nID=(d2['writers'] + d2['directors']).apply(set))
      .explode('nID').groupby('nID')['tID'].agg(set),
    on='nID', how='left')
This assumes the columns 'writers' and 'directors' already contain lists rather than comma-separated strings.
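If they are still comma-separated strings, as in the question, they can be split into lists first, for example:

d2['writers'] = d2['writers'].str.split(',')
d2['directors'] = d2['directors'].str.split(',')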
Speed:
import numpy as np
import pandas as pd

np.random.seed(0)
# synthetic data at roughly 1/10 the size described in the question
n1 = 3_000
n2 = 80_000
d1 = pd.DataFrame({
    'nID': np.arange(n1),
    'name': [f'name_{i:06d}' for i in range(n1)],
})
d2 = pd.DataFrame({
    'tID': np.arange(n2),
    'writers': [
        list(np.random.choice(d1.nID, k, replace=False))
        for k in np.random.randint(1, min(4, n1) + 1, n2)
    ],
    'directors': [
        list(np.random.choice(d1.nID, k, replace=False))
        for k in np.random.randint(1, min(4, n1) + 1, n2)
    ]
})
%timeit d1.merge(d2.assign(nID=(d2['writers'] + d2['directors']).apply(set)).explode('nID').groupby('nID')['tID'].agg(set), on='nID', how='left')
384 ms ± 697 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
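For comparison, the melt/explode approach from the first answer could be run on the same synthetic data; since tID is an integer here, aggregating to a list sidesteps the ','.join step (just a sketch, not timed here):

# melt writers/directors into one column, explode the lists of nIDs,
# then collect all tIDs per nID and map back onto d1
s = d2.melt('tID').explode('value').groupby('value')['tID'].agg(list)
d1['new'] = d1['nID'].map(s)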