So I have a dataframe (D1) that looks like this:
nID | name |
---|---|
n1 | Sarah |
n2 | John |
and I have another dataframe (D2) which looks like this:
tID | writers | directors |
---|---|---|
t1 | n1,n4 | n2,n3 |
t2 | n4 | n3 |
t3 | n1 | n2 |
An nID from D1 can appear in either the writers or the directors column of D2. For each nID, I want to get all the tIDs where that nID appears in either column and store the list of tIDs in a new column in D1.
So the final D1 would look like this:
nID | name | tIDs |
---|---|---|
n1 | Sarah | t1,t3 |
n2 | John | t1,t3 |
What's the most efficient way to do this using Pandas, as I have about 30k rows in D1 and 800k rows in D2? Is it better to split the column into multiple columns and do a merge?
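For reference, a minimal construction of the two example frames above (values copied from the tables) could look like this:

import pandas as pd

D1 = pd.DataFrame({'nID': ['n1', 'n2'], 'name': ['Sarah', 'John']})
D2 = pd.DataFrame({
    'tID': ['t1', 't2', 't3'],
    'writers': ['n1,n4', 'n4', 'n1'],
    'directors': ['n2,n3', 'n3', 'n2'],
})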
CodePudding user response:
We can split the writers and directors strings into lists, melt them into a single column, explode the lists, then groupby the nID values and collect the matching tIDs:
# split the comma-separated strings into lists
df2['writers'] = df2['writers'].str.split(',')
df2['directors'] = df2['directors'].str.split(',')
# melt writers/directors into one column, explode the lists,
# then collect all matching tIDs per nID
s = df2.melt('tID').explode('value').groupby('value')['tID'].agg(','.join)
df1['new'] = df1['nID'].map(s)
df1
Out[221]:
nID name new
0 n1 Sarah t1,t3
1 n2 John t1,t3
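If you'd rather store an actual list of tIDs per person instead of a comma-joined string (for instance if tID is not string-typed in your real data), the same pipeline works with agg(list) instead of ','.join — a small variation, with the column name tIDs chosen here to match the desired output:

s = df2.melt('tID').explode('value').groupby('value')['tID'].agg(list)
df1['tIDs'] = df1['nID'].map(s)  # one list of tIDs per nID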
CodePudding user response:
Well, here is a way to do this:
# combine writers and directors per title, explode to one nID per row,
# collect the tIDs per nID as a set, then merge back onto d1
d1.merge(
    d2.assign(nID=(d2['writers'] + d2['directors']).apply(set))
      .explode('nID').groupby('nID')['tID'].agg(set),
    on='nID', how='left')
This assumes the columns 'writers' and 'directors' already contain lists rather than comma-separated strings.
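If they are still comma-separated strings, as in the question, they can be split into lists first, for example:

d2['writers'] = d2['writers'].str.split(',')
d2['directors'] = d2['directors'].str.split(',')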
Speed:
import numpy as np
import pandas as pd

np.random.seed(0)
# synthetic data at roughly 1/10 the size described in the question
n1 = 3_000
n2 = 80_000
d1 = pd.DataFrame({
    'nID': np.arange(n1),
    'name': [f'name_{i:06d}' for i in range(n1)],
})
d2 = pd.DataFrame({
    'tID': np.arange(n2),
    'writers': [
        list(np.random.choice(d1.nID, k, replace=False))
        for k in np.random.randint(1, min(4, n1) + 1, n2)
    ],
    'directors': [
        list(np.random.choice(d1.nID, k, replace=False))
        for k in np.random.randint(1, min(4, n1) + 1, n2)
    ]
})
%timeit d1.merge(d2.assign(nID=(d2['writers'] + d2['directors']).apply(set)).explode('nID').groupby('nID')['tID'].agg(set), on='nID', how='left')
384 ms ± 697 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
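For comparison, the melt/explode approach from the first answer could be run on the same synthetic data; since tID is an integer here, aggregating to a list sidesteps the ','.join step (just a sketch, not timed here):

# melt writers/directors into one column, explode the lists of nIDs,
# then collect all tIDs per nID and map back onto d1
s = d2.melt('tID').explode('value').groupby('value')['tID'].agg(list)
d1['new'] = d1['nID'].map(s)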