Most efficient way to iterate rows across 2 dataframes checking for a condition


So I have a dataframe (D1) that looks like this:

nID name
n1 Sarah
n2 John

and I have another dataframe (D2) which looks like

tID writers directors
t1 n1,n4 n2,n3
t2 n4 n3
t3 n1 n2

An nID from D1 can appear in either the writers or the directors column of D2. For each nID, I want to collect all the tIDs where that nID appears in either column and store the list of tIDs in a new column in D1.

So the final D1 would look like this:

nID name tIDs
n1 Sarah t1,t3
n2 John t1,t3

What's the most efficient way to do this using Pandas, as I have about 30k rows in D1 and 800k rows in D2? Is it better to split the column into multiple columns and do a merge?
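For reference, the sample frames above can be built like this (assuming writers and directors hold comma-separated strings, and using the names df1/df2 as in the first answer below):

import pandas as pd

df1 = pd.DataFrame({'nID': ['n1', 'n2'], 'name': ['Sarah', 'John']})
df2 = pd.DataFrame({
    'tID': ['t1', 't2', 't3'],
    'writers': ['n1,n4', 'n4', 'n1'],
    'directors': ['n2,n3', 'n3', 'n2'],
})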

CodePudding user response:

We can split the columns into lists, then melt, explode, and groupby the nID values in df2 to collect the tIDs:

# split the comma-separated strings into lists
df2['writers'] = df2['writers'].str.split(',')
df2['directors'] = df2['directors'].str.split(',')
# melt writers/directors into one 'value' column, explode the lists,
# then collect the tIDs for each nID as a comma-joined string
s = df2.melt('tID').explode('value').groupby('value')['tID'].agg(','.join)
# map each nID in df1 to its tID string
df1['new'] = df1['nID'].map(s)
df1
Out[221]: 
  nID   name    new
0  n1  Sarah  t1,t3
1  n2   John  t1,t3
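Any nID that never appears in df2 will come back as NaN from .map; if you prefer an empty string there, you can add:

df1['new'] = df1['new'].fillna('')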

CodePudding user response:

Well, here is a way to do this:

d1.merge(
    d2.assign(nID=(d2['writers'] + d2['directors']).apply(set))
    .explode('nID').groupby('nID')['tID'].agg(set),
    on='nID', how='left')

This assumes the 'writers' and 'directors' columns are already of type list.
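If they are still comma-separated strings, convert them first, the same way as in the previous answer:

d2['writers'] = d2['writers'].str.split(',')
d2['directors'] = d2['directors'].str.split(',')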

Speed:

import numpy as np
import pandas as pd

np.random.seed(0)

n1 = 3_000
n2 = 80_000
d1 = pd.DataFrame({
    'nID': np.arange(n1),
    'name': [f'name_{i:06d}' for i in range(n1)],
})
d2 = pd.DataFrame({
    'tID': np.arange(n2),
    'writers': [
        list(np.random.choice(d1.nID, k, replace=False))
        for k in np.random.randint(1, min(4, n1) + 1, n2)
    ],
    'directors': [
        list(np.random.choice(d1.nID, k, replace=False))
        for k in np.random.randint(1, min(4, n1) + 1, n2)
    ]
})

%timeit d1.merge(d2.assign(nID=(d2['writers'] + d2['directors']).apply(set)).explode('nID').groupby('nID')['tID'].agg(set), on='nID', how='left')
384 ms ± 697 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)