Fastest way to get all first-matched rows given a sequence of column values in Pandas


Say I have a Pandas dataframe with 10 rows and 2 columns.

import pandas as pd
df = pd.DataFrame({'col1': [1,2,4,3,1,3,1,5,1,4],
                   'col2': [.9,.7,.1,.3,.2,.4,.8,.2,.3,.5]})
df

   col1  col2
0     1   0.9
1     2   0.7
2     4   0.1
3     3   0.3
4     1   0.2
5     3   0.4
6     1   0.8
7     5   0.2
8     1   0.3
9     4   0.5

Now suppose I am given a sequence of 'col1' values as a numpy array:

import numpy as np
nums = np.array([3,1,2])

I want to find the rows containing the first occurrence of 3, 1 and 2 in 'col1', and then get the corresponding 'col2' values in that order. Right now I am using a list comprehension:

res = np.array([df[df['col1']==n].reset_index(drop=True).at[0,'col2'] for n in nums])
res

array([0.3, 0.9, 0.7])

This works for small dataframes, but it becomes the bottleneck of my code: I have a huge dataframe of more than 30,000 rows, and I am often given long sequences of column values (> 3,000). Is there a more efficient way to do this?

CodePudding user response:

Option 1

This is perhaps faster than what I suggested earlier (option 2, below):

df.groupby('col1').first().reindex(nums)

      col2
col1      
3      0.3
1      0.9
2      0.7
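
If you need the result back as a numpy array in the order of nums (as in the question), you can pull the column out afterwards; a minimal sketch:

res = df.groupby('col1').first().reindex(nums)['col2'].to_numpy()
res

array([0.3, 0.9, 0.7])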

Option 2

  • First get the matches for col1 by using Series.isin and select from the df based on the mask.
  • Now, apply df.groupby and get the first non-null entry for each group.
  • Finally, apply df.reindex to reorder the result to match nums.
df[df['col1'].isin(nums)].groupby('col1').first().reindex(nums)

      col2
col1      
3      0.3
1      0.9
2      0.7

If a value cannot be found, you'll end up with a NaN. E.g.

df.iloc[1,0] = 6 # there's no '2' in `col1` anymore
df[df['col1'].isin(nums)].groupby('col1').first().reindex(nums)

      col2
col1      
3      0.3
1      0.9
2      NaN
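
If you'd rather not keep the NaN rows, you can post-process the reindexed result; a minimal sketch (the fill value 0.0 is an arbitrary placeholder, not something from the original answer):

out = df[df['col1'].isin(nums)].groupby('col1').first().reindex(nums)
out['col2'].dropna()     # keep only the values that were found
out['col2'].fillna(0.0)  # or substitute a placeholder for missing values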

CodePudding user response:

Alternatively, a solution (which might not be optimal) can be written as follows:

# assuming df1 holds the lookup values as a one-column frame
# (df1 was not defined in the original answer)
df1 = pd.DataFrame({'col1': nums})

(df.drop_duplicates(subset='col1', keep='first')
   .merge(df1, how='inner', on='col1')
   .set_index('col1')
   .reindex(nums))

    col2
col1    
3   0.3
1   0.9
2   0.7
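
Since the question is ultimately about speed, it's worth timing the candidates on your own data; below is a minimal benchmarking sketch using the standard timeit module (the function names are illustrative, and the relative results will depend on the dataframe size and the length of nums):

import timeit
import numpy as np

def list_comp(df, nums):
    # the list comprehension from the question
    return np.array([df[df['col1'] == n].reset_index(drop=True).at[0, 'col2'] for n in nums])

def groupby_reindex(df, nums):
    # option 1 from the first answer
    return df.groupby('col1').first().reindex(nums)['col2'].to_numpy()

for fn in (list_comp, groupby_reindex):
    print(fn.__name__, timeit.timeit(lambda: fn(df, nums), number=100))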