More efficient way to search through Pandas groups-CodePudding

I want to get the list of values from col2 that belong to the same groupId, given corresponding value in col1. Col1 values can belong to multiple groups and in that case only top-most group should be considered (group 2 but not group 3 in my example). Col1 values are always identical within the same groupId.

groupId	col1	col2
2	a	10
1	b	20
2	a	30
1	b	40
3	a	50
3	a	60
1	b	70

My current solution takes over 30s for a df with 2000 rows and 32 values to search for in col1 ('a' in this case):

group_id_groups = df.groupby('groupId') 

for group_id, group in group_id_groups:
    col2_values = list(group[group['col1'] == 'a']['col2'])
    if col2_values:
        print(col2_values)
        break


result: [10, 30]

CodePudding user response：

The sort parameter of groupby defaults to true, which means the first group will be the topmost by default. You can change the col_to_search to b and get the other answer.

import pandas as pd
df = pd.DataFrame({'groupId': [2, 1, 2, 1, 3, 3, 1],
 'col1': ['a', 'b', 'a', 'b', 'a', 'a', 'b'],
 'col2': [10, 20, 30, 40, 50, 60, 70]})

col_to_search = 'a'
(
    df.loc[df['col1'].eq(col_to_search)]
    .groupby('groupId')['col2']
    .apply(list)
    .iloc[0]
)

Output

[10, 30]

CodePudding user response：

I am still not sure what you want. Does this help you? I am sure that pandas.DataFrame.groupby() is your friend here.

Full code

#!/usr/bin/env python3
import pandas as pd

# initial data
df = pd.DataFrame({
    'groupId': [2, 1, 2, 1, 3, 3, 1],
    'col1': list('ababaab'),
    'col2': range(10, 80, 10)
})
print(df)

g = df.groupby(['groupId', 'col1']).agg(list)
print(g)

result = g.loc[(2, 'a')]
print(result)

Step by step

Your initial data in df looks like this

    groupId col1  col2
 0        2    a    10
 1        1    b    20
 2        2    a    30
 3        1    b    40
 4        3    a    50
 5        3    a    60
 6        1    b    70

Then you simply group your data by your two "search columns". The result per group is stored as a list.

g = df.groupby(['groupId', 'col1']).agg(list)

The result:

                      col2
groupId col1              
1       b     [20, 40, 70]
2       a         [10, 30]
3       a         [50, 60]

No you can do your search:

result = g.loc[(2, 'a')]

That gives you

col2    [10, 30]
Name: (2, a), dtype: object

CodePudding user response：

It seems to me that you mostly need to create a mask without using a groupby.

import pandas as pd

# data
data = {'groupId': {0: '2', 1: '1', 2: '2', 3: '1', 4: '3', 5: '3', 6: '1'},
 'col1': {0: 'a', 1: 'b', 2: 'a', 3: 'b', 4: 'a', 5: 'a', 6: 'b'},
 'col2': {0: 10, 1: 20, 2: 30, 3: 40, 4: 50, 5: 60, 6: 70}}

df = pd.DataFrame(data)

# First group where condition is satisfied
first_group = df[df["col1"].eq("a")].iloc[0]["groupId"]

# Output
df[df["col1"].eq("a") &
  df["groupId"].eq(first_group)]["col2"].to_list()

And the output is [10, 30] as expected.

CodePudding user response：

You can use pandas.groupby with agg(list) then search what you want with .loc and return the first find.

>>> grp = df.groupby(['groupId', 'col1']).agg(list).reset_index()

>>> grp.loc[grp['col1'].eq('a'), 'col2'].to_list()[0]
[10, 30]

>>> grp.loc[grp['col1'].eq('a'), 'col2']
1    [10, 30]
2    [50, 60]
Name: col2, dtype: object