I want to get the list of values from col2
that belong to the same groupId
, given corresponding value in col1
. Col1
values can belong to multiple groups and in that case only top-most group should be considered (group 2 but not group 3 in my example). Col1
values are always identical within the same groupId
.
groupId | col1 | col2 |
---|---|---|
2 | a | 10 |
1 | b | 20 |
2 | a | 30 |
1 | b | 40 |
3 | a | 50 |
3 | a | 60 |
1 | b | 70 |
My current solution takes over 30s for a df
with 2000 rows and 32 values to search for in col1
('a' in this case):
group_id_groups = df.groupby('groupId')
for group_id, group in group_id_groups:
col2_values = list(group[group['col1'] == 'a']['col2'])
if col2_values:
print(col2_values)
break
result: [10, 30]
CodePudding user response:
The sort
parameter of groupby defaults to true, which means the first group will be the topmost by default. You can change the col_to_search
to b
and get the other answer.
import pandas as pd
df = pd.DataFrame({'groupId': [2, 1, 2, 1, 3, 3, 1],
'col1': ['a', 'b', 'a', 'b', 'a', 'a', 'b'],
'col2': [10, 20, 30, 40, 50, 60, 70]})
col_to_search = 'a'
(
df.loc[df['col1'].eq(col_to_search)]
.groupby('groupId')['col2']
.apply(list)
.iloc[0]
)
Output
[10, 30]
CodePudding user response:
I am still not sure what you want. Does this help you? I am sure that pandas.DataFrame.groupby()
is your friend here.
Full code
#!/usr/bin/env python3
import pandas as pd
# initial data
df = pd.DataFrame({
'groupId': [2, 1, 2, 1, 3, 3, 1],
'col1': list('ababaab'),
'col2': range(10, 80, 10)
})
print(df)
g = df.groupby(['groupId', 'col1']).agg(list)
print(g)
result = g.loc[(2, 'a')]
print(result)
Step by step
Your initial data in df
looks like this
groupId col1 col2
0 2 a 10
1 1 b 20
2 2 a 30
3 1 b 40
4 3 a 50
5 3 a 60
6 1 b 70
Then you simply group your data by your two "search columns". The result per group is stored as a list
.
g = df.groupby(['groupId', 'col1']).agg(list)
The result:
col2
groupId col1
1 b [20, 40, 70]
2 a [10, 30]
3 a [50, 60]
No you can do your search:
result = g.loc[(2, 'a')]
That gives you
col2 [10, 30]
Name: (2, a), dtype: object
CodePudding user response:
It seems to me that you mostly need to create a mask without using a groupby.
import pandas as pd
# data
data = {'groupId': {0: '2', 1: '1', 2: '2', 3: '1', 4: '3', 5: '3', 6: '1'},
'col1': {0: 'a', 1: 'b', 2: 'a', 3: 'b', 4: 'a', 5: 'a', 6: 'b'},
'col2': {0: 10, 1: 20, 2: 30, 3: 40, 4: 50, 5: 60, 6: 70}}
df = pd.DataFrame(data)
# First group where condition is satisfied
first_group = df[df["col1"].eq("a")].iloc[0]["groupId"]
# Output
df[df["col1"].eq("a") &
df["groupId"].eq(first_group)]["col2"].to_list()
And the output is [10, 30]
as expected.
CodePudding user response:
You can use pandas.groupby
with agg(list)
then search what you want with .loc
and return the first find.
>>> grp = df.groupby(['groupId', 'col1']).agg(list).reset_index()
>>> grp.loc[grp['col1'].eq('a'), 'col2'].to_list()[0]
[10, 30]
>>> grp.loc[grp['col1'].eq('a'), 'col2']
1 [10, 30]
2 [50, 60]
Name: col2, dtype: object