Home > Net >  How to extract group list element using pandas criteria
How to extract group list element using pandas criteria

Time:03-20

I have a pandas dataframe like as shown below

ID,color
1, Yellow
1, Red
1, Green
2, Red
2, np.nan
3, Green
3, Red
3, Green
4, Yellow
4, Red
5, Green
5, np.nan
6, Red
7, Red

fd = pd.read_clipboard(sep=',')

As you can see in the input dataframe, some ID's have multiple colors associated to them.

So, whenever there is multiple color associated to them, I would like to select only one color based on the below criteria

['Green','Red','Yellow'] = Choose 'Green'
['Red', 'Yellow'] = Choose 'Yellow'
['Green', 'Yellow'] = Choose 'Green'

Basically, Green is given 1st preference. 2nd preference is for Yellow and last preference is for Red.

So, if an ID whenever has Green, choose Green (don't care about other colors).

If an ID whenever has Yellow and Red, choose Yellow

If an ID for its all rows has only NA, leave it as NA

I tried the below but this only gets me the list of color

fd.groupby('ID',as_index=False)['color'].aggregate(lambda x: list(x))
fd[final_color] = [if i[0] =='Green' for i in fd[col]]

I expect my output to be like as shown below

enter image description here

CodePudding user response:

Sort the values of dataframe on color with the help of the preference dictionary, then drop the duplicates on ID

d = {'Green': 1, 'Yellow': 2, 'Red': 3}
df.sort_values('color', key=lambda c: c.map(d)).drop_duplicates('ID')

Alternative approach by first converting the color column to ordered categorical type, then groupby and aggregate to select the min value

df['color'] = pd.Categorical(df['color'], ['Green', 'Yellow', 'Red'], True)
df.groupby('ID', as_index=False)['color'].agg('min')

   ID   color
0   1   Green
1   2     Red
2   3   Green
3   4  Yellow
4   5   Green
5   6     Red
6   7     Red

CodePudding user response:

One way to solve this is to implement a custom sorting:

sort_preference = {
    'Green': 0,
    'Yellow': 1,
}

(
    fd
    .sort_values(by=['color'], key=lambda x: x.map(sort_preference))
    .groupby('ID')
    .head(1)
)

CodePudding user response:

Without sorting, you can use idxmin if you map your color with numeric values:

d = {'Green': 1, 'Yellow': 2, 'Red': 3}
out = df.loc[df.assign(num=df['color'].map(d)).groupby('ID')['num'].idxmin()]
print(out)

# Output
    ID   color
2    1   Green
3    2     Red
5    3   Green
8    4  Yellow
10   5   Green
12   6     Red
13   7     Red
  • Related