Here is the cars data frame:
df = pd.DataFrame({ 'car' : ['brand1_2020', 'brand2', 'brand3', 'brand1_2018', 'brand4'],
'score' : [5.2, 4.9, 5.0, 5.1, 4.1]})
I have to select top 3 cars with unique brands (for example, there are two columns for brand1: brand1_2020 and brand1_2018.. I need to select only one from brand1 with the highest score).
cars_filtered = df.iloc[np.argsort(np.abs(df['score']))[-3:]].car.tolist()
This produces: ['brand3', 'brand1_2018', 'brand1_2020'] but we need to get ['brand2', 'brand3','brand1_2020'] .
How do we do that? :) Thanks a lot!
CodePudding user response:
I would consider splitting the strings in the 'cars'
column and creating a new column called brand:
df['brand'] = [x[0] for x in df['car'].str.split("_")]
This will allow you to treat 'brand1_2021'
and 'brand1_2018'
as the same 'brand1'
category.
>>> df
car score brand
0 brand1_2020 5.2 brand1
1 brand2 4.9 brand2
2 brand3 5.0 brand3
3 brand1_2018 5.1 brand1
4 brand4 4.1 brand4
Then you can sort the DataFrame by score in order from largest to smallest, drop any repeated entries under the 'brand'
column, and select the top three values in the original 'car'
column:
df.sort_values(by="score", ascending=False).drop_duplicates(subset="brand").car[0:3].to_list()
Output:
['brand1_2020', 'brand3', 'brand2']
CodePudding user response:
import pandas as pd
df = pd.DataFrame({
'car': ['brand1_2020', 'brand2', 'brand3', 'brand1_2018', 'brand4'],
'score': [5.2, 4.9, 5.0, 5.1, 4.1]
})
score_list = df.score.to_list()
cars_list = []
for score in score_list[0:3]:
car_string = ''.join(df[df['score'] == score]['car'].to_list())
cars_list.append(car_string)
print(cars_list)
CodePudding user response:
Group by on the prefixes (brand1
, ...), get the booleans, and finally keep the top three:
bools = df.car.str.split('_').str[0]
bools = df.groupby(bools).score.transform('max')
df.loc[df.score == bools].nlargest(3, 'score').car.tolist()
['brand1_2020', 'brand3', 'brand2']