Home > Software engineering >  Comparing Data in Pandas
Comparing Data in Pandas

Time:03-21

I am just trying to get some data and re-arrange it. Here is my dataset showing foods and the scores they received in different years. What I want to do is find the foods which had the lowest and highest scores on average and track their scores across the years.

enter image description here

Here is my code so far:

#import
menu = pd.read_csv("Food.csv")

#Get average rank
avg_ranking = menu.groupby(['Food']).Score.mean().round()

#Convert to DataFrame
data2 = pd.DataFrame(avg_ranking.reset_index(name = "Group"))

#Display the highest and lowest ranking 
filter_food = data2[['Food', 'Group']][(data2['Group'] == avg_ranking.max()) | (data2['Group'] == avg_ranking.min())]

#This is where I have been playing around with getting the individual min and max food names:
filter_food_max = data2[['Food']][(data2['Group'] == data2.Group.max())]
Max = filter_food_max.Food
filter_food_min = data2[['Food']][(data2['Group'] == data2.Group.min())]
Min = filter_food_min.Food

The next part is where I am a little stuck (please note - I do want to practice using the group-by, so it'll be super awesome if that is kept somewhere): I'd need to display the max and min foods from the original dataset that would show all the columns - Food, year, Score. This is what I have tried, but it doesn't work:

menu[menu.Food == Max & menu.Food == Min]

Basically I want it to display something like the below in a dataframe, so I can plot some graphs (i.e. I want to then make a line plot which would display the years on the x-axis, scores on the y-axis and plot the lowest scoring food and the top scoring food:

enter image description here

If you guys know any other ways of doing this (using the group-by function), please let me know!

Any help would be appreciated

CodePudding user response:

You can select first and last rows per year by Series.duplicated with invert mask and chain by | for bitwise OR, filter in boolean indexing:

df1 = df[~df['year'].duplicated() | ~df['year'].duplicated(keep='last')]

Solution with groupby:

df1 = df.groupby('year').agg(['first','last']).stack(1).droplevel(1).reset_index()

If need minimal and maximal per years:

df = df.sort_values(['year','food'])
df2 = df[~df['year'].duplicated() | ~df['year'].duplicated(keep='last')]

Solution with groupby:

df2 = df.loc[df.groupby('year')['Score'].agg(['idxmax','idxmin']).stack()]              
  • Related