Home > Software design >  Ordering Pairs of Data by date - Pandas
Ordering Pairs of Data by date - Pandas

Time:07-12

I am somewhat new to coding in Pandas and I have what I think to be a simple problem that I can't find an answer to. I have a list of students, the college they went to and what year they entered college.

Name College Year
Mary Princeton 2017
Joe Harvard 2018
Bill Princeton 2016
Louise Princeton 2020
Michael Harvard 2019
Penny Yale 2018
Harry Yale 2015

I need the data to be ordered by year but grouped by college. However, if I order by year then I get the years in order but the colleges not together and if I order by college then I get the colleges together in alphabetical order but not with the years in order. Similarly if I order by year then college I won't get the colleges together and if I order by college then year I can't guarantee that the most recent year is first. What I want the table to look like is:

Name College Year
Louise Princeton 2020
Mary Princeton 2017
Bill Princeton 2016
Michael Harvard 2019
Joe Harvard 2018
Penny Yale 2018
Harry Yale 2015

So we see Princeton is first because it has the most recent year, but all the Princeton colleges are all together. Than Harvard is next because 2019>2018 which is the most recent year for Yale so it has the two Harvard schools. Followed by Yale since 2020>2019>2018. I appreciate all your ideas and help! Thank you!

CodePudding user response:

Add a temporary extra column with the max year per group and sort on multiple columns:

out = (df
 .assign(max_year=df.groupby('College')['Year'].transform('max'))
 .sort_values(by=['max_year', 'College', 'Year'], ascending=[False, True, False])
 .drop(columns='max_year')
)

output:

      Name    College  Year
3   Louise  Princeton  2020
0     Mary  Princeton  2017
2     Bill  Princeton  2016
4  Michael    Harvard  2019
1      Joe    Harvard  2018
5    Penny       Yale  2018
6    Harry       Yale  2015

with temporary column:

      Name    College  Year  max_year
3   Louise  Princeton  2020      2020
0     Mary  Princeton  2017      2020
2     Bill  Princeton  2016      2020
4  Michael    Harvard  2019      2019
1      Joe    Harvard  2018      2019
5    Penny       Yale  2018      2018
6    Harry       Yale  2015      2018

CodePudding user response:

You first want to sort by "College" then "Year", then keep "College" values together by using .groupby

import pandas as pd

data = [
    ["Mary", "Princeton", 2017],
    ["Joe", "Harvard", 2018],
    ["Bill", "Princeton", 2016],
    ["Louise", "Princeton", 2020],
    ["Michael", "Harvard", 2019],
    ["Penny", "Yale", 2018],
    ["Harry", "Yale", 2015],
]
df = pd.DataFrame(data, columns=["Name", "College", "Year"])
df.sort_values(["College", "Year"], ascending=False).groupby("College").head()

You'd get this output:

   Name    College  Year
  Penny       Yale  2018
  Harry       Yale  2015
 Louise  Princeton  2020
   Mary  Princeton  2017
   Bill  Princeton  2016
Michael    Harvard  2019
    Joe    Harvard  2018

CodePudding user response:

You will have to first find the maximum among each group and set that as a column. You can then sort by values based on max and year.

df=pd.read_table('./table.txt')
df["max"]=df.groupby("College")["Year"].transform("max")
df.sort_values(by=["max","Year"],ascending=False).drop(columns="max").reset_index(drop=True)

Output:

Out[60]:
      Name    College  Year
0   Louise  Princeton  2020
1     Mary  Princeton  2017
2     Bill  Princeton  2016
3  Michael    Harvard  2019
4      Joe    Harvard  2018
5    Penny       Yale  2018
6    Harry       Yale  2015

  • Related