sorry if this sounds like a silly question, I'm very new to python and pandas and I've been having trouble with a pivot table. Any help would be appreciated.
I have a file which looks like this:
Idx State Sex Year Name Count
0 GA F 1910 Mary 841
1 GA F 1910 Annie 553
2 GA F 1910 Mattie 320
3 GA F 1910 Ruby 279
4 GA F 1910 Willie 275
5 GA F 1910 Louise 231
I wanted to create a pivot table, which returns a column of sequential years [one row for each], along with a column for the Name which has the highest count, and a Count column showing that maximum eg
Year Name Count
1910 Mary 1841
1911 Paul 553
1912 June 9620
I've tried:
import pandas as pd
bn = pd.read_csv("baby_names.csv")
bn['Count'].astype(int)
by_yr = pd.pivot_table(data=bn, values="Count", index=['Year','Name'], aggfunc='max' )
print(by_yr.head(10))
as well as:
import pandas as pd
bn = pd.read_csv("baby_names.csv")
bn['Count'].astype(int)
by_yr = pd.pivot_table(data=bn, values=['Name',"Count"], index='Year', aggfunc='max' )
print(by_yr.head(10))
but neither of them work. I'm sure I'm missing something stupid but any help is appreciated. Thanks
CodePudding user response:
You can use groupby('Year')
and nlargest
to find, for each group/year, the row of the maximum Count
.
step1 = df.groupby('Year').apply(lambda df: df.nlargest(1, 'Count'))
Then takes care of the formatting with
step1.droplevel(1, axis=0)[['Name', 'Count']].reset_index()
CodePudding user response:
try this:
I expanded your dataframe sample so we could be able to better verify the results:
df = pd.DataFrame({0:[0,'GA','F','1910','Mary',841],
1:[1,'GA','F','1910','Annie',553],
2:[2,'GA','F','1910','Mattie',320],
3:[3,'GA','F','1910','Ruby',279],
4:[4,'GA','F','1910','Willie',275],
5:[5,'GA','F','1910','Louise',231],
6:[6,'GA','M','1911','Paul',234],
7:[7,'GA','F','1911','Mattie',745],
8:[8,'GA','M','1911','Paul',654],
9:[9,'GA','M','1911','Paul',105],
10:[10,'GA','F','1911','Louise',76],
11:[11,'GA','M','1911','Ringo',894],
12:[12,'GA','F','1912','Louise',436],
13:[13,'GA','M','1912','Ringo',354],
14:[14,'GA','F','1912','Louise',654],
15:[15,'GA','F','1912','Ruby',325],
16:[16,'GA','M','1912','Ringo',846],
17:[17,'GA','F','1912','Louise',100],
18:[17,'GA','F','1912','Louise',98]},
index=['Idx','State','Sex','Year','Name','Count']).T
Output: df>>>
Idx State Sex Year Name Count
0 0 GA F 1910 Mary 841
1 1 GA F 1910 Annie 553
2 2 GA F 1910 Mattie 320
3 3 GA F 1910 Ruby 279
4 4 GA F 1910 Willie 275
5 5 GA F 1910 Louise 231
6 6 GA M 1911 Paul 234
7 7 GA F 1911 Mattie 745
8 8 GA M 1911 Paul 654
9 9 GA M 1911 Paul 105
10 10 GA F 1911 Louise 76
11 11 GA M 1911 Ringo 894
12 12 GA F 1912 Louise 436
13 13 GA M 1912 Ringo 354
14 14 GA F 1912 Louise 654
15 15 GA F 1912 Ruby 325
16 16 GA M 1912 Ringo 846
17 17 GA F 1912 Louise 100
18 17 GA F 1912 Louise 98
First be sure that the 'Count' row is numeric by using this:
df['Count'] = pd.to_numeric(df['Count'])
Then just do this:
df[['Year','Name','Count']].loc[df.groupby('Year')['Count'].idxmax()]
Output: df>>
Year Name Count
0 1910 Mary 841
11 1911 Ringo 894
16 1912 Ringo 846