In Pandas, how can I create a pivot table that returns corresponding column value, for a max in anot-CodePudding

sorry if this sounds like a silly question, I'm very new to python and pandas and I've been having trouble with a pivot table. Any help would be appreciated.

I have a file which looks like this:

Idx State Sex  Year    Name  Count
0    GA   F  1910    Mary    841
1    GA   F  1910   Annie    553
2    GA   F  1910  Mattie    320
3    GA   F  1910    Ruby    279
4    GA   F  1910  Willie    275
5    GA   F  1910  Louise    231

I wanted to create a pivot table, which returns a column of sequential years [one row for each], along with a column for the Name which has the highest count, and a Count column showing that maximum eg

Year    Name  Count
1910    Mary    1841
1911    Paul    553
1912   June     9620

I've tried:

import pandas as pd
bn = pd.read_csv("baby_names.csv")
bn['Count'].astype(int)
by_yr = pd.pivot_table(data=bn, values="Count", index=['Year','Name'], aggfunc='max' )
print(by_yr.head(10))

as well as:

import pandas as pd
bn = pd.read_csv("baby_names.csv")
bn['Count'].astype(int)
by_yr = pd.pivot_table(data=bn, values=['Name',"Count"], index='Year', aggfunc='max' )
print(by_yr.head(10))

but neither of them work. I'm sure I'm missing something stupid but any help is appreciated. Thanks

CodePudding user response：

You can use groupby('Year') and nlargest to find, for each group/year, the row of the maximum Count.

step1 = df.groupby('Year').apply(lambda df: df.nlargest(1, 'Count'))

Then takes care of the formatting with

step1.droplevel(1, axis=0)[['Name', 'Count']].reset_index()

CodePudding user response：

try this:

I expanded your dataframe sample so we could be able to better verify the results:

df = pd.DataFrame({0:[0,'GA','F','1910','Mary',841],
                   1:[1,'GA','F','1910','Annie',553],
                   2:[2,'GA','F','1910','Mattie',320],
                   3:[3,'GA','F','1910','Ruby',279],
                   4:[4,'GA','F','1910','Willie',275],
                   5:[5,'GA','F','1910','Louise',231],
                   6:[6,'GA','M','1911','Paul',234],
                   7:[7,'GA','F','1911','Mattie',745],
                   8:[8,'GA','M','1911','Paul',654],
                   9:[9,'GA','M','1911','Paul',105],
                  10:[10,'GA','F','1911','Louise',76],
                  11:[11,'GA','M','1911','Ringo',894],
                  12:[12,'GA','F','1912','Louise',436],
                  13:[13,'GA','M','1912','Ringo',354],
                  14:[14,'GA','F','1912','Louise',654],
                  15:[15,'GA','F','1912','Ruby',325],
                  16:[16,'GA','M','1912','Ringo',846],
                  17:[17,'GA','F','1912','Louise',100],
                  18:[17,'GA','F','1912','Louise',98]}, 
               index=['Idx','State','Sex','Year','Name','Count']).T

Output: df>>>

  Idx   State   Sex   Year    Name  Count
0   0      GA     F   1910    Mary    841
1   1      GA     F   1910   Annie    553
2   2      GA     F   1910  Mattie    320
3   3      GA     F   1910    Ruby    279
4   4      GA     F   1910  Willie    275
5   5      GA     F   1910  Louise    231
6   6      GA     M   1911    Paul    234
7   7      GA     F   1911  Mattie    745
8   8      GA     M   1911    Paul    654
9   9      GA     M   1911    Paul    105
10  10     GA     F   1911  Louise     76
11  11     GA     M   1911   Ringo    894
12  12     GA     F   1912  Louise    436
13  13     GA     M   1912   Ringo    354
14  14     GA     F   1912  Louise    654
15  15     GA     F   1912    Ruby    325
16  16     GA     M   1912   Ringo    846
17  17     GA     F   1912  Louise    100
18  17     GA     F   1912  Louise     98

First be sure that the 'Count' row is numeric by using this:

df['Count'] = pd.to_numeric(df['Count'])

Then just do this:

df[['Year','Name','Count']].loc[df.groupby('Year')['Count'].idxmax()]

Output: df>>

    Year     Name   Count
0   1910     Mary     841
11  1911    Ringo     894
16  1912    Ringo     846