I have a baseball stats df wherein some of the columns names are:
- "Games" - which indicates the number of games a player has played in that year.
- "Year" - which contains values between 1990 and 2020
- "PlayerID" - not unique. Basically shows which player played for which "Team" in what "Year"
- "Home Runs" - Number of nome Runs hit by that player in that year.
I want to find out:
- who has the max number of home runs where the number of games played is greater than 100.
- which column/ feature has the highest correlation with "Home Run"
Data is in the form of a csv file.
CodePudding user response:
Looks like you can use a groupby
and transform
methods aggregate some of the data. There is also a pandasql library that you could import and use SQL to query the dataframe.
Link to similar question: Run sql query on pandas dataframe
CodePudding user response:
df = df[df['Games'] > 100]
will keep only lines with more than 100 games. Thendf['Home Runs'].max()
will give you the max number.Finding the maximum value (excluding
Home Runs
) indf.corr()['Home Runs'].argmax()
should give you the most correlated column.
NB: You might have to do some aggregation first like df = df.groupby(['PlayerID']).sum()
but you said PlayerID
is not a unique identifier so you might have to investigate this a bit more in depth. And .sum()
aggregation only makes sense for Games
and Home Runs
so be careful if you need the Year
column in other computations.