Home > Blockchain >  Querying a df in python3
Querying a df in python3

Time:05-25

I have a baseball stats df wherein some of the columns names are:

  1. "Games" - which indicates the number of games a player has played in that year.
  2. "Year" - which contains values between 1990 and 2020
  3. "PlayerID" - not unique. Basically shows which player played for which "Team" in what "Year"
  4. "Home Runs" - Number of nome Runs hit by that player in that year.

I want to find out:

  1. who has the max number of home runs where the number of games played is greater than 100.
  2. which column/ feature has the highest correlation with "Home Run"

Data is in the form of a csv file.

table head

CodePudding user response:

Looks like you can use a groupby and transform methods aggregate some of the data. There is also a pandasql library that you could import and use SQL to query the dataframe. Link to similar question: Run sql query on pandas dataframe

CodePudding user response:

  1. df = df[df['Games'] > 100] will keep only lines with more than 100 games. Then df['Home Runs'].max() will give you the max number.

  2. Finding the maximum value (excluding Home Runs) in df.corr()['Home Runs'].argmax() should give you the most correlated column.

NB: You might have to do some aggregation first like df = df.groupby(['PlayerID']).sum() but you said PlayerID is not a unique identifier so you might have to investigate this a bit more in depth. And .sum() aggregation only makes sense for Games and Home Runs so be careful if you need the Year column in other computations.

  • Related