Currently I'm using pandas and numpy to play around with a data set on rain measurements in India, however I'm stumped on trying to create a particular column. Currently my data set looks like this:
SUBDIVISION | JAN | FEB | MAR | APR | MAY |
---|---|---|---|---|---|
Andaman & Nicobar Islands | 50 | 70 | 90 | 250 | 430 |
Arunachal Pradesh | 46 | 90 | 151 | 265 | 356 |
Assam & Meghalaya | 16 | 31 | 79 | 505 | 340 |
Bihar | 13 | 14 | 100 | 16 | 53 |
What I want is to replace all the columns that have the months with a single column "Months", and I want this column to contain the name of the month that has the most amount of rain, so for example it would look like this:
SUBDIVISION | Months |
---|---|
Andaman & Nicobar Islands | MAY |
Arunachal Pradesh | MAY |
Assam & Meghalaya | APR |
Bihar | MAR |
My data set is much larger than this so trying to manually input all of the data would not be worth it. So, I'm hoping there's a way to do what I'm wanting in Python
CodePudding user response:
Use
# get column name of max values in month columns
df.set_index('SUBDIVISION').idxmax(1).reset_index(name='Months')
CodePudding user response:
You can use pd.melt
to transform your data first.
import pandas as pd
df = pd.DataFrame({
'subdivision': ['a','b'],
'jun': [1,2],
'july': [2,1]
})
df = pd.melt(df, id_vars=['subdivision'], var_name='month', value_name='rain')
df
df:
subdivision | month | rain |
---|---|---|
a | jun | 1 |
b | jun | 2 |
a | july | 2 |
b | july | 1 |
Then, sort value with rain
value and drop_duplicates
subdivision
for keeping only the row having max rain value in each subdivision
df = df.sort_values('rain', ascending=False).drop_duplicates(['subdivision'])
Output:
subdivision | month | rain |
---|---|---|
b | jun | 2 |
a | july | 2 |