I have a csv file that contains n rows of sales of houses.
House | House_type | Sale_year |
---|---|---|
One | Semi | 2010 |
two | Flat | 2011 |
three | bungalow | 2012 |
four | Semi | 2013 |
five | Semi | 2013 |
I want to groupby the data by House_type (flat, bungalow, semi) by sale_year (2010,2011,etc) counts as columns. So I'm trying to output the data in the below format.
House_type | 2010 | 2011 | 2012 | 2013 |
---|---|---|---|---|
Semi | 1 | 0 | 0 | 2 |
Flat | 0 | 1 | 0 | 0 |
bungalow | 0 | 0 | 1 | 0 |
However, when I run the code, it returns both House_type and Sale_year as two columns.
house= housedata.groupby(["House_type", "Sale_year"])["Sale_year"].count()
house
House_type Sale_year
Flat 2011.0 1
bungalow 2012.0 1
Semi 2010.0 1
2013.0 2
How do I get pandas to output the data desired?
Many thanks
CodePudding user response:
You can achieve the same using get_dummies method of pandas. It basically creates multiple columns for a categorical column and fills it with values.
df = pd.DataFrame({'House_type':['Semi','Flat','Bungalow','Semi','Semi'],'sale_year':[2010,2011,2012,2013,2013]})
df_final = pd.get_dummies(df,columns=['sale_year']).groupby('House_type').sum()
df_final
CodePudding user response:
You can use pivot_table
here:
result = pd.pivot_table(df, index='House_type', columns='Sale_year',
aggfunc='count', fill_value=0)
It gives directly:
House
Sale_year 2010 2011 2012 2013
House_type
Flat 0 1 0 0
Semi 1 0 0 2
bungalow 0 0 1 0
You can format it a little if you want:
result.columns = result.columns.get_level_values(1).rename(None)
result = result.reset_index()
to finally get:
House_type 2010 2011 2012 2013
0 Flat 0 1 0 0
1 Semi 1 0 0 2
2 bungalow 0 0 1 0