I have the following dataframe where the cities are columns and ages are the values:
City1 | City2 | City3 |
---|---|---|
2 | 14 | 61 |
51 | 73 | 35 |
42 | 38 | 13 |
12 | 75 | 24 |
27 | 42 | 78 |
I want to create a new dataframe where the columns are age groups, and the cities are the index, like so:
 | 0-20 | 20-40 | 40-60 | 60-80 |
---|---|---|---|---|
City1 | 2 | 1 | 2 | 0 |
City2 | 1 | 1 | 1 | 2 |
City3 | 1 | 2 | 0 | 2 |
Is this possible to do in pandas?
CodePudding user response:
Try this, using pd.cut:
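import numpy as np
import pandas as pd
# stack to long form (one row per city/age pair), then bin the ages
dfc = pd.cut(df.rename_axis('Cities', axis=1).stack(),
             bins=[-np.inf, 20, 40, 60, np.inf],
             labels='0-20 20-40 40-60 60-80'.split(' ')).reset_index()
# count the ages per city and age group
pd.crosstab(dfc['Cities'], dfc[0]).reset_index()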
Output:
0  Cities  0-20  20-40  40-60  60-80
0   City1     2      1      2      0
1   City2     1      1      1      2
2   City3     1      2      0      2
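For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same pd.cut idea. The dataframe is rebuilt from the table in the question. The names `bins`, `labels` and `counts`, and the `apply` plus `value_counts` route, are my own choices for illustration rather than part of the answer above. Note that pd.cut bins are right-closed by default, so an age of exactly 20 would count as 0-20.

```python
import numpy as np
import pandas as pd

# Data rebuilt from the table in the question
df = pd.DataFrame({
    "City1": [2, 51, 42, 12, 27],
    "City2": [14, 73, 38, 75, 42],
    "City3": [61, 35, 13, 24, 78],
})

# Illustrative names, not taken from the answer above
bins = [-np.inf, 20, 40, 60, np.inf]
labels = ["0-20", "20-40", "40-60", "60-80"]

# Bin each city's ages, count per age group (empty groups are kept as 0),
# then transpose so the cities become the index
counts = df.apply(
    lambda s: pd.cut(s, bins=bins, labels=labels).value_counts(sort=False)
).T
print(counts)
```

This skips the stack/crosstab step and gives the cities as the index directly, which is the shape the question asks for.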
CodePudding user response:
#this should work
import pandas as pd
#creating df
data = [[2, 14, 61], [51, 73, 35], [42, 38, 13], [12, 75, 24], [27, 42, 78]]
df = pd.DataFrame(data, columns = ['city1', 'city2', 'city3'])
#sorting by given intervals
bins = [(0, 20), (20, 40), (40, 60), (60, 80)]
data_new = [
    [df[(df[city] > low) & (df[city] <= high)][city].count() for low, high in bins]
    for city in df.columns
]
#creating a new df with new data
df_new = pd.DataFrame(data_new, index= ['city1', 'city2', 'city3'], columns= ['0-20', '20-40', '40-60', '60-80'])
#the key point is to pass index=['city1', 'city2', 'city3'] between the data and the columns when you create the new dataframe
CodePudding user response:
Here is a solution using pd.Series.between for all combinations of the ranges and the cities.
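new_data = []
for city in df.columns:
    new_city = []
    for left, right in [(0, 20), (20, 40), (40, 60), (60, 80)]:
        # inclusive="left" counts ages in [left, right); needs pandas 1.3 or newer
        new_city.append(df[city].between(left, right, inclusive="left").sum())
    new_data.append(new_city)
# pass df.columns directly; wrapping it in a list yields a one-level MultiIndex
new_df = pd.DataFrame(new_data, columns=["0-20", "20-40", "40-60", "60-80"], index=df.columns)
new_df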
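One thing worth checking before picking an answer: the approaches in this thread treat an age that sits exactly on a bin edge differently. The sample data has no such ages, so all three give the same counts here, but that may not hold for other data. A small sketch, using a made-up series `ages` (hypothetical, not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical ages sitting exactly on the bin edges (not from the question)
ages = pd.Series([0, 20, 40, 60, 80])

# pd.cut is right-closed by default: 20 lands in (-inf, 20], labelled "0-20"
cut_groups = pd.cut(ages, bins=[-np.inf, 20, 40, 60, np.inf],
                    labels=["0-20", "20-40", "40-60", "60-80"])
print(cut_groups.tolist())

# Explicit comparisons with > 0 and <= 20 are also right-closed, but drop age 0
in_0_20_strict = (ages > 0) & (ages <= 20)
print(in_0_20_strict.tolist())

# between(..., inclusive="left") is left-closed: 20 lands in [20, 40)
in_20_40_left = ages.between(20, 40, inclusive="left")
print(in_20_40_left.tolist())
```

If edge ages matter, pick one convention and use it consistently; pd.cut also accepts right=False for left-closed bins.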