Make a new column for each category in a particular column and repeat this for all columns in a data-CodePudding

I have a dataset like below-:

df = pd.DataFrame({
'state':['California'] * 4   ['Florida'] * 5   ['Minnesota'] * 3   ['New Hampshire'],
'population':['50-100', '0-50', '150-200', '50-100', '0-50', '150-200',
         '100-150', 'NA', '0-50', 'NA', '100-150', '50-100', 'NA'],
'locale':['rural', 'urban', 'town', 'suburb', 'suburb', 'urban', 'rural', 'suburb', 'NA', 'town', 'town', 'urban', 'rural'] 
})

I want new columns for each category in all columns for each state. An example of a row is below-:

state        population=0-50 population=50-100 population=100-150 population=150-200 locale=rural locale=urban locale=town locale=suburb
California   1               2                 0                  1                  1          1              1            1

CodePudding user response：

You can play around with this-

df.groupby(['population', 'locale']).apply(lambda x: x['state'].value_counts().to_frame()).unstack().reset_index()

reference - stack thread

CodePudding user response：

Use pd.get_dummies Groupby.sum(), as follows:

(pd.get_dummies(df.set_index('state'))
   .groupby('state').sum()
   .reset_index()
)

Result:

           state  population_0-50  population_100-150  population_150-200  population_50-100  population_NA  locale_NA  locale_rural  locale_suburb  locale_town  locale_urban
0     California                1                   0                   1                  2              0          0             1              1            1             1
1        Florida                2                   1                   1                  0              1          1             1              2            0             1
2      Minnesota                0                   1                   0                  1              1          0             0              0            2             1
3  New Hampshire                0                   0                   0                  0              1          0             1              0            0             0

If you want to exclude the entries with value NA, you can use:

(pd.get_dummies(df[df != 'NA'].set_index('state'))
   .groupby('state').sum()
   .reset_index()
)

Result:

           state  population_0-50  population_100-150  population_150-200  population_50-100  locale_rural  locale_suburb  locale_town  locale_urban
0     California                1                   0                   1                  2             1              1            1             1
1        Florida                2                   1                   1                  0             1              2            0             1
2      Minnesota                0                   1                   0                  1             0              0            2             1
3  New Hampshire                0                   0                   0                  0             1              0            0             0