Home > Enterprise >  GroupBy in a python-Pandas dataframe with score corresponding to occurence in columns
GroupBy in a python-Pandas dataframe with score corresponding to occurence in columns

Time:03-26

I have a dataframe of my city that contains the people's evaluation of neighbourhood problems and looks something like that:

personID hood problems                               problemTop1 problemTop2 problemTop3 other columns
0001     A    ['cars', 'air', 'trash', 'parks', ...] 'trash'     'cars'      'air'       ...
0002     D    ['cars', 'air', 'trash', 'parks', ...] 'cars'      'air'       'trash'     ...
0003     A    ['cars', 'air', 'trash', 'parks', ...] 'cars'      'parks'     'air'       ...
0004     M    ['cars', 'air', 'trash', 'parks', ...] 'parks'     'cars'      'air'       ...
...

Here is a short version of the dataframe with only the important columns. I would like to group the dataframe by neighbourhoods (hood) and give each of the possible problems a score. The score should be a weighted sum of how problematic the people perceived the problem and as an example should look like that - if in neighbourhood E 10 people perceived trash as problemTop1, 7 people as problemTop2 and 3 as problemTop3, then

E_score_trash = 10*3 7*2 3*1 = 47

My final output should show for each neighbourhood each problem's score, like this:

>>> df_scores_by_hood
hood 'air' 'cars' 'parks' 'trash'
A     386   871    321     984
B     436   719    504     620
C     383   230    848     601
D     881   524    231     783
...

Is there an easy way to do this with pandas, numpy or other functions? Tnx

CodePudding user response:

I found a solution, but I am not happy with it - it's not really pandasic, it's not vectorized (it does loops) and I have the feeling there should be something simpler for my problem. Do you know how to improve it?

# list of all the problems
problems = ['cars', 'air', 'trash',...]

for problem in problems:
    df[problem] = 0
    df[problem] = df.apply(lambda row: 3 if row['problemTop1']==problem
                           else 2 if row['problemTop2']==problem
                           else 1 if row['problemTop3']==problem
                           else 0, axis=1)

CodePudding user response:

EDIT

Did not fully understand the question originally, thus the initailly wrong answer. This

problems =  {'problemTop1':10,'problemTop2':7, 'problemTop3':3}
agg_df = pd.DataFrame()
for problem in problems:
    temp = pd.crosstab(df['hood'],df[problem],colnames=['problem'])
    temp = temp*problems[problem]
    agg_df = agg_df.add(temp,fill_value=0)
agg_df.astype(int)

should do the task in an efficient way.


Original

you need to explode the list column by

df = df.explode('problems')

then you can do a crosstab or in my experience better perform a double groupby and unstack.

df_scores_by_hood = df.groupby(['hood','problems'])['personID'].count().unstack()
  • Related