I have a dataframe of my city that contains the people's evaluation of neighbourhood problems and looks something like that:
personID hood problems problemTop1 problemTop2 problemTop3 other columns
0001 A ['cars', 'air', 'trash', 'parks', ...] 'trash' 'cars' 'air' ...
0002 D ['cars', 'air', 'trash', 'parks', ...] 'cars' 'air' 'trash' ...
0003 A ['cars', 'air', 'trash', 'parks', ...] 'cars' 'parks' 'air' ...
0004 M ['cars', 'air', 'trash', 'parks', ...] 'parks' 'cars' 'air' ...
...
Here is a short version of the dataframe with only the important columns. I would like to group the dataframe by neighbourhoods (hood
) and give each of the possible problems a score. The score should be a weighted sum of how problematic the people perceived the problem and as an example should look like that - if in neighbourhood E
10 people perceived trash
as problemTop1
, 7 people as problemTop2
and 3 as problemTop3
, then
E_score_trash = 10*3 7*2 3*1 = 47
My final output should show for each neighbourhood each problem's score, like this:
>>> df_scores_by_hood
hood 'air' 'cars' 'parks' 'trash'
A 386 871 321 984
B 436 719 504 620
C 383 230 848 601
D 881 524 231 783
...
Is there an easy way to do this with pandas, numpy or other functions? Tnx
CodePudding user response:
I found a solution, but I am not happy with it - it's not really pandasic, it's not vectorized (it does loops) and I have the feeling there should be something simpler for my problem. Do you know how to improve it?
# list of all the problems
problems = ['cars', 'air', 'trash',...]
for problem in problems:
df[problem] = 0
df[problem] = df.apply(lambda row: 3 if row['problemTop1']==problem
else 2 if row['problemTop2']==problem
else 1 if row['problemTop3']==problem
else 0, axis=1)
CodePudding user response:
EDIT
Did not fully understand the question originally, thus the initailly wrong answer. This
problems = {'problemTop1':10,'problemTop2':7, 'problemTop3':3}
agg_df = pd.DataFrame()
for problem in problems:
temp = pd.crosstab(df['hood'],df[problem],colnames=['problem'])
temp = temp*problems[problem]
agg_df = agg_df.add(temp,fill_value=0)
agg_df.astype(int)
should do the task in an efficient way.
Original
you need to explode the list column by
df = df.explode('problems')
then you can do a crosstab or in my experience better perform a double groupby and unstack.
df_scores_by_hood = df.groupby(['hood','problems'])['personID'].count().unstack()