I'm new to python, and I could use a push in the right direction for what I think should be (?) a pretty simple problem. I've got a dataframe (genres_df) with one column:
0
0 Horror
1 Comedy
2 Fantasy
3 Adventure
4 Drama
5 Animation
6 Crime
...
and a dataframe (df) with 3 columns--one for each genre associated with the film, and one row for each film I'm looking at):
0 1 2
0 Horror Short None
1 Horror Short None
2 Comedy Horror Short
3 Comedy Horror Short
4 Fantasy Horror Short
...
I want to count the number of rows in the dataframe genres that contain each item in genres_df. I was able to do this by hand, with a sum line:
sum(df[0] == 'Comedy') sum(df[1] == 'Comedy') sum(df[2] == 'Comedy')
I know this works, because there is a Horror item in each row, and there are 78471 rows in df.
I want to get a dataframe that has two columns: the genre (from genres_df) and the count of rows in which that genre appears, across any of the columns in df. Like so
0 1
0 Horror 78471
1 Comedy 9903
...
Here's what I've got so far:
df_counts = pd.DataFrame(columns = ['genre','count'])
for i in genres_df[0]:
s_row = pd.Series(i,sum(df[0]==i) sum(df[1]==i) sum(df[2]==i))
df_counts.append(s_row,ignore_index=True)
But this doesn't work. It seems to be the closest I've gotten, though. Help?
CodePudding user response:
I think what you need to do is the following
- melt your dataframe (i.e turn the column into rows)
- apply a groupby operation & count
Finally.
- filter the genre column with the genre's from your first df
df1 = pd.melt(df,value_name='genre').groupby('genre',
as_index=False).agg(count=('genre','count'))
print(df1)
genre count
0 Comedy 2
1 Fantasy 1
2 Horror 5
3 None 2
4 Short 5
final = df1[df1['genre'].isin(all_genre['0'])]
genre count
0 Comedy 2
1 Fantasy 1
2 Horror 5
As a side note - if you're using methods outside of the pandas API then you're most likely doing something incorrectly (as in against the design of the library)
CodePudding user response:
I think your solution is right but append
like any other pandas operation does not mutate the dataframe by default, so you have to reassign it (or use inplace=True
)
df_counts = pd.DataFrame(columns = ['genre','count'])
for i in genres_df[0]:
s_row = pd.Series(i,sum(df[0]==i) sum(df[1]==i) sum(df[2]==i))
df_counts = df_counts.append(s_row,ignore_index=True)