Data Cleaning: Counting the number of male actors in Python

Time:05-24

My main data frame looks like this: [image]

Below is the full content of one cell in the 'cast' column. I am trying to count how many male actors there are for each film ID.

[{'cast_id': 9, 'character': 'Count Dracula', 'credit_id': '52fe44b79251416c7503e811', 'gender': 2, 'id': 7633, 'name': 'Leslie Nielsen', 'order': 0, 'profile_path': '/cZpuTfE1j63tCEXoSL2A7KZjk5d.jpg'}, {'cast_id': 10, 'character': 'Prof. Abraham Van Helsing', 'credit_id': '52fe44b79251416c7503e815', 'gender': 2, 'id': 14639, 'name': 'Mel Brooks', 'order': 1, 'profile_path': '/ndFo3LOYNCUghQTK833N1Wtuynr.jpg'}, {'cast_id': 11, 'character': 'Mina Seward', 'credit_id': '52fe44b79251416c7503e819', 'gender': 1, 'id': 1219, 'name': 'Amy Yasbeck', 'order': 2, 'profile_path': '/c7WOXDlKE9cwTBjpUtCKFPqnptr.jpg'}, {'cast_id': 12, 'character': 'Thomas Renfield', 'credit_id': '52fe44b79251416c7503e81d', 'gender': 2, 'id': 12688, 'name': 'Peter MacNicol', 'order': 3, 'profile_path': '/hw4KTj66xh2FupS5sQMeNMNL6lx.jpg'}, {'cast_id': 13, 'character': 'Lucy Westenra', 'credit_id': '52fe44b79251416c7503e821', 'gender': 1, 'id': 12812, 'name': 'Lysette Anthony', 'order': 4, 'profile_path': '/yLTijECWVviLdhB03IYkZSJJLcY.jpg'}, {'cast_id': 14, 'character': 'Dr. Jack Seward', 'credit_id': '52fe44b79251416c7503e825', 'gender': 2, 'id': 13640, 'name': 'Harvey Korman', 'order': 5, 'profile_path': '/zXLYvJP3ReKPI6lJr2VuDGupL1j.jpg'}, {'cast_id': 15, 'character': 'Jonathan Harker', 'credit_id': '52fe44b79251416c7503e829', 'gender': 2, 'id': 6106, 'name': 'Steven Weber', 'order': 6, 'profile_path': '/ujINzDjLNtELBSUkRHQgExzP4Fb.jpg'}, {'cast_id': 17, 'character': 'Martin', 'credit_id': '52fe44b79251416c7503e833', 'gender': 2, 'id': 29709, 'name': 'Mark Blankfield', 'order': 7, 'profile_path': '/oAtZrKAA1c8XJLBBfW1oXYYt8jS.jpg'}, {'cast_id': 18, 'character': 'Essie', 'credit_id': '52fe44b79251416c7503e837', 'gender': 1, 'id': 53570, 'name': 'Megan Cavanagh', 'order': 8, 'profile_path': '/xZML4JXgD7Yd0f19hXhq7bXLXfC.jpg'}, {'cast_id': 19, 'character': 'Woodbridge', 'credit_id': '52fe44b79251416c7503e83b', 'gender': 2, 'id': 170185, 'name': 'Gregg Binkley', 'order': 9, 'profile_path': '/gTmsOFfFqHDcEzeqh3ezcSuk7TB.jpg'}, {'cast_id': 20, 'character': 'Madame Ouspenskaya', 'credit_id': '56fc3c8dc3a36808a70033f6', 'gender': 1, 'id': 10774, 'name': 'Anne Bancroft', 'order': 10, 'profile_path': '/4VMhut6tvXqXBmMGFRjXbbImAZW.jpg'}]

CodePudding user response:

I think you can use df['gender'].value_counts() to count each unique value in the gender column, then pick out the count for value 2 (male) with df['gender'].value_counts()[2]:

import pandas as pd

# Build a DataFrame with one row per cast member of a single film
df = pd.DataFrame([{'cast_id': 9, 'character': 'Count Dracula', 'credit_id': '52fe44b79251416c7503e811', 'gender': 2, 'id': 7633, 'name': 'Leslie Nielsen', 'order': 0, 'profile_path': '/cZpuTfE1j63tCEXoSL2A7KZjk5d.jpg'}, {'cast_id': 10, 'character': 'Prof. Abraham Van Helsing', 'credit_id': '52fe44b79251416c7503e815', 'gender': 2, 'id': 14639, 'name': 'Mel Brooks', 'order': 1, 'profile_path': '/ndFo3LOYNCUghQTK833N1Wtuynr.jpg'}, {'cast_id': 11, 'character': 'Mina Seward', 'credit_id': '52fe44b79251416c7503e819', 'gender': 1, 'id': 1219, 'name': 'Amy Yasbeck', 'order': 2, 'profile_path': '/c7WOXDlKE9cwTBjpUtCKFPqnptr.jpg'}, {'cast_id': 12, 'character': 'Thomas Renfield', 'credit_id': '52fe44b79251416c7503e81d', 'gender': 2, 'id': 12688, 'name': 'Peter MacNicol', 'order': 3, 'profile_path': '/hw4KTj66xh2FupS5sQMeNMNL6lx.jpg'}, {'cast_id': 13, 'character': 'Lucy Westenra', 'credit_id': '52fe44b79251416c7503e821', 'gender': 1, 'id': 12812, 'name': 'Lysette Anthony', 'order': 4, 'profile_path': '/yLTijECWVviLdhB03IYkZSJJLcY.jpg'}, {'cast_id': 14, 'character': 'Dr. Jack Seward', 'credit_id': '52fe44b79251416c7503e825', 'gender': 2, 'id': 13640, 'name': 'Harvey Korman', 'order': 5, 'profile_path': '/zXLYvJP3ReKPI6lJr2VuDGupL1j.jpg'}, {'cast_id': 15, 'character': 'Jonathan Harker', 'credit_id': '52fe44b79251416c7503e829', 'gender': 2, 'id': 6106, 'name': 'Steven Weber', 'order': 6, 'profile_path': '/ujINzDjLNtELBSUkRHQgExzP4Fb.jpg'}, {'cast_id': 17, 'character': 'Martin', 'credit_id': '52fe44b79251416c7503e833', 'gender': 2, 'id': 29709, 'name': 'Mark Blankfield', 'order': 7, 'profile_path': '/oAtZrKAA1c8XJLBBfW1oXYYt8jS.jpg'}, {'cast_id': 18, 'character': 'Essie', 'credit_id': '52fe44b79251416c7503e837', 'gender': 1, 'id': 53570, 'name': 'Megan Cavanagh', 'order': 8, 'profile_path': '/xZML4JXgD7Yd0f19hXhq7bXLXfC.jpg'}, {'cast_id': 19, 'character': 'Woodbridge', 'credit_id': '52fe44b79251416c7503e83b', 'gender': 2, 'id': 170185, 'name': 'Gregg Binkley', 'order': 9, 'profile_path': '/gTmsOFfFqHDcEzeqh3ezcSuk7TB.jpg'}, {'cast_id': 20, 'character': 'Madame Ouspenskaya', 'credit_id': '56fc3c8dc3a36808a70033f6', 'gender': 1, 'id': 10774, 'name': 'Anne Bancroft', 'order': 10, 'profile_path': '/4VMhut6tvXqXBmMGFRjXbbImAZW.jpg'}])
# Number of cast members whose gender is 2 (male)
df['gender'].value_counts()[2]

Edit: If the cast is stored in a single pandas column, you can use the apply method to go through each film's cast list, count the male cast members, and store the result in a new column ('male_count' in the example below). Replace data with the name of the variable that holds your DataFrame.

# Assumes each cell of the 'cast' column holds a list of dicts, as in the question
data['male_count'] = data['cast'].apply(lambda cast: sum(member['gender'] == 2 for member in cast))
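
If the DataFrame was read from a CSV file, each 'cast' cell may be a string rather than a real list of dictionaries, and it then has to be parsed before counting. Below is a minimal sketch under that assumption. The two sample rows are shortened and made up for illustration, and the helper name count_males is hypothetical.

```python
import ast

import pandas as pd

# Hypothetical sample: two films whose 'cast' cells are strings, as they
# might be after reading a CSV file.
data = pd.DataFrame({
    'id': [1, 2],
    'cast': [
        "[{'name': 'Leslie Nielsen', 'gender': 2}, {'name': 'Amy Yasbeck', 'gender': 1}]",
        "[{'name': 'Mel Brooks', 'gender': 2}, {'name': 'Steven Weber', 'gender': 2}]",
    ],
})


def count_males(cell):
    """Return how many cast members in one cell have gender == 2."""
    # Parse the cell only if it is still a string.
    cast_list = ast.literal_eval(cell) if isinstance(cell, str) else cell
    # Summing booleans counts the True values.
    return sum(member.get('gender') == 2 for member in cast_list)


data['male_count'] = data['cast'].apply(count_males)
print(data[['id', 'male_count']])
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for cells like these.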

CodePudding user response:

You can count the genders in a list of dictionaries by simply iterating over it.
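
The iteration idea could look something like the sketch below. This is not the answerer's own code. It uses collections.Counter from the standard library on a shortened, illustrative version of one 'cast' cell.

```python
from collections import Counter

# Shortened, illustrative version of one 'cast' cell from the question.
cast = [
    {'name': 'Leslie Nielsen', 'gender': 2},
    {'name': 'Mel Brooks', 'gender': 2},
    {'name': 'Amy Yasbeck', 'gender': 1},
]

# Tally every gender code found in the list.
gender_counts = Counter(member['gender'] for member in cast)

# A Counter returns 0 for a key it has never seen.
male_count = gender_counts[2]
print(gender_counts)
print(male_count)
```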

I think it is very easy with groupby:

# df is your DataFrame, with a 'gender' column
df.groupby('gender').size()  # number of rows for each gender value
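
To apply the groupby idea to a DataFrame with one row per film, the cast lists first have to be expanded to one row per cast member. The sketch below shows one possible way, assuming the 'cast' column already holds real lists of dictionaries. The sample data and the names films, members and counts are made up for illustration.

```python
import pandas as pd

# Hypothetical sample with one row per film; 'cast' holds lists of dicts.
films = pd.DataFrame({
    'id': [1, 2],
    'cast': [
        [{'name': 'Leslie Nielsen', 'gender': 2}, {'name': 'Amy Yasbeck', 'gender': 1}],
        [{'name': 'Mel Brooks', 'gender': 2}, {'name': 'Steven Weber', 'gender': 2}],
    ],
})

# One row per (film, cast member) pair.
members = films.explode('cast', ignore_index=True)
members['gender'] = members['cast'].apply(lambda member: member['gender'])

# Count cast members per film and gender code; missing combinations become 0.
counts = members.groupby(['id', 'gender']).size().unstack(fill_value=0)
print(counts)
```

The column labelled 2 in counts then holds the number of male cast members per film. A film with an empty cast list becomes a NaN row after explode, so such rows would need to be dropped or handled before extracting the gender.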