I'm relatively new to Python and am really having trouble working with lists.
I have a dataframe (df1) with a column for 'actors' with many actors in a string and I have a separate dataframe (df2) that lists actors that have received an award.
I want to add a column to df1 that will indicate whether an actor has received an award or not, so for example 1=award, 0=no award.
I am trying to use for loops but it is not iterating in the way I want.
In my example, only 'Betty' has an award, so the 'actors_with_awards' column should display a 0 for the first row and 1 for the second, but the result is a 1 for both rows.
I suspect this is because it's checking the string in its entirety, for example whether 'Alexander, Ann' is in the list rather than whether 'Alexander' or 'Ann' is in the list. I thought splitting the strings would solve this (maybe I did that step wrong?), so I'm not sure how to fix it.
My full code is below:
import pandas as pd
# Creating sample dataframes
df1 = pd.DataFrame()
df1['cast']=['Alexander, Ann','Bob, Bill, Benedict, Betty']
df2 = pd.DataFrame()
df2['awards']=['Betty']
# Creating lists of actors, and Splitting up the string
actor_split=[]
for x in df1['cast']:
    actor_split.append(x.split(','))
# Creating a list of actors who have received an award
award=[]
for x in df2['awards']:
    award.append(x)
# Attempting to create a list of actors in Df1 who have received an award
actors_with_awards = []
for item in actor_split:
    if x in item not in award:
        actors_with_awards.append(0)
    else:
        actors_with_awards.append(1)
df1['actors_with_awards']=actors_with_awards
df1
Current Output Df1
cast | actors_with_awards |
---|---|
Alexander, Ann | 1 |
Bob, Bill, Benedict, Betty | 1 |
Expected Output Df1
cast | actors_with_awards |
---|---|
Alexander, Ann | 0 |
Bob, Bill, Benedict, Betty | 1 |
CodePudding user response:
One possible solution is to convert the actors with awards from df2 to a set, split the column df1['cast'], and check the intersection between each row's names and the set:
awards = set(df2["awards"].values)
df1["actors_with_awards"] = [
    int(bool(awards.intersection(a)))
    for a in df1["cast"].str.split(r"\s*,\s*", regex=True)
]
print(df1)
Prints:
cast actors_with_awards
0 Alexander, Ann 0
1 Bob, Bill, Benedict, Betty 1
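For larger frames, the same check can be written without a Python-level loop. The following is only a sketch of an alternative technique (split, explode, isin, then group by the original row), not the method used above; it assumes names are separated by commas and that an exact match after stripping whitespace is what counts as "the same actor".

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({"cast": ["Alexander, Ann", "Bob, Bill, Benedict, Betty"]})
df2 = pd.DataFrame({"awards": ["Betty"]})

# One row per actor; the original row label is repeated in the index
names = df1["cast"].str.split(",").explode().str.strip()

# True for each actor found in the awards column, then collapsed back to
# one value per original row: 1 if any actor in that row has an award
flags = names.isin(df2["awards"].str.strip()).groupby(level=0).any().astype(int)

df1["actors_with_awards"] = flags
print(df1)
```

The assignment aligns on the index, so each flag lands on the row its names came from.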
CodePudding user response:
When trying out your program, a couple of things popped up. First was your comparison of "x" to see if it was contained in the awards database.
for item in actor_split:
    if x in item not in award:
        actors_with_awards.append(0)
    else:
        actors_with_awards.append(1)
The issue here was that x still contains the value "Betty", left over from the loop that populated the awards list. It is never set to each name in the split actor list. The other issue was that splitting on ',' leaves a leading space on every name after the first (for example ' Betty'), so when checking whether a name exists in the awards list, those leading and/or trailing spaces were throwing off the comparison.
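Both effects are easy to reproduce in isolation. The snippet below is a small illustration using one cast string from the question; the variable names parts and stripped are introduced here for the demonstration only.

```python
cast = "Bob, Bill, Benedict, Betty"
award = ["Betty"]

# Splitting on "," keeps the space that followed each comma
parts = cast.split(",")
stripped = [p.strip() for p in parts]

print(parts)                # every name after the first has a leading space
print("Betty" in parts)     # False: the list holds " Betty", not "Betty"
print("Betty" in stripped)  # True once the whitespace is removed

# Python chains comparison operators, so the original condition
#     x in item not in award
# is evaluated as (x in item) and (item not in award)
x = "Betty"
chained = x in parts not in award
explicit = (x in parts) and (parts not in award)
print(chained == explicit)
```

Because "Betty" is not found in the unstripped list, the chained condition is false for both rows, which is why the original loop always took the else branch and appended 1.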
With that in mind, I made a few tweaks to your code to address those situations, as shown in the following code snippet.
import pandas as pd
# Creating sample dataframes
df1 = pd.DataFrame()
df1['cast']=['Alexander, Ann','Bob, Bill, Benedict, Betty']
df2 = pd.DataFrame()
df2['awards']=['Betty']
# Creating lists of actors, and Splitting up the string
actor_split=[]
for x in df1['cast']:
    actor_split.append(x.split(','))
# Creating a list of actors who have received an award
award=[]
for x in df2['awards']:
    award.append(x.strip()) # Make sure no leading or trailing spaces exist for subsequent test
# Attempting to create a list of actors in Df1 who have received an award
actors_with_awards = []
for item in actor_split:
    y = 0 # Work variable; stays 0 unless an awarded actor is found
    for x in item: # Reworked this so that "x" is associated with the selected actor set
        if x.strip() in award: # Again, make sure no leading or trailing spaces are in the comparison
            y = 1
            break # Stop at the first match so a later name cannot reset the flag to 0
    actors_with_awards.append(y)
df1['actors_with_awards']=actors_with_awards
print(df1) # Changed this so as to print out the data to a terminal
To ensure that leading or trailing spaces would not trip up comparisons or list checks, I added the ".strip()" method where needed to store just the name value and nothing more. Secondly, so that the proper name value was placed into variable "x", an additional for loop was added along with a work variable to be populated with the proper "0" or "1" value. Adding those tweaks resulted in the following raw data output on the terminal.
cast actors_with_awards
0 Alexander, Ann 0
1 Bob, Bill, Benedict, Betty 1
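If the nested loop feels heavy, the per-row check can be condensed with any(). This is only a sketch; has_award is a hypothetical helper name, and it assumes comma-separated names with exact matching after stripping whitespace.

```python
def has_award(cast, award_names):
    """Return 1 if any comma-separated name in cast is in award_names, else 0."""
    return int(any(name.strip() in award_names for name in cast.split(",")))

award = {"Betty"}  # a set keeps each membership test cheap

print(has_award("Alexander, Ann", award))
print(has_award("Bob, Bill, Benedict, Betty", award))
print(has_award("Betty, Bob", award))  # awarded actor listed first
```

any() returns as soon as one awarded name is found, so the position of that name within the cast string does not affect the result. With the dataframes from the question, it could be applied as df1['cast'].apply(has_award, award_names=set(award)).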
You may want to give that a try. Please note that this may be just one way to address this issue, but I hope it clarifies things for you.
Regards.