I'm working with two dataframes:
1. name_basics:
nconst primaryName
0 nm0000001 Fred Astaire
1 nm0000002 Lauren Bacall
2 nm0000003 Brigitte Bardot
3 nm0000004 John Belushi
4 nm0000005 Ingmar Bergman
2. title_directors:
tconst directors
0 tt0000574 nm0846879
1 tt0000591 nm0141150
2 tt0000679 nm0091767,nm0877783
3 tt0001184 nm0063413,nm0550220
4 tt0001258 nm0088881
What i'm looking for is something like this:
tconst directors
0 tt0000574 name1
1 tt0000591 name2
2 tt0000679 name3,name4
3 tt0001184 name5,name6
4 tt0001258 name7
I tried to do so by defining a function getNames and then doing
title_directors['directors'] = title_directors['directors'].apply(getNames)
def getNames(nconsts):
nconstList = nconsts.split(',')
retString = ''
for nconst in nconstList:
df = name_basics[name_basics['nconst'] == nconst]['primaryName']
if not df.empty:
retString = df.iloc[0] ', '
return retString[:-2]
getNames('nm0063413,nm0550220')
'Ricardo de Baños, Alberto Marro'
But this is way too slow (it would take like 80hrs on my computer). I was wondering what is the correct way to do this.
Thank you very much
CodePudding user response:
One approach:
import pprint
import pandas as pd
name_basics = pd.read_csv("input.csv")
title_directors = pd.read_csv("filer.csv", delim_whitespace=True)
print(name_basics)
print(title_directors)
# split by comma
title_directors["directors"] = title_directors["directors"].str.split(",")
# explode the DataFrame
td = title_directors.explode("directors")
# map the name_basics DataFrame onto the directors column
td["directors"] = td["directors"].map(name_basics.set_index("nconst").squeeze()).fillna("")
# un-explode
res = td.groupby("tconst").agg(",".join).reset_index()
# remove unwanted commnas from the new column
res["directors"] = res["directors"].str.strip(",")
print(res)
For a name_basics
DataFrame as follows:
nconst primaryName
0 nm0000001 Fred Astaire
1 nm0000002 Lauren Bacall
2 nm0000003 Brigitte Bardot
3 nm0000004 John Belushi
4 nm0000005 Ingmar Bergman
and a title_directors
as follows:
tconst directors
0 tt0000574 nm0000001
1 tt0000591 nm0141150
2 tt0000679 nm0091767,nm0000004
3 tt0001184 nm0063413,nm0550220
4 tt0001258 nm0088881
it gives the following output:
tconst directors
0 tt0000574 Fred Astaire
1 tt0000591
2 tt0000679 John Belushi
3 tt0001184
4 tt0001258