How to efficiently get a string of values stored in a dataframe from a string of ids?-CodePudding

I'm working with two dataframes:

1. name_basics:

 nconst      primaryName
0  nm0000001     Fred Astaire
1  nm0000002    Lauren Bacall
2  nm0000003  Brigitte Bardot
3  nm0000004     John Belushi
4  nm0000005   Ingmar Bergman

2. title_directors:

 tconst            directors
0  tt0000574            nm0846879
1  tt0000591            nm0141150
2  tt0000679  nm0091767,nm0877783
3  tt0001184  nm0063413,nm0550220
4  tt0001258            nm0088881

What i'm looking for is something like this:

tconst            directors
0  tt0000574      name1
1  tt0000591      name2
2  tt0000679      name3,name4
3  tt0001184      name5,name6
4  tt0001258      name7

I tried to do so by defining a function getNames and then doing

title_directors['directors'] = title_directors['directors'].apply(getNames)

def getNames(nconsts):
    nconstList = nconsts.split(',')
    retString = ''
    for nconst in nconstList:
        df = name_basics[name_basics['nconst'] == nconst]['primaryName']
        if not df.empty:
            retString  = df.iloc[0]   ', '
    return retString[:-2]

getNames('nm0063413,nm0550220')
'Ricardo de Baños, Alberto Marro'

But this is way too slow (it would take like 80hrs on my computer). I was wondering what is the correct way to do this.

Thank you very much

CodePudding user response：

One approach:

import pprint

import pandas as pd

name_basics = pd.read_csv("input.csv")
title_directors = pd.read_csv("filer.csv", delim_whitespace=True)

print(name_basics)
print(title_directors)

# split by comma
title_directors["directors"] = title_directors["directors"].str.split(",")

# explode the DataFrame
td = title_directors.explode("directors")

# map the name_basics DataFrame onto the directors column
td["directors"] = td["directors"].map(name_basics.set_index("nconst").squeeze()).fillna("")

# un-explode
res = td.groupby("tconst").agg(",".join).reset_index()

# remove unwanted commnas from the new column
res["directors"] = res["directors"].str.strip(",")
print(res)

For a name_basics DataFrame as follows:

      nconst      primaryName
0  nm0000001     Fred Astaire
1  nm0000002    Lauren Bacall
2  nm0000003  Brigitte Bardot
3  nm0000004     John Belushi
4  nm0000005   Ingmar Bergman

and a title_directors as follows:

      tconst            directors
0  tt0000574            nm0000001
1  tt0000591            nm0141150
2  tt0000679  nm0091767,nm0000004
3  tt0001184  nm0063413,nm0550220
4  tt0001258            nm0088881

it gives the following output:

      tconst     directors
0  tt0000574  Fred Astaire
1  tt0000591              
2  tt0000679  John Belushi
3  tt0001184              
4  tt0001258