pandas split string list of 2/3-word actor names into separate columns

Time:06-26

I am trying to separate a list of actors in the Starring column of a dataframe, and add them as separate columns. This is what it looks like now:

df.Starring.value_counts().head(50)

none                                                                                                            
Choi Kang hee Kwon Sang woo                                                                                      
Kim Jun myeon Ha Yeon soo Oh Chang suk Kim Ye won                                                                
Yoon Shi yoon Cho Soo hyang                                                                                      
Kim Jaewon Kim Ha neul                                                                                           
No Min woo Hong Ah reum                                                                                          
Jang Hyuk Lee Da hae                                                                                             
So Ji sub Kim Ha neul Yoon Kye sang                                                                              
Kim So hyun Na In woo Lee Ji hoon Choi Yu hwa                                                                    
Ji Hyun woo Lee Si young Kim Jin yeop Yoon Joo hee                                                               
Uhm Tae woong Lee Si young Lee Soo hyuk                                                                          
Doh Kyung soo Nam Ji hyun Jo Sung ha Jo Han chul Kim Seon ho Han So hee Kim Jae young                            
Sung Yu ri Jung Gyu woon Kim Min jun Min Hyo rin                                                                 
Yoon Joo sang Hong Eun hee Jeon Hye bin Kim Kyung nam Go Won hee Lee Bo hee Lee Byung joon Choi Dae chul         
Choi Si won Kang So ra Gong Myung                                                                                
Shin Da eun Lee Jae hwang Kim Hae in Seo Do young                                                                
Yeo Jin goo Lee Yeon hee Ahn Jae hyun                                                                            
Go Hyun jung Park Jin hee Lee Jin wook Shin Sung rok Bong Tae gyu Park Ki woong Jung Eun chae Yoon Jong hoon 

I first tried to separate the names with a comma so that I could later explode() the Starring column, but I am stuck on finding a way to separate the actors in each row reliably.

I tried this and it did nothing. It did not even add the commas.

df['Starring'] = df['Starring'].apply(lambda x: str(x) + ',')

So I tried this. It did not add a comma to separate the names either.

df.Starring.apply(lambda x: x + ',' if len(x.split(' ')) == 2 else x)

I also tried experimenting with regex through df.Starring.str.replace(), to no avail.

Desired output: this is how I want it to look after I have added the commas, used them to split and explode the Starring column into Actor_1, Actor_2, etc., and dropped the Starring column.

Actor_1        Actor_2        Actor_3       Actor_4
Choi Kang hee  Kwon Sang woo  none          none
Kim Jun myeon  Ha Yeon soo    Oh Chang suk  Kim Ye won
Yoon Shi yoon  Cho Soo hyang  none          none
Kim Jaewon     Kim Ha neul    none          none
No Min woo     Hong Ah reum   none          none
Jang Hyuk      Lee Da hae     none          none

CodePudding user response:

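You can use str.findall to get all actors. The pattern matches two words followed by an optional lowercase third word, so the next name starts at the next capitalized word:

import pandas as pd

# Each match is one actor: two words plus an optional lowercase third word.
out = (df['Starring'].str.findall(r'(\w+ \w+(?: [a-z]+)?)').apply(pd.Series)
                     .rename(columns=lambda x: f"Actor_{x + 1}"))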
print(out)

# Output
          Actor_1        Actor_2        Actor_3        Actor_4       Actor_5        Actor_6         Actor_7         Actor_8
0   Choi Kang hee  Kwon Sang woo            NaN            NaN           NaN            NaN             NaN             NaN
1   Kim Jun myeon    Ha Yeon soo   Oh Chang suk     Kim Ye won           NaN            NaN             NaN             NaN
2   Yoon Shi yoon  Cho Soo hyang            NaN            NaN           NaN            NaN             NaN             NaN
3      Kim Jaewon    Kim Ha neul            NaN            NaN           NaN            NaN             NaN             NaN
4      No Min woo   Hong Ah reum            NaN            NaN           NaN            NaN             NaN             NaN
5       Jang Hyuk     Lee Da hae            NaN            NaN           NaN            NaN             NaN             NaN
6       So Ji sub    Kim Ha neul  Yoon Kye sang            NaN           NaN            NaN             NaN             NaN
7     Kim So hyun      Na In woo    Lee Ji hoon    Choi Yu hwa           NaN            NaN             NaN             NaN
8     Ji Hyun woo   Lee Si young   Kim Jin yeop   Yoon Joo hee           NaN            NaN             NaN             NaN
9   Uhm Tae woong   Lee Si young   Lee Soo hyuk            NaN           NaN            NaN             NaN             NaN
10  Doh Kyung soo    Nam Ji hyun     Jo Sung ha    Jo Han chul   Kim Seon ho     Han So hee   Kim Jae young             NaN
11     Sung Yu ri  Jung Gyu woon    Kim Min jun    Min Hyo rin           NaN            NaN             NaN             NaN
12  Yoon Joo sang   Hong Eun hee   Jeon Hye bin  Kim Kyung nam    Go Won hee     Lee Bo hee  Lee Byung joon   Choi Dae chul
13    Choi Si won     Kang So ra     Gong Myung            NaN           NaN            NaN             NaN             NaN
14    Shin Da eun  Lee Jae hwang     Kim Hae in   Seo Do young           NaN            NaN             NaN             NaN
15    Yeo Jin goo   Lee Yeon hee   Ahn Jae hyun            NaN           NaN            NaN             NaN             NaN
16   Go Hyun jung   Park Jin hee   Lee Jin wook  Shin Sung rok  Bong Tae gyu  Park Ki woong   Jung Eun chae  Yoon Jong hoon
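
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. The small sample frame is an assumption built from a few rows in the question, and the names `names`, `actors` and `result` are only illustrative. It also swaps `apply(pd.Series)` for building the frame directly from the list of matches, which is a variation on the answer above rather than the same code, and it joins the new columns back while dropping Starring.

```python
import pandas as pd

# Assumed sample data: a few rows copied from the question, plus a "none" row.
df = pd.DataFrame({
    "Starring": [
        "Choi Kang hee Kwon Sang woo",
        "Kim Jaewon Kim Ha neul",
        "Choi Si won Kang So ra Gong Myung",
        "none",
    ]
})

# One actor = two words, optionally followed by a lowercase third word.
# A capitalized word after that starts the next actor's name.
names = df["Starring"].str.findall(r"\w+ \w+(?: [a-z]+)?")

# Build one column per position; rows with fewer actors get missing values.
actors = pd.DataFrame(names.tolist(), index=df.index)
actors.columns = [f"Actor_{i + 1}" for i in range(actors.shape[1])]

# Replace the Starring column with the new Actor_N columns.
result = df.drop(columns="Starring").join(actors)
print(result)
```

Note that a row containing only "none" has no two-word match, so it ends up with missing values in every Actor_N column. The pattern also relies on the third part of a name being lowercase, as it is in the data shown, so names written differently would need a different pattern.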