pandas regex to put a comma after two or three words based on a condition (names of actors)?

Time:06-24

I am trying to separate a list of actors in the Starring column. This is what it looks like now.

df.Starring.value_counts().head(50)

none                                                                                                            
Choi Kang hee Kwon Sang woo                                                                                      
Kim Jun myeon Ha Yeon soo Oh Chang suk Kim Ye won                                                                
Yoon Shi yoon Cho Soo hyang                                                                                      
Kim Jaewon Kim Ha neul                                                                                           
No Min woo Hong Ah reum                                                                                          
Jang Hyuk Lee Da hae                                                                                             
So Ji sub Kim Ha neul Yoon Kye sang                                                                              
Kim So hyun Na In woo Lee Ji hoon Choi Yu hwa                                                                    
Ji Hyun woo Lee Si young Kim Jin yeop Yoon Joo hee                                                               
Uhm Tae woong Lee Si young Lee Soo hyuk                                                                          
Doh Kyung soo Nam Ji hyun Jo Sung ha Jo Han chul Kim Seon ho Han So hee Kim Jae young                            
Sung Yu ri Jung Gyu woon Kim Min jun Min Hyo rin                                                                 
Yoon Joo sang Hong Eun hee Jeon Hye bin Kim Kyung nam Go Won hee Lee Bo hee Lee Byung joon Choi Dae chul         
Choi Si won Kang So ra Gong Myung                                                                                
Shin Da eun Lee Jae hwang Kim Hae in Seo Do young                                                                
Yeo Jin goo Lee Yeon hee Ahn Jae hyun                                                                            
Go Hyun jung Park Jin hee Lee Jin wook Shin Sung rok Bong Tae gyu Park Ki woong Jung Eun chae Yoon Jong hoon 

My plan is to separate the names with commas first and then explode() the Starring column, but I am stuck on finding a way to separate the actors in each row reliably.

I tried this and it did nothing. It did not even add the commas.

df['Starring'] = df['Starring'].apply(lambda x: str(x) + ',')

Neither did this one. It also did not add a comma to separate the names.

df.Starring.apply(lambda x: x + ',' if len(x.split(' ')) == 2 else x)

I also experimented with regex via df.Starring.str.replace(), to no avail.

This is how I want it to look after I have added the commas, used them to split and explode the Starring column into Actor_1, Actor_2, etc., and dropped the Starring column.

Actor_1        Actor_2        Actor_3       Actor_4
Choi Kang hee  Kwon Sang woo  none          none
Kim Jun myeon  Ha Yeon soo    Oh Chang suk  Kim Ye won
Yoon Shi yoon  Cho Soo hyang  none          none
Kim Jaewon     Kim Ha neul    none          none
No Min woo     Hong Ah reum   none          none
Jang Hyuk      Lee Da hae     none          none

Thanks again for your help.

CodePudding user response:

You can use .extractall() to extract the names (two title-cased words followed by an optional third lowercase word), and then use .pivot() to list out the names on a single row. Finally, you can rename all of the columns using a list comprehension:

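import pandas as pd

# "none" is the column name in the answerer's test frame; with the question's data this would presumably be "Starring".
df = pd.DataFrame(df["none"].str.extractall(r"([A-Z][a-z]+ [A-Z][a-z]+(?: [a-z]+)?)")).reset_index()
df = df.pivot(index=["level_0"], columns=["match"]).fillna("none")
df.columns = [f"Actor_{i + 1}" for i in df.columns.get_level_values(1)]
df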

This outputs:

               Actor_1        Actor_2        Actor_3  ...        Actor_6         Actor_7         Actor_8
level_0                                               ...
0        Choi Kang hee  Kwon Sang woo           none  ...           none            none            none
1        Kim Jun myeon    Ha Yeon soo   Oh Chang suk  ...           none            none            none
2        Yoon Shi yoon  Cho Soo hyang           none  ...           none            none            none
3           Kim Jaewon    Kim Ha neul           none  ...           none            none            none
4           No Min woo   Hong Ah reum           none  ...           none            none            none
5            Jang Hyuk     Lee Da hae           none  ...           none            none            none
6            So Ji sub    Kim Ha neul  Yoon Kye sang  ...           none            none            none
7          Kim So hyun      Na In woo    Lee Ji hoon  ...           none            none            none
8          Ji Hyun woo   Lee Si young   Kim Jin yeop  ...           none            none            none
9        Uhm Tae woong   Lee Si young   Lee Soo hyuk  ...           none            none            none
10       Doh Kyung soo    Nam Ji hyun     Jo Sung ha  ...     Han So hee   Kim Jae young            none
11          Sung Yu ri  Jung Gyu woon    Kim Min jun  ...           none            none            none
12       Yoon Joo sang   Hong Eun hee   Jeon Hye bin  ...     Lee Bo hee  Lee Byung joon   Choi Dae chul
13         Choi Si won     Kang So ra     Gong Myung  ...           none            none            none
14         Shin Da eun  Lee Jae hwang     Kim Hae in  ...           none            none            none
15         Yeo Jin goo   Lee Yeon hee   Ahn Jae hyun  ...           none            none            none
16        Go Hyun jung   Park Jin hee   Lee Jin wook  ...  Park Ki woong   Jung Eun chae  Yoon Jong hoon
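
To see why the pattern splits the names where it does, it can help to try it outside pandas first. The following is only a sketch using the standard `re` module on a few rows copied from the question; the names `NAME`, `rows` and `names` are made up for this illustration. It relies on the same assumption as the answer: every name is two capitalised words, optionally followed by one lowercase word.

```python
import re

# Two capitalised words, then an optional third word that must be lowercase.
# The lowercase requirement is what stops a match from swallowing the next surname.
NAME = re.compile(r"[A-Z][a-z]+ [A-Z][a-z]+(?: [a-z]+)?")

# A few rows copied from the question (illustrative subset only).
rows = [
    "Choi Kang hee Kwon Sang woo",
    "Kim Jaewon Kim Ha neul",
    "Choi Si won Kang So ra Gong Myung",
    "none",
]

names = [NAME.findall(row) for row in rows]
for row, found in zip(rows, names):
    print(f"{row!r} -> {found}")
```

A row such as "none" yields an empty list, which is why rows without a match do not appear in the result of .extractall().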

CodePudding user response:

You can use str.findall to get all actors:

out = (df['Starring'].str.findall(r'(\w+ \w+(?: [a-z]+)?)').apply(pd.Series)
                     .rename(columns=lambda x: f"Actor_{x + 1}"))
print(out)

# Output
          Actor_1        Actor_2        Actor_3        Actor_4       Actor_5        Actor_6         Actor_7         Actor_8
0   Choi Kang hee  Kwon Sang woo            NaN            NaN           NaN            NaN             NaN             NaN
1   Kim Jun myeon    Ha Yeon soo   Oh Chang suk     Kim Ye won           NaN            NaN             NaN             NaN
2   Yoon Shi yoon  Cho Soo hyang            NaN            NaN           NaN            NaN             NaN             NaN
3      Kim Jaewon    Kim Ha neul            NaN            NaN           NaN            NaN             NaN             NaN
4      No Min woo   Hong Ah reum            NaN            NaN           NaN            NaN             NaN             NaN
5       Jang Hyuk     Lee Da hae            NaN            NaN           NaN            NaN             NaN             NaN
6       So Ji sub    Kim Ha neul  Yoon Kye sang            NaN           NaN            NaN             NaN             NaN
7     Kim So hyun      Na In woo    Lee Ji hoon    Choi Yu hwa           NaN            NaN             NaN             NaN
8     Ji Hyun woo   Lee Si young   Kim Jin yeop   Yoon Joo hee           NaN            NaN             NaN             NaN
9   Uhm Tae woong   Lee Si young   Lee Soo hyuk            NaN           NaN            NaN             NaN             NaN
10  Doh Kyung soo    Nam Ji hyun     Jo Sung ha    Jo Han chul   Kim Seon ho     Han So hee   Kim Jae young             NaN
11     Sung Yu ri  Jung Gyu woon    Kim Min jun    Min Hyo rin           NaN            NaN             NaN             NaN
12  Yoon Joo sang   Hong Eun hee   Jeon Hye bin  Kim Kyung nam    Go Won hee     Lee Bo hee  Lee Byung joon   Choi Dae chul
13    Choi Si won     Kang So ra     Gong Myung            NaN           NaN            NaN             NaN             NaN
14    Shin Da eun  Lee Jae hwang     Kim Hae in   Seo Do young           NaN            NaN             NaN             NaN
15    Yeo Jin goo   Lee Yeon hee   Ahn Jae hyun            NaN           NaN            NaN             NaN             NaN
16   Go Hyun jung   Park Jin hee   Lee Jin wook  Shin Sung rok  Bong Tae gyu  Park Ki woong   Jung Eun chae  Yoon Jong hoon
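
If you would rather insert the commas first, as the question originally set out to do, something along these lines might work. It is a rough sketch, not a tested solution for the full data set: the three-row frame and the names `pattern`, `with_commas` and `actors` are assumptions for illustration, and it again assumes each name is two capitalised words plus an optional lowercase word.

```python
import pandas as pd

# Small illustrative frame with rows copied from the question.
df = pd.DataFrame({"Starring": [
    "Choi Kang hee Kwon Sang woo",
    "Kim Jaewon Kim Ha neul",
    "So Ji sub Kim Ha neul Yoon Kye sang",
]})

# Match a complete name followed by a space, but only when the next
# character is a capital letter (the start of the next name), and put
# ", " in place of that space.
pattern = r"([A-Z][a-z]+ [A-Z][a-z]+(?: [a-z]+)?) (?=[A-Z])"
with_commas = df["Starring"].str.replace(pattern, r"\1, ", regex=True)

# Split on the inserted separator into one column per actor.
actors = with_commas.str.split(", ", expand=True)
actors.columns = [f"Actor_{i + 1}" for i in actors.columns]
print(actors)
```

Rows with fewer actors are padded with missing values, which can be filled with .fillna("none") if the layout from the question is wanted.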