I am trying to split the list of actors in the Starring
column of a DataFrame so that each actor ends up in a separate column.
This is what the column looks like now:
df.Starring.value_counts().head(50)
none
Choi Kang hee Kwon Sang woo
Kim Jun myeon Ha Yeon soo Oh Chang suk Kim Ye won
Yoon Shi yoon Cho Soo hyang
Kim Jaewon Kim Ha neul
No Min woo Hong Ah reum
Jang Hyuk Lee Da hae
So Ji sub Kim Ha neul Yoon Kye sang
Kim So hyun Na In woo Lee Ji hoon Choi Yu hwa
Ji Hyun woo Lee Si young Kim Jin yeop Yoon Joo hee
Uhm Tae woong Lee Si young Lee Soo hyuk
Doh Kyung soo Nam Ji hyun Jo Sung ha Jo Han chul Kim Seon ho Han So hee Kim Jae young
Sung Yu ri Jung Gyu woon Kim Min jun Min Hyo rin
Yoon Joo sang Hong Eun hee Jeon Hye bin Kim Kyung nam Go Won hee Lee Bo hee Lee Byung joon Choi Dae chul
Choi Si won Kang So ra Gong Myung
Shin Da eun Lee Jae hwang Kim Hae in Seo Do young
Yeo Jin goo Lee Yeon hee Ahn Jae hyun
Go Hyun jung Park Jin hee Lee Jin wook Shin Sung rok Bong Tae gyu Park Ki woong Jung Eun chae Yoon Jong hoon
My plan was to first insert commas between the names and then split and explode() the Starring column, but I am stuck on finding a way to separate the actors in each row reliably.
I tried this, but it did not put commas between the names:
df['Starring'] = df['Starring'].apply(lambda x: str(x) + ',')
So I tried this instead. It also failed to add commas between the names:
df.Starring.apply(lambda x: x + ',' if len(x.split(' ')) == 2 else x)
I also experimented with regular expressions via df.Starring.str.replace(), to no avail.
Desired output: this is how I want it to look after adding the commas, using them to split and explode the Starring column into Actor_1, Actor_2, etc., and dropping the Starring column.
Actor_1        Actor_2        Actor_3       Actor_4
Choi Kang hee  Kwon Sang woo  none          none
Kim Jun myeon  Ha Yeon soo    Oh Chang suk  Kim Ye won
Yoon Shi yoon  Cho Soo hyang  none          none
Kim Jaewon     Kim Ha neul    none          none
No Min woo     Hong Ah reum   none          none
Jang Hyuk      Lee Da hae     none          none
Answer:
You can use str.findall to extract all actors. The pattern treats each name as two words, optionally followed by a lowercase third word:
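import pandas as pd

# Match "Word Word" plus an optional lowercase third word, then expand the lists into columns
out = (df['Starring'].str.findall(r'(\w+ \w+(?: [a-z]+)?)').apply(pd.Series)
                     .rename(columns=lambda x: f"Actor_{x + 1}"))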
print(out)
# Output
    Actor_1        Actor_2        Actor_3        Actor_4        Actor_5       Actor_6        Actor_7         Actor_8
0   Choi Kang hee  Kwon Sang woo  NaN            NaN            NaN           NaN            NaN             NaN
1   Kim Jun myeon  Ha Yeon soo    Oh Chang suk   Kim Ye won     NaN           NaN            NaN             NaN
2   Yoon Shi yoon  Cho Soo hyang  NaN            NaN            NaN           NaN            NaN             NaN
3   Kim Jaewon     Kim Ha neul    NaN            NaN            NaN           NaN            NaN             NaN
4   No Min woo     Hong Ah reum   NaN            NaN            NaN           NaN            NaN             NaN
5   Jang Hyuk      Lee Da hae     NaN            NaN            NaN           NaN            NaN             NaN
6   So Ji sub      Kim Ha neul    Yoon Kye sang  NaN            NaN           NaN            NaN             NaN
7   Kim So hyun    Na In woo      Lee Ji hoon    Choi Yu hwa    NaN           NaN            NaN             NaN
8   Ji Hyun woo    Lee Si young   Kim Jin yeop   Yoon Joo hee   NaN           NaN            NaN             NaN
9   Uhm Tae woong  Lee Si young   Lee Soo hyuk   NaN            NaN           NaN            NaN             NaN
10  Doh Kyung soo  Nam Ji hyun    Jo Sung ha     Jo Han chul    Kim Seon ho   Han So hee     Kim Jae young   NaN
11  Sung Yu ri     Jung Gyu woon  Kim Min jun    Min Hyo rin    NaN           NaN            NaN             NaN
12  Yoon Joo sang  Hong Eun hee   Jeon Hye bin   Kim Kyung nam  Go Won hee    Lee Bo hee     Lee Byung joon  Choi Dae chul
13  Choi Si won    Kang So ra     Gong Myung     NaN            NaN           NaN            NaN             NaN
14  Shin Da eun    Lee Jae hwang  Kim Hae in     Seo Do young   NaN           NaN            NaN             NaN
15  Yeo Jin goo    Lee Yeon hee   Ahn Jae hyun   NaN            NaN           NaN            NaN             NaN
16  Go Hyun jung   Park Jin hee   Lee Jin wook   Shin Sung rok  Bong Tae gyu  Park Ki woong  Jung Eun chae   Yoon Jong hoon
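
To get from this to the layout asked for in the question (Actor_N columns in place of Starring, with "none" for empty slots), something like the following might work. This is only a sketch: the sample frame and its Title column are made up for illustration, and it assumes every Starring value is a string that follows the same capitalization convention as the rows above (each name starts with two words, and any third part is lowercase).

```python
import pandas as pd

# Hypothetical sample frame; the "Title" column is invented for illustration.
df = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Starring": [
        "Choi Kang hee Kwon Sang woo",
        "Kim Jaewon Kim Ha neul",
        "So Ji sub Kim Ha neul Yoon Kye sang",
    ],
})

# Two words, optionally followed by a lowercase third word.
pattern = r"\w+ \w+(?: [a-z]+)?"

# Each row becomes a list of actor names.
names = df["Starring"].str.findall(pattern)

# One column per actor; shorter rows are padded with missing values.
actors = pd.DataFrame(names.tolist(), index=df.index)
actors.columns = [f"Actor_{i + 1}" for i in actors.columns]

# Replace Starring with the Actor_N columns, using "none" for missing slots.
result = df.drop(columns="Starring").join(actors.fillna("none"))
print(result)
```

Note that the pattern depends entirely on capitalization. A single-word name, or a name whose third part is capitalized, would be split incorrectly, so it is worth spot-checking the result against the original column.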