I am trying to separate a list of actors in the Starring column. This is what it looks like now.
df.Starring.value_counts().head(50)
none
Choi Kang hee Kwon Sang woo
Kim Jun myeon Ha Yeon soo Oh Chang suk Kim Ye won
Yoon Shi yoon Cho Soo hyang
Kim Jaewon Kim Ha neul
No Min woo Hong Ah reum
Jang Hyuk Lee Da hae
So Ji sub Kim Ha neul Yoon Kye sang
Kim So hyun Na In woo Lee Ji hoon Choi Yu hwa
Ji Hyun woo Lee Si young Kim Jin yeop Yoon Joo hee
Uhm Tae woong Lee Si young Lee Soo hyuk
Doh Kyung soo Nam Ji hyun Jo Sung ha Jo Han chul Kim Seon ho Han So hee Kim Jae young
Sung Yu ri Jung Gyu woon Kim Min jun Min Hyo rin
Yoon Joo sang Hong Eun hee Jeon Hye bin Kim Kyung nam Go Won hee Lee Bo hee Lee Byung joon Choi Dae chul
Choi Si won Kang So ra Gong Myung
Shin Da eun Lee Jae hwang Kim Hae in Seo Do young
Yeo Jin goo Lee Yeon hee Ahn Jae hyun
Go Hyun jung Park Jin hee Lee Jin wook Shin Sung rok Bong Tae gyu Park Ki woong Jung Eun chae Yoon Jong hoon
I have tried to separate the names with a comma first and then explode() the Starring column, but I am stuck on finding a way to separate the actors in each row correctly.
I tried this and it did nothing. It did not even add the commas:
df['Starring'] = df['Starring'].apply(lambda x: str(x) + ',')
This one did nothing either. It also did not add the comma between names:
df.Starring.apply(lambda x: x + ',' if len(x.split(' ')) == 2 else x)
I also tried experimenting with regex through df.Starring.str.replace(), to no avail.
This is how I want it to look after I have added the commas and used them to split the Starring column into Actor_1, Actor_2, etc., and dropped the Starring column.
Actor_1        Actor_2        Actor_3       Actor_4
Choi Kang hee  Kwon Sang woo  none          none
Kim Jun myeon  Ha Yeon soo    Oh Chang suk  Kim Ye won
Yoon Shi yoon  Cho Soo hyang  none          none
Kim Jaewon     Kim Ha neul    none          none
No Min woo     Hong Ah reum   none          none
Jang Hyuk      Lee Da hae     none          none
Thanks again for your help.
CodePudding user response:
You can use .extractall() to extract the names (two title-cased words followed by an optional third lowercase word), and then use .pivot() to list out the names on a single row. Finally, you can rename all of the columns using a list comprehension:
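# One row per matched name, indexed by (original row, match number)
df = pd.DataFrame(df["Starring"].str.extractall(r"([A-Z][a-z]+ [A-Z][a-z]+(?: [a-z]+)?)")).reset_index()
# Spread the matches of each original row across columns
df = df.pivot(index=["level_0"], columns=["match"]).fillna("none")
# Match numbers start at 0, so shift them to get Actor_1, Actor_2, ...
df.columns = [f"Actor_{i + 1}" for i in df.columns.get_level_values(1)]
df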
This outputs:
         Actor_1        Actor_2        Actor_3        ...  Actor_6        Actor_7         Actor_8
level_0                                               ...
0        Choi Kang hee  Kwon Sang woo  none           ...  none           none            none
1        Kim Jun myeon  Ha Yeon soo    Oh Chang suk   ...  none           none            none
2        Yoon Shi yoon  Cho Soo hyang  none           ...  none           none            none
3        Kim Jaewon     Kim Ha neul    none           ...  none           none            none
4        No Min woo     Hong Ah reum   none           ...  none           none            none
5        Jang Hyuk      Lee Da hae     none           ...  none           none            none
6        So Ji sub      Kim Ha neul    Yoon Kye sang  ...  none           none            none
7        Kim So hyun    Na In woo      Lee Ji hoon    ...  none           none            none
8        Ji Hyun woo    Lee Si young   Kim Jin yeop   ...  none           none            none
9        Uhm Tae woong  Lee Si young   Lee Soo hyuk   ...  none           none            none
10       Doh Kyung soo  Nam Ji hyun    Jo Sung ha     ...  Han So hee     Kim Jae young   none
11       Sung Yu ri     Jung Gyu woon  Kim Min jun    ...  none           none            none
12       Yoon Joo sang  Hong Eun hee   Jeon Hye bin   ...  Lee Bo hee     Lee Byung joon  Choi Dae chul
13       Choi Si won    Kang So ra     Gong Myung     ...  none           none            none
14       Shin Da eun    Lee Jae hwang  Kim Hae in     ...  none           none            none
15       Yeo Jin goo    Lee Yeon hee   Ahn Jae hyun   ...  none           none            none
16       Go Hyun jung   Park Jin hee   Lee Jin wook   ...  Park Ki woong  Jung Eun chae   Yoon Jong hoon
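To see why this pattern splits the rows where it does, it can help to try it on a single string outside pandas. The following is only a minimal sketch using the standard re module; the helper name split_names is made up for illustration and is not part of the answer above.

```python
import re

# Two capitalised words, then an optional all-lowercase third word.
# The capturing group is dropped here because re.findall returns the
# whole match when the pattern has no groups.
NAME = re.compile(r"[A-Z][a-z]+ [A-Z][a-z]+(?: [a-z]+)?")


def split_names(starring):
    """Return the actor names found in one Starring string (hypothetical helper)."""
    return NAME.findall(starring)


print(split_names("Choi Kang hee Kwon Sang woo"))
print(split_names("Kim Jaewon Kim Ha neul"))
```

The pattern relies on every name starting with two capitalised words. A name that does not fit that shape (a single-word stage name, for example) would be skipped, so it is worth checking the rows that come back with fewer names than expected.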
CodePudding user response:
You can use str.findall to get all actors:
out = (df['Starring'].str.findall(r'(\w+ \w+(?: [a-z]+)?)').apply(pd.Series)
                     .rename(columns=lambda x: f"Actor_{x + 1}"))
print(out)
# Output
    Actor_1        Actor_2        Actor_3        Actor_4        Actor_5       Actor_6        Actor_7         Actor_8
0   Choi Kang hee  Kwon Sang woo  NaN            NaN            NaN           NaN            NaN             NaN
1   Kim Jun myeon  Ha Yeon soo    Oh Chang suk   Kim Ye won     NaN           NaN            NaN             NaN
2   Yoon Shi yoon  Cho Soo hyang  NaN            NaN            NaN           NaN            NaN             NaN
3   Kim Jaewon     Kim Ha neul    NaN            NaN            NaN           NaN            NaN             NaN
4   No Min woo     Hong Ah reum   NaN            NaN            NaN           NaN            NaN             NaN
5   Jang Hyuk      Lee Da hae     NaN            NaN            NaN           NaN            NaN             NaN
6   So Ji sub      Kim Ha neul    Yoon Kye sang  NaN            NaN           NaN            NaN             NaN
7   Kim So hyun    Na In woo      Lee Ji hoon    Choi Yu hwa    NaN           NaN            NaN             NaN
8   Ji Hyun woo    Lee Si young   Kim Jin yeop   Yoon Joo hee   NaN           NaN            NaN             NaN
9   Uhm Tae woong  Lee Si young   Lee Soo hyuk   NaN            NaN           NaN            NaN             NaN
10  Doh Kyung soo  Nam Ji hyun    Jo Sung ha     Jo Han chul    Kim Seon ho   Han So hee     Kim Jae young   NaN
11  Sung Yu ri     Jung Gyu woon  Kim Min jun    Min Hyo rin    NaN           NaN            NaN             NaN
12  Yoon Joo sang  Hong Eun hee   Jeon Hye bin   Kim Kyung nam  Go Won hee    Lee Bo hee     Lee Byung joon  Choi Dae chul
13  Choi Si won    Kang So ra     Gong Myung     NaN            NaN           NaN            NaN             NaN
14  Shin Da eun    Lee Jae hwang  Kim Hae in     Seo Do young   NaN           NaN            NaN             NaN
15  Yeo Jin goo    Lee Yeon hee   Ahn Jae hyun   NaN            NaN           NaN            NaN             NaN
16  Go Hyun jung   Park Jin hee   Lee Jin wook   Shin Sung rok  Bong Tae gyu  Park Ki woong  Jung Eun chae   Yoon Jong hoon
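The question also asks for "none" in the empty cells and for the Starring column to be dropped afterwards. One possible way to finish the job is sketched below. The sample frame and its Title column are made up for illustration; only the Starring values come from the question. It builds the actor columns with pd.DataFrame(names.tolist()) instead of .apply(pd.Series), which is a swapped-in variant rather than the code from the answer above.

```python
import pandas as pd

# Hypothetical sample frame; the Title column is an assumption.
df = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Starring": [
        "Choi Kang hee Kwon Sang woo",
        "Kim Jaewon Kim Ha neul",
        "So Ji sub Kim Ha neul Yoon Kye sang",
    ],
})

pattern = r"(\w+ \w+(?: [a-z]+)?)"

# One list of names per row, then one column per list position.
# Shorter lists are padded with missing values.
names = df["Starring"].str.findall(pattern)
actors = pd.DataFrame(names.tolist(), index=df.index)
actors.columns = [f"Actor_{i + 1}" for i in actors.columns]

# Use the placeholder string from the question for the empty cells.
actors = actors.fillna("none")

# Attach the new columns and drop the original Starring column.
result = df.drop(columns="Starring").join(actors)
print(result)
```

Because actors keeps the same index as df, the join lines the names up with their original rows even if df has a non-default index.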