I have a csv file that looks like the following:
Halley Bailey - 1998
Hayley Orrantia (1994-) American actress, singer, and songwriter
Ken Watanabe (actor)
etc...
I’d like to remove the items in parentheses, as well as everything after a comma in the entries that contain one, so that the dataframe looks like this:
Halley Bailey
Hayley Orrantia
Ken Watanabe
I tried the following code, which removes the dates after the name but not the parentheses or the text after commas. How could I expand it so that it removes all of these items?
regex = '|'.join(map(re.escape, df['actors']))
CodePudding user response:
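Try the pattern '(^[^(\-]+)', which captures everything from the start of the string up to the first - or (. The trailing space left before the delimiter is removed with .str.strip():
df['Full Name'] = df['Description'].str.extract(r'(^[^(\-]+)', expand=False).str.strip()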
Returning:
Description Full Name
0 Halley Bailey - 1998 Halley Bailey
1 Hayley Orrantia (1994-) American actress, sing... Hayley Orrantia
2 Ken Watanabe (actor) Ken Watanabe
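The question also asks for text after a comma to be dropped, which the pattern above does not cover. A minimal sketch of one way to extend it, assuming a comma never appears inside a name; the fourth row is a hypothetical entry added only to exercise the comma case:

```python
import pandas as pd

df = pd.DataFrame({
    "Description": [
        "Halley Bailey - 1998",
        "Hayley Orrantia (1994-) American actress, singer, and songwriter",
        "Ken Watanabe (actor)",
        "Hayley Orrantia, American actress",  # hypothetical row, not in the question
    ]
})

# Capture everything from the start of the string up to the first "(", "," or "-",
# then strip the space left before the delimiter.
df["Full Name"] = (
    df["Description"]
    .str.extract(r"^([^(,\-]+)", expand=False)
    .str.strip()
)

print(df["Full Name"].tolist())
# ['Halley Bailey', 'Hayley Orrantia', 'Ken Watanabe', 'Hayley Orrantia']
```

Note that this also cuts at a hyphen inside a name, so hyphenated names would need a stricter pattern.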
CodePudding user response:
Assuming that the CSV content is stored in the column csv of the dataframe df, and that df looks like the following (if one doesn't know how to read a CSV into a pandas DataFrame, see the Notes below):
csv
0 Halley Bailey - 1998
1 Hayley Orrantia (1994-) American actress, sing...
2 Ken Watanabe (actor)
If one wants to create a new column named actors, and an actor's full name always consists of exactly two words, the following will do the work:
df['actors'] = df['csv'].str.split(' ').str[:2].str.join(' ')
[Out]:
csv actors
0 Halley Bailey - 1998 Halley Bailey
1 Hayley Orrantia (1994-) American actress, sing... Hayley Orrantia
2 Ken Watanabe (actor) Ken Watanabe
If, on the other hand, one doesn't want to create a new column, one can do the following:
df['csv'] = df['csv'].str.split(' ').str[:2].str.join(' ')
[Out]:
csv
0 Halley Bailey
1 Hayley Orrantia
2 Ken Watanabe
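The two-word assumption is the weak point of the split approach. As a rough sketch of an alternative that does not depend on the number of words, one could instead delete everything from the first "(", ",", or space-delimited "-" onward with str.replace. The fourth row is a hypothetical entry added to show the difference, and the pattern is only one possible choice:

```python
import pandas as pd

df = pd.DataFrame({
    "csv": [
        "Halley Bailey - 1998",
        "Hayley Orrantia (1994-) American actress, singer, and songwriter",
        "Ken Watanabe (actor)",
        "Helena Bonham Carter (actress)",  # hypothetical row, not in the question
    ]
})

# Original approach: keep only the first two space-separated words.
two_words = df["csv"].str.split(" ").str[:2].str.join(" ")

# Alternative: remove the first "(", "," or " - " and everything after it,
# together with any whitespace directly in front of it.
trimmed = df["csv"].str.replace(r"\s*(?:[(,]|\s-\s).*$", "", regex=True)

print(two_words.tolist())
# ['Halley Bailey', 'Hayley Orrantia', 'Ken Watanabe', 'Helena Bonham']
print(trimmed.tolist())
# ['Halley Bailey', 'Hayley Orrantia', 'Ken Watanabe', 'Helena Bonham Carter']
```

Because the hyphen must be surrounded by whitespace to count as a delimiter, a hyphenated name is left intact.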
Notes:
If one doesn't know how to read a .CSV file as a pandas DataFrame, this should be relevant: Import CSV file as a Pandas DataFrame (particularly this answer).
Depending on the dataframe, one might have to adjust the column names.
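Since some of the lines in the question contain commas, a comma-separated read would treat those commas as field separators. A minimal sketch of loading the file one line per row instead; io.StringIO stands in here for an opened file, and the column name csv simply follows the assumption made above:

```python
import io

import pandas as pd

# Stand-in for an opened file such as open("actors.csv"); that file name is hypothetical.
raw = io.StringIO(
    "Halley Bailey - 1998\n"
    "Hayley Orrantia (1994-) American actress, singer, and songwriter\n"
    "Ken Watanabe (actor)\n"
)

# Keep each non-empty line as a single value, commas included.
lines = [line.strip() for line in raw if line.strip()]
df = pd.DataFrame({"csv": lines})

print(df.shape)  # (3, 1)
```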