I have a column in python Name
that reads in football team names and a number such as New Orleans Saints 2
I parse the data into 2 separate & additional columns as Name2
: New Orleans Saints
& Name3
: 2
I do this by using this line of code: NameSeparate=dt['Name'].str.split(r'\s(?:([ -]?\d{,100}\.?\d{1,100}?))', expand=True)
Then dt[Name2]
= NameSeparate[0]
& dt[Name3]
= NameSeparate[1]
separates and creates the 2 new columns
I have an issue when the code pulls in San Francisco 49ers 5
as the code parses the 49
instead of the 5
I've tried writing a check loop in but can't get the script to pass through the 49
and just pull the 5
The name will always come in as San Francisco 49ers
however the 2nd piece can have either a
or -
sign in addition to any number following, for example it may come in as: San Francisco 49ers 5
, San Francisco 49ers -5
, San Francisco 49ers 8.5
, San Francisco 49ers -8.5
, etc.
Can someone please help me write a check for this issue? Thank you!
CodePudding user response:
In case there is always a
or a -
in front of the number which should be in the second column, you should remove the ?
after [ -]
in your regex for splitting the original column:
NameSeparate=dt['Name'].str.split(r'\s(?:([ -]\d{,100}\.?\d{1,100}?))', expand=True)
Otherwise, you also select numbers where no such sign is placed before, as you already noticed.
If there are also entries where the number has no
or -
before it, you can add an $
to the very end of your regex string, which denotes the end of the string. Since the number seems to always come last, this also should work:
NameSeparate=dt['Name'].str.split(r'\s(?:([ -]?\d{,100}\.?\d{1,100}?))$', expand=True)