I am working on a baseball analysis project where I web-scrape the real-time lineups for a given team, on a given date.
I am currently facing an issue with the names that I receive in the scraped dataframe: in seemingly random cases, a player's name comes in a different format and is unusable. (I take the player name and pass it into a statistics function, which only works if the player's name is formatted correctly.)
Example:
Freddie Freeman
Ozzie Albies
Ronald Acuna
Austin RileyA. A.Riley
Dansby Swanson
Adam Duvall
Joc PedersonJ. J.Pederson
As you can see, most of the names are formatted normally. In a few cases, however, the player's full name is followed immediately (with no space) by their first initial and a period, and then by their first initial and last name. If I could turn "Austin RileyA. A.Riley" into "Austin Riley", then everything would work.
This is a consistent theme throughout all teams and data that I pull: sometimes there are a few players whose names are formatted in exactly this way, i.e. FirstName LastName, then FirstInitial., then FirstInitial.LastName.
I am trying to figure out a way to reformat the names so that they are usable, and to do so in a way that generalizes to any possible name.
CodePudding user response:
If the theme is really consistent you could do something like this:
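name_list = ['Freddie Freeman',
             'Ozzie Albies',
             'Ronald Acuna',
             'Austin RileyA. A.Riley ',
             'Dansby Swanson',
             'Adam Duvall',
             'Joc PedersonJ. J.Pederson']
new_list = []
for n in name_list:
    dot = n.find('.')
    # find() returns -1 when there is no '.', so only slice the malformed names;
    # dot - 1 also drops the stray initial glued onto the last name
    new_list.append(n[:dot - 1] if dot != -1 else n.strip())
print(new_list)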
There are several ways to achieve this (including regex, which I would not recommend). The example I have posted is the best in my opinion (see the find() documentation).
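One caveat with cutting at the first '.': a name that legitimately contains a period (for example "J.D. Martinez") would be truncated. If that matters for your data, here is a minimal regex sketch as an alternative. It assumes the junk suffix always looks like "X. X.Something" with the same initial repeated; the helper name clean_name and the pattern are only illustrative, not part of the answer above.

```python
import re

# Assumed shape of the scraped suffix: "<initial>. <same initial>.<rest>",
# glued onto the end of the full name, e.g. "A. A.Riley".
# The backreference \1 requires both initials to be the same letter.
SUFFIX = re.compile(r'([A-Z])\.\s+\1\..*$')

def clean_name(raw):
    """Strip the duplicated abbreviated suffix if present, then trim whitespace."""
    return SUFFIX.sub('', raw).strip()

name_list = ['Freddie Freeman',
             'Austin RileyA. A.Riley ',
             'Joc PedersonJ. J.Pederson',
             'J.D. Martinez']
cleaned = [clean_name(n) for n in name_list]
print(cleaned)
```

Names without the suffix pass through unchanged, including ones with periods of their own. If the names live in a pandas column, something like df['name'].map(clean_name) should apply the same cleanup (the column name here is hypothetical).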