I have the following problem. The column agent_string contains long strings like this:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36
I need to create a new column typ_browser that contains the last word of agent_string, or "Chrome" if the word "Chrome" appears anywhere in agent_string. I came up with this solution, which is unfortunately very slow because I have 500,000 rows:
data["typ_browser"] = ""
def zjisti_browser(agent):
    rozdeleno = agent.split(" ")
    if "Chrome" in agent:
        return "Chrome"
    return rozdeleno[-1]
for i in range(len(data["agent_string"])):
    data["typ_browser"][i] = zjisti_browser(data["agent_string"][i])
Is there something faster?
CodePudding user response:
Use numpy.where with a condition built by Series.str.contains; where there is no match, take the last value after splitting on whitespace:
import numpy as np
data["typ_browser"] = np.where(data["agent_string"].str.contains('Chrome'),
                               'Chrome',
                               data["agent_string"].str.split().str[-1])
Or:
data["typ_browser"] = np.where(data["agent_string"].str.contains('Chrome'),
                               'Chrome',
                               data["agent_string"].str.extract(r'(\S*$)', expand=False))
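For reference, here is a small self-contained sketch of both variants. The sample frame is made up for illustration: only the first user-agent string comes from the question, the other two are assumptions. It also passes regex=False to str.contains, which is optional here since 'Chrome' has no special regex characters.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the first string is taken from the question.
data = pd.DataFrame({
    "agent_string": [
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0",
        "curl/7.79.1",
    ]
})

# Boolean mask: True where the literal substring "Chrome" occurs.
has_chrome = data["agent_string"].str.contains("Chrome", regex=False)

# Variant 1: last whitespace-separated token via split.
via_split = np.where(has_chrome, "Chrome",
                     data["agent_string"].str.split().str[-1])

# Variant 2: last run of non-whitespace characters via a regex.
via_extract = np.where(has_chrome, "Chrome",
                       data["agent_string"].str.extract(r"(\S*$)", expand=False))

data["typ_browser"] = via_split
```

Note that np.where receives both candidate arrays fully computed, so the split (or extract) still runs on every row; the speedup comes from doing that work in vectorized string methods instead of a Python-level loop with element-wise assignment.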
CodePudding user response:
Since you're using the values of one column (a Series) to create another, you can pass your function to Series.map to create the new column.
data["typ_browser"] = data["agent_string"].map(zjisti_browser)
This applies zjisti_browser to each element of agent_string and returns a new Series, which is then assigned as the new column.
Also, note that your function splits every string, even when "Chrome" is found and the split result is never used. You can move the split after the check:
def zjisti_browser(agent):
    if "Chrome" in agent:
        return "Chrome"
    return agent.split()[-1]
This is a small performance improvement for any approach that calls the function, whether the original loop or map.
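As a quick check, here is a minimal sketch of the map approach with the trimmed function on a tiny made-up frame (the sample strings are assumptions for illustration, not data from the question):

```python
import pandas as pd

def zjisti_browser(agent):
    # Return early so the split only runs for strings without "Chrome".
    if "Chrome" in agent:
        return "Chrome"
    return agent.split()[-1]

# Hypothetical sample data for illustration.
data = pd.DataFrame({
    "agent_string": [
        "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 Chrome/96.0.4664.55 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0",
    ]
})

# One pass over the column, one assignment of the whole result.
data["typ_browser"] = data["agent_string"].map(zjisti_browser)
```

Assigning the whole mapped Series at once also avoids the element-by-element chained assignment data["typ_browser"][i] = ... from the original loop, which is a large part of why that loop is slow.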