How to efficiently subset from a cell in pandas

I have the following problem: the column agent_string contains long strings like this:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36

I need to create a new column typ_browser that holds "Chrome" if the word "Chrome" appears in agent_string, and otherwise the last word of agent_string. I came up with this solution, which is unfortunately very slow because I have 500,000 rows:

data["typ_browser"] = ""

def zjisti_browser(agent):
    rozdeleno = agent.split(" ")
    if "Chrome" in agent:
        return "Chrome"
    return rozdeleno[-1]

for i in range(len(data["agent_string"])):
    data["typ_browser"][i] = zjisti_browser(data["agent_string"][i])

Is there something faster?

CodePudding user response：

Use numpy.where with a condition built by Series.str.contains; where there is no match, take the last value after splitting on whitespace:

import numpy as np

data["typ_browser"] = np.where(data["agent_string"].str.contains('Chrome'),
                               'Chrome',
                               data["agent_string"].str.split().str[-1])

Or use Series.str.extract to capture the last run of non-whitespace characters:

data["typ_browser"] = np.where(data["agent_string"].str.contains('Chrome'), 
                               'Chrome',
                               data["agent_string"].str.extract(r'(\S*$)', expand=False))
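To check that the two variants agree, here is a small self-contained sketch. The three sample user-agent strings and the names has_chrome, via_split and via_extract are made up for illustration; only agent_string and typ_browser come from the question. Keep in mind that np.where receives both branches already computed, so the split (or extract) still runs on every row, including the Chrome ones.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column name follows the question.
data = pd.DataFrame({
    "agent_string": [
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0",
        "Opera/9.80 (Windows NT 6.1) Presto/2.12.388 Version/12.18",
    ]
})

# Boolean Series: True where the string contains "Chrome".
has_chrome = data["agent_string"].str.contains("Chrome")

# Variant 1: last whitespace-separated token.
via_split = np.where(has_chrome, "Chrome",
                     data["agent_string"].str.split().str[-1])

# Variant 2: regex capturing the trailing run of non-whitespace characters.
via_extract = np.where(has_chrome, "Chrome",
                       data["agent_string"].str.extract(r"(\S*$)", expand=False))

data["typ_browser"] = via_split
print(data["typ_browser"].tolist())
```

On strings without trailing whitespace, as here, both variants should give the same values.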

CodePudding user response：

Since you're using the values from one column (Series) to create another, you can use map with your function to create the new column.

data["typ_browser"] = data["agent_string"].map(zjisti_browser)

This applies the function zjisti_browser to each element of the agent_string column and returns a new Series, which is then assigned as the new column.
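A runnable sketch of the map approach, using the function from the question. The sample rows are invented for illustration, and the missing value plus na_action="ignore" are my own addition (an assumption about what real data might contain): without it, map would call the function on the missing value and fail on .split.

```python
import pandas as pd

def zjisti_browser(agent):
    # Function from the question, unchanged.
    rozdeleno = agent.split(" ")
    if "Chrome" in agent:
        return "Chrome"
    return rozdeleno[-1]

# Hypothetical sample data, including one missing value.
data = pd.DataFrame({
    "agent_string": [
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0",
        None,
    ]
})

# na_action="ignore" skips missing values instead of passing them to the function.
data["typ_browser"] = data["agent_string"].map(zjisti_browser, na_action="ignore")
print(data["typ_browser"].tolist())
```

If the column has no missing values, the plain data["agent_string"].map(zjisti_browser) call is enough.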

Also, note that your function splits the string before checking for "Chrome", so the split is wasted whenever the check succeeds. You can split only when you actually need the last word:

def zjisti_browser(agent):
    if "Chrome" in agent:
        return "Chrome"
    return agent.split()[-1]

This is a small performance improvement whichever way you call the function, in the loop or through map.
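The same idea of splitting only where needed can also be written in vectorized form with a boolean mask. This is a sketch of a different technique (mask plus .loc assignment), not something taken from the answers above; the sample rows and the name is_chrome are made up, and whether it is actually faster on your data is something you would have to measure.

```python
import pandas as pd

# Hypothetical sample data; column names follow the question.
data = pd.DataFrame({
    "agent_string": [
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0",
    ]
})

# Rows that mention Chrome (plain substring match, no regex).
is_chrome = data["agent_string"].str.contains("Chrome", regex=False)

# Start with "Chrome" everywhere, then overwrite only the non-Chrome rows,
# so the split runs just on the rows that need it.
data["typ_browser"] = "Chrome"
data.loc[~is_chrome, "typ_browser"] = (
    data.loc[~is_chrome, "agent_string"].str.split().str[-1]
)
print(data["typ_browser"].tolist())
```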