Extract the second instance of the website pattern in a string using pandas str.contains

Time:12-26

I am trying to extract the second URL (the one starting with https://www) from the string below. The string is stored in a pandas DataFrame column.

https://google.com/url?q=https://www.accenture.com/in-en/insights/software-platforms/core-banking-on-cloud&sa=U&ved=2ahUKEwiQ75fwvYD1AhXOYMAKHXofCeoQFnoECAgQAg&usg=AOvVaw02sP402HcesId4vbgOaspD

So I want to extract the following string and store it in a separate column.

https://www.accenture.com/in-en/insights/software-platforms/core-banking-on-cloud&sa=U&ved=2ahUKEwiQ75fwvYD1AhXOYMAKHXofCeoQFnoECAgQAg&usg=AOvVaw02sP402HcesId4vbgOaspD

Final Dataframe:

sr.no    link_orig             link_extracted
  1      <the above string>    <the extracted string that starts from https://www.accenture.com>

Below is the code snippet:

df['link_extracted'] = df['link_orig'].str.contains('www.accenture.com', regex=False, na=np.NaN)

I am getting the following error:

ValueError: Cannot mask with non-boolean array containing NA / NaN values

What am I missing here? If I have to use a regex, what should the approach be?

CodePudding user response:

The error message means you probably have NaNs in the link_orig column. That can be fixed by adding a fillna('') to your code.

Something like

df['link_extracted'] = df['link_orig'].fillna('').str.contains ...

That said, I'm not sure the rest of your code will do what you want. str.contains only returns True if www.accenture.com appears anywhere in the link_orig string; it does not extract anything.
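To illustrate the point about str.contains, here is a minimal sketch. The sample rows and the names mask and matching_rows are made up for illustration. It passes na=False so that missing values count as "no match", which is one way to get a purely boolean result that is safe to use as a mask.

```python
import pandas as pd

# Hypothetical sample data: a matching URL, a non-matching URL, and a missing value.
df = pd.DataFrame({
    'link_orig': [
        'https://google.com/url?q=https://www.accenture.com/in-en/insights',
        'https://google.com/url?q=https://example.com/page',
        None,
    ]
})

# str.contains answers "does the text occur?" for each row; it does not extract it.
# na=False turns missing values into False instead of NaN.
mask = df['link_orig'].str.contains('www.accenture.com', regex=False, na=False)

print(mask.tolist())

# A mask without NaN can be used for row selection.
matching_rows = df[mask]
print(len(matching_rows))
```

The mask is useful for filtering rows, but a separate extraction step is still needed to fill link_extracted.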

If the link you are trying to extract always contains www.accenture.com then you can do this

df['link_extracted'] = df['link_orig'].fillna('').str.extract(r'(www\.accenture\.com.*)')
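Note that this pattern starts the match at www, so the https:// prefix is dropped. The sketch below is one possible variation, not the answer's exact code: it adds an optional https?:// prefix to the pattern and uses expand=False so that a Series is returned. The URL is a shortened stand-in for the one in the question.

```python
import pandas as pd

# Shortened, hypothetical version of the URL from the question, plus a missing value.
url = ('https://google.com/url?q=https://www.accenture.com/in-en/insights/'
       'software-platforms/core-banking-on-cloud&sa=U')
df = pd.DataFrame({'link_orig': [url, None]})

# Assumption: the wanted link always contains www.accenture.com.
# The scheme is part of the capture group so the result starts at "https://".
pattern = r'(https?://www\.accenture\.com.*)'

# expand=False returns a Series for a single capture group; rows without a match become NaN.
df['link_extracted'] = df['link_orig'].str.extract(pattern, expand=False)

print(df['link_extracted'].iloc[0])
```

With this approach str.extract leaves missing rows as NaN on its own, so fillna('') is only needed if empty strings are preferred.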

CodePudding user response:

Personally, I'd use Series.str.extract() for this. For example:

df['link_extracted'] = df['link_orig'].str.extract('http.*(http.*)')

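This matches http, followed by anything, then captures http followed by anything. Because the first .* is greedy, the captured group starts at the last http in the string, which is the second one when the string contains exactly two URLs.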
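The behaviour of this pattern can be checked with the standard re module, which uses the same regex syntax. This is only a sketch: the URL is shortened from the question, and the three-link string is a made-up example to show where greedy and non-greedy matching differ.

```python
import re

# Shortened, hypothetical version of the URL from the question (two links).
url = ('https://google.com/url?q=https://www.accenture.com/in-en/insights/'
       'software-platforms/core-banking-on-cloud&sa=U')

# Greedy: the first .* consumes as much as possible, so the group starts at the LAST "http".
second_link = re.search(r'http.*(http.*)', url).group(1)
print(second_link)

# Made-up string with three links to show the difference.
three = 'http://a?u=http://b?v=http://c'

# Greedy pattern captures from the last "http".
greedy = re.search(r'http.*(http.*)', three).group(1)

# Non-greedy .*? stops at the first following "http", i.e. the second instance.
lazy = re.search(r'http.*?(http.*)', three).group(1)

print(greedy)
print(lazy)
```

If the strings may contain more than two links and the second one is wanted, the non-greedy form http.*?(http.*) is probably the safer choice.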

An alternative approach would be to use urlparse from urllib.parse.

CodePudding user response:

You can use the urllib.parse module:

import pandas as pd
from urllib.parse import urlparse, parse_qs

url = 'https://google.com/url?q=https://www.accenture.com/in-en/insights/software-platforms/core-banking-on-cloud&sa=U&ved=2ahUKEwiQ75fwvYD1AhXOYMAKHXofCeoQFnoECAgQAg&usg=AOvVaw02sP402HcesId4vbgOaspD'
df = pd.DataFrame({'sr.no':[1], 'link_orig':[url]})

extract_q = lambda url: parse_qs(urlparse(url).query)['q'][0]
df['link_extracted'] = df['link_orig'].apply(extract_q)

Output:

>>> df
   sr.no                                          link_orig                                     link_extracted
0      1  https://google.com/url?q=https://www.accenture...  https://www.accenture.com/in-en/insights/softw...
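The lambda above raises an error for rows that are NaN or have no q parameter. A more defensive variant could look like the sketch below; returning None for such rows is an assumption about the desired behaviour, not something the question specifies.

```python
from urllib.parse import urlparse, parse_qs


def extract_q(url):
    """Return the first 'q' query parameter of url, or None if it is unavailable."""
    # Missing values in a pandas column arrive as NaN (a float) or None, not as str.
    if not isinstance(url, str):
        return None
    # parse_qs maps each parameter name to a list of values; .get avoids a KeyError.
    values = parse_qs(urlparse(url).query).get('q')
    return values[0] if values else None


# Shortened, hypothetical version of the URL from the question.
sample = 'https://google.com/url?q=https://www.accenture.com/in-en/insights&sa=U'
print(extract_q(sample))
print(extract_q(float('nan')))
print(extract_q('https://example.com/'))
```

It can be applied the same way, with df['link_orig'].apply(extract_q). Note that parse_qs returns only the value of q, so the trailing &sa=...&ved=...&usg=... parameters of the outer URL are not part of the result, unlike with the regex approaches.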