I am trying to extract the second URL (the one starting with https://www) from the string below. The string is stored in a pandas DataFrame column.
https://google.com/url?q=https://www.accenture.com/in-en/insights/software-platforms/core-banking-on-cloud&sa=U&ved=2ahUKEwiQ75fwvYD1AhXOYMAKHXofCeoQFnoECAgQAg&usg=AOvVaw02sP402HcesId4vbgOaspD
So I want to extract the following string and store it in a separate column.
https://www.accenture.com/in-en/insights/software-platforms/core-banking-on-cloud&sa=U&ved=2ahUKEwiQ75fwvYD1AhXOYMAKHXofCeoQFnoECAgQAg&usg=AOvVaw02sP402HcesId4vbgOaspD
Final DataFrame:
sr.no  link_orig           link_extracted
1      <the above string>  <the extracted string that starts from https://www.accenture.com>
Below is the code snippet:
df['link_extracted'] = df['link_orig'].str.contains('www.accenture.com', regex=False, na=np.NaN)
I am getting the following error:
ValueError: Cannot mask with non-boolean array containing NA / NaN values
What am I missing here? If I have to use a regex, what should the approach be?
CodePudding user response:
The error message means you probably have NaNs in the link_orig column. That can be fixed by adding a fillna('') to your code.
Something like
df['link_extracted'] = df['link_orig'].fillna('').str.contains ...
That said, I'm not sure the rest of your code will do what you want. str.contains just returns True if www.accenture.com is anywhere in the link_orig string; it does not return the matched text.
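To illustrate the point about str.contains, here is a minimal sketch. The sample Series and the variable names are made up for illustration, and it assumes pandas is installed. It shows that str.contains yields booleans rather than the matched text, and that passing na=False gives a clean boolean mask even when the column has missing values.

```python
import pandas as pd

# Hypothetical sample data: one matching link, one missing value, one non-match.
links = pd.Series([
    "https://google.com/url?q=https://www.accenture.com/in-en",
    None,
    "https://example.com/",
])

# na=False turns missing values into False, so the result is a plain
# boolean Series that is safe to use for filtering.
mask = links.str.contains("www.accenture.com", regex=False, na=False)

# The mask selects rows; it never contains the URL itself.
matching_links = links[mask]
```

So the mask is useful for selecting rows, but a separate extraction step is still needed to get the embedded URL.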
If the link you are trying to extract always contains www.accenture.com, then you can do this (note the raw string, so the backslashes reach the regex engine unchanged):
df['link_extracted'] = df['link_orig'].fillna('').str.extract(r'(www\.accenture\.com.*)')
CodePudding user response:
Personally, I'd use Series.str.extract() for this. E.g.:
df['link_extracted'] = df['link_orig'].str.extract('http.*(http.*)')
This matches http, followed by anything, then captures http followed by anything.
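The behaviour of that pattern can be checked on a plain string with the standard re module. This is only a sketch: the sample URL is shortened, and the three-URL string is invented to show one caveat. Because the first .* is greedy, the capture group starts at the last http in the string, which is the second one only when there are exactly two.

```python
import re

# Shortened version of the URL from the question (query string truncated).
url = ("https://google.com/url?q=https://www.accenture.com/in-en/insights/"
       "software-platforms/core-banking-on-cloud&sa=U")

# Greedy first ".*": the group captures from the LAST "http" onward.
inner = re.search(r"http.*(http.*)", url).group(1)

# Invented string with three URLs, to show the greedy/lazy difference.
nested = "http://a?x=http://b?y=http://c"
greedy = re.search(r"http.*(http.*)", nested).group(1)   # starts at last http
lazy = re.search(r"http.*?(http.*)", nested).group(1)    # starts at second http
```

If the embedded link can itself contain another URL, the lazy form http.*?(http.*) is probably the safer choice for capturing from the second occurrence.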
An alternate approach would be to use urlparse.
CodePudding user response:
You can use the urllib.parse module:
import pandas as pd
from urllib.parse import urlparse, parse_qs
url = 'https://google.com/url?q=https://www.accenture.com/in-en/insights/software-platforms/core-banking-on-cloud&sa=U&ved=2ahUKEwiQ75fwvYD1AhXOYMAKHXofCeoQFnoECAgQAg&usg=AOvVaw02sP402HcesId4vbgOaspD'
df = pd.DataFrame({'sr.no':[1], 'link_orig':[url]})
extract_q = lambda url: parse_qs(urlparse(url).query)['q'][0]
df['link_extracted'] = df['link_orig'].apply(extract_q)
Output:
>>> df
sr.no link_orig link_extracted
0 1 https://google.com/url?q=https://www.accenture... https://www.accenture.com/in-en/insights/softw...
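Since the original error came from NaN values in link_orig, a slightly more defensive variant of the lambda above might look like the sketch below. The function name extract_q_safe is made up here, and returning None for missing or non-matching rows is an assumption about what you want in those cases.

```python
from urllib.parse import urlparse, parse_qs


def extract_q_safe(url):
    """Return the first value of the 'q' query parameter, or None."""
    # NaN is a float, so any non-string value is treated as missing.
    if not isinstance(url, str):
        return None
    # .get avoids a KeyError when the URL has no 'q' parameter.
    values = parse_qs(urlparse(url).query).get("q")
    return values[0] if values else None


sample = "https://google.com/url?q=https://www.accenture.com/in-en&sa=U"
result = extract_q_safe(sample)
```

It can be applied the same way, e.g. df['link_orig'].apply(extract_q_safe). Note that parse_qs returns only the value of q, so the trailing &sa=...&ved=... parameters from the Google redirect are not part of the result, unlike the regex approaches, which keep everything up to the end of the string.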