How to substring with specific start and end positions where a set of characters appear?

Time:03-11

I am trying to clean data that I scraped from a set of links. I have over 100 links in a CSV file that need cleaning.

This is what a link looks like in the CSV:

"https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"

I've observed that scraping this link for HTML data doesn't go well, so I have to extract the URL embedded inside it. I want the substring that starts after &url= and ends at &ct, as that's where the real URL resides.

I've read posts about slicing from a start string, but couldn't find one that also covers an ending string. I've also tried an approach that uses the substring package, but it doesn't work for markers longer than one character.

How do I do this, preferably without using third-party packages?

Answer:

I'm not sure I understand the problem.

If you have a string, you can use string methods such as .find() and slicing with [start:end]

text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"

start = text.find('url=') + len('url=')
end   = text.find('&ct=')

text[start:end]

But url= and ct= may appear in a different order, so it is better to search for the first & after url=

text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"

start = text.find('url=') + len('url=')
end   = text.find('&', start)

text[start:end]
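
One caveat with the slicing approach: .find() returns -1 when a marker is missing, and then the slice silently gives a wrong result. A minimal sketch of a helper that guards against this is below; the function name extract_between and its default markers are hypothetical, chosen only for illustration.

```python
def extract_between(text, start_marker="url=", end_marker="&"):
    """Return the text after start_marker up to the next end_marker, or None."""
    pos = text.find(start_marker)
    if pos == -1:
        return None  # start marker not present in this string
    start = pos + len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        end = len(text)  # no closing marker: take the rest of the string
    return text[start:end]


# Shortened example link (hypothetical), same shape as the one in the question
link = "https://www.google.com/url?rct=j&sa=t&url=https://example.com/news/1&ct=ga"
print(extract_between(link))
```

If url= happens to be the last parameter, the helper returns everything up to the end of the string instead of cutting off the final character.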

EDIT:

There is also the standard module urllib.parse for working with URLs, e.g. to split or join them.

text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"

import urllib.parse

url, query = urllib.parse.splitquery(text)
data       = urllib.parse.parse_qs(query)

data['url'][0]

data now holds a dictionary:

{'cd': ['SldisGkopisopiasenjA6Y28Ug'],
 'ct': ['ga'],
 'rct': ['j'],
 'sa': ['t'],
 'url': ['https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428'],
 'usg': ['AFQjaskdfYJkasKugowe896fsdgfsweF']}

EDIT:

Python warns that splitquery() is deprecated as of 3.8 and that code should use urlparse() instead

text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"

import urllib.parse

parts = urllib.parse.urlparse(text)
data  = urllib.parse.parse_qs(parts.query)

data['url'][0]
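
Since the question mentions a CSV with over 100 links, here is a rough sketch of applying the urlparse() approach to every row. It assumes the link sits in the first column of each row, and it reads from an in-memory string so the example is self-contained; the helper name real_url and the sample rows are hypothetical.

```python
import csv
import io
from urllib.parse import urlparse, parse_qs


def real_url(link):
    """Return the value of the 'url' query parameter, or the link itself if absent."""
    query = urlparse(link).query
    values = parse_qs(query).get("url")
    return values[0] if values else link


# Stand-in for the real file; with a file on disk you would use open(path, newline="")
sample_csv = io.StringIO(
    '"https://www.google.com/url?rct=j&sa=t&url=https://example.com/news/1&ct=ga"\n'
    '"https://example.com/already-clean"\n'
)

# Assumption: the link is in the first column of every non-empty row
cleaned = [real_url(row[0]) for row in csv.reader(sample_csv) if row]
print(cleaned)
```

Unlike plain slicing, parse_qs() also decodes percent-escapes in the value, which helps if the embedded URL was encoded.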