I'm trying to extract the user ID from the page https://www.instagram.com/design.kaf/ using bs4 and regex.
I found a JSON key called "profile_id" inside a script tag, but I can't even search that script tag.
You can find my regex attempt here
I also can't find anything I could use to select that particular <script> tag.
My code:
import re
import requests
from bs4 import BeautifulSoup

url = "https://www.instagram.com/design.kaf/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36'
}
response = requests.request("GET", url, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
a = str(soup.find_all("script"))
x = re.findall(r'profile_id":"-?\d+"', a)
id = int(x[0])
print(id)
CodePudding user response:
You can try this code. It takes a loop-and-string-search approach:
import requests
from bs4 import BeautifulSoup

url = 'https://www.instagram.com/design.kaf/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36'
}
r = requests.request("GET", url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
s = str(soup.find_all('script'))

# our required string format: "profile_id":"0123456789....",
str_to_find = '"profile_id":"'
index_p = s.find(str_to_find)  # index of the first character, i.e. the opening double quote (-1 if absent)
if index_p == -1:
    raise ValueError('"profile_id" not found in any script tag')

# the first digit of the id starts at index_p + the length of the searched string
start = index_p + len(str_to_find)
id_str, counter = '', 0
while True:
    if s[start + counter] == '"':
        break  # iteration stops when we reach the closing double quote
    else:
        id_str += s[start + counter]
        counter += 1
print(id_str)  # prints 5172989370 in this case
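The same string search can also be done without the character-by-character loop, by locating the closing quote with a second str.find call and slicing between the two positions. This is only a minimal sketch: the helper name extract_between and the sample markup are made up for illustration, it runs against a hard-coded string rather than a live response, and Instagram's real page markup may differ or change.

```python
def extract_between(text, prefix, terminator='"'):
    """Return the text between `prefix` and the next `terminator`, or None."""
    start = text.find(prefix)
    if start == -1:
        return None  # prefix is not present at all
    start += len(prefix)  # move past the prefix to the first character of the value
    end = text.find(terminator, start)
    if end == -1:
        return None  # value is never closed by the terminator
    return text[start:end]


# Hypothetical sample standing in for str(soup.find_all('script'))
sample = '<script>{"profile_id":"5172989370","username":"design.kaf"}</script>'
profile_id = extract_between(sample, '"profile_id":"')
print(profile_id)  # prints 5172989370
```

Returning None instead of raising lets the caller decide what to do when the key is missing, for example when the site serves a login page instead of the profile.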
CodePudding user response:
Here is another answer using the re approach:
import ast
import re

import requests
from bs4 import BeautifulSoup

url = 'https://www.instagram.com/design.kaf/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36'
}
r = requests.request("GET", url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
s = str(soup.find_all('script'))

# this matches "profile_id":"5172989370"
# (your regex, changed by adding a double quote at the beginning)
to_be_find_string = re.findall(r'"profile_id":"-?\d+"', s)[0]
string_formatted_as_dict = '{' + to_be_find_string + '}'
# converts a <str> formatted as a dict literal into a <dict>
profile_dict = ast.literal_eval(string_formatted_as_dict)
print(profile_dict['profile_id'])  # prints your user id, i.e. 5172989370
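If you only need the number, a capture group in the regex returns the digits directly, so the ast.literal_eval step can be skipped. This is a rough sketch under the same assumption as above, namely that the page contains "profile_id":"<digits>" somewhere in a script tag; the names PROFILE_ID_RE and find_profile_id and the sample string are hypothetical and used only for illustration.

```python
import re

# The parentheses capture only the (optionally negative) digits of the id
PROFILE_ID_RE = re.compile(r'"profile_id":"(-?\d+)"')


def find_profile_id(text):
    """Return the profile id as an int, or None if the key is absent."""
    match = PROFILE_ID_RE.search(text)
    return int(match.group(1)) if match else None


# Hypothetical sample standing in for str(soup.find_all('script'))
sample = '<script>{"profile_id":"5172989370","username":"design.kaf"}</script>'
print(find_profile_id(sample))  # prints 5172989370
```

Because the group holds only digits, int() succeeds here, whereas calling int() on the whole match (key and quotes included) would raise a ValueError.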
Both of my answers are explained in the code comments. Please upvote if you find them useful.