How to extract key info from <script> tag

Time:08-01

I'm trying to extract the user ID from this page, https://www.instagram.com/design.kaf/, using bs4 and regex.

I found a JSON key called "profile_id" inside a script tag, but I can't even search that script tag.

You can find my regex attempt here:

https://regex101.com/r/WmlAEc/1

I also can't find anything that lets me select this particular <script> tag.

My code:

    url= "https://www.instagram.com/design.kaf/"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36'
                    }
    
    response = requests.request("GET", url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml') 
    a=str(soup.findall("script"))
    x = re.findall('profile_id":"-?\d+"', a)
    id = int(x[0])
    print(id)

CodePudding user response:

You can try this code; it uses a loop and a string search.

import requests
from bs4 import BeautifulSoup

url = 'https://www.instagram.com/design.kaf/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36'
}

r = requests.request("GET", url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
s = str(soup.find_all('script'))

# our required string format "profile_id":"0123456789....",
str_to_find = '"profile_id":"'
index_p = s.find(str_to_find) # returns the index of the first character, i.e. the opening double quote

id_str, counter = '', 0
while True:
    # the first digit of the id starts at index_p + length of the searched string
    if s[index_p + len(str_to_find) + counter] == '"':
        break # iteration stops when we find a double quote again
    else:
        id_str += s[index_p + len(str_to_find) + counter]
        counter += 1

print(id_str) # prints 5172989370 in this case
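
The same string search can be written without walking the text one character at a time. This is only a sketch: the function name `extract_profile_id` and the `sample` string are made up for illustration, and it assumes the value is always wrapped in double quotes as in the format shown above.

```python
def extract_profile_id(text, key='"profile_id":"'):
    """Return the text between `key` and the next double quote, or None."""
    start = text.find(key)
    if start == -1:
        return None  # key not present in the text at all
    start += len(key)  # move past the key to the first character of the value
    end = text.find('"', start)  # closing quote of the value
    if end == -1:
        return None  # value is not terminated
    return text[start:end]


# hypothetical snippet mimicking the script contents, not real page output
sample = '<script>{"profile_id":"5172989370","username":"design.kaf"}</script>'
print(extract_profile_id(sample))
```

Returning None when the key is missing avoids the endless loop or IndexError that a character-by-character scan can run into when `find` returns -1.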

CodePudding user response:

Here is another answer, this time using the re approach.

import requests
from bs4 import BeautifulSoup
import re, ast

url = 'https://www.instagram.com/design.kaf/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36'
}

r = requests.request("GET", url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
s = str(soup.find_all('script'))

# this will match "profile_id":"5172989370"
to_be_find_string = re.findall(r'"profile_id":"-?\d+"', s)[0] # changed your regex by adding a double quote at the beginning

string_formatted_as_dict = '{' + to_be_find_string + '}'

# converts a str formatted as a dict literal into an actual dict
profile_dict = ast.literal_eval(string_formatted_as_dict)

print(profile_dict['profile_id']) # prints your user id, i.e. 5172989370
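
If you only need the number, a capture group in the regex returns the digits directly, so the ast step can be skipped. Again just a sketch: `PROFILE_ID_RE`, `find_profile_id` and the `sample` string are names and data I made up for illustration, and it assumes the id appears as a quoted run of digits.

```python
import re

# the parentheses capture only the digits (with an optional minus sign)
PROFILE_ID_RE = re.compile(r'"profile_id":"(-?\d+)"')


def find_profile_id(text):
    """Return the first profile_id in `text` as an int, or None if absent."""
    match = PROFILE_ID_RE.search(text)
    return int(match.group(1)) if match else None


# hypothetical snippet mimicking the script contents, not real page output
sample = '<script>{"profile_id":"5172989370","username":"design.kaf"}</script>'
print(find_profile_id(sample))
```

Checking the match before calling `group` means a page without the key gives None instead of an exception.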

Both of my answers are explained in the code comments. Please upvote if you find them useful.
