Extracting chosen information from URL results into a dataframe

Time:03-06

I would like to create a dataframe by pulling only certain information from this website.

https://www.stockrover.com/build/production/Research/tail.js?1644930560

I would like to pull all the entries like this one: ["0005.HK","HSBC HOLDINGS","",""]

Another problem: suppose I only want the first 20,000 lines, which are the stock information. There is other information after line 20,000 that I don't want included in the dataframe.

To summarize, could someone show me how to pull out just the information I'm trying to extract and create a dataframe from those results, if this is possible?

A sample of the website results

function getStocksLibraryArray(){return[["0005.HK","HSBC HOLDINGS","",""],["0006.HK","Power Assets Holdings Ltd","",""],["000660.KS","SK hynix","",""],["004370.KS","Nongshim","",""],["005930.KS","Samsung Electroni","",""],["0123.HK","YUEXIU PROPERTY","",""],["0336.HK","HUABAO INTL","",""],["0408.HK","YIP'S CHEMICAL","",""],["0522.HK","ASM PACIFIC","",""],["0688.HK","CHINA OVERSEAS","",""],["0700.HK","TENCENT","",""],["0762.HK","CHINA UNICOM","",""],["0808.HK","PROSPERITY REIT","",""],["0813.HK","SHIMAO PROPERTY",

Code that pulls all lines, including the ones I don't want:

import pandas as pd
import requests

url = "https://www.stockrover.com/build/production/Research/tail.js?1644930560"

payload={}
headers = {}

response = requests.request("GET", url, headers=headers, data=payload)

print(response.text)

CodePudding user response:

Use a regex to extract the array returned by getStocksLibraryArray(), then literal_eval to convert that string into a Python object:

import re
from ast import literal_eval

import pandas as pd
import requests

url = "https://www.stockrover.com/build/production/Research/tail.js?1644930560"

response = requests.request("GET", url, headers={}, data={})

regex_ = re.compile(r"getStocksLibraryArray\(\)\{return(.+?)}", re.DOTALL)

print(pd.DataFrame(literal_eval(regex_.search(response.text).group(1))))

               0                          1       2 3 
0        0005.HK              HSBC HOLDINGS           
1        0006.HK  Power Assets Holdings Ltd           
2      000660.KS                   SK hynix           
3      004370.KS                   Nongshim           
4      005930.KS          Samsung Electroni           
...          ...                        ...     ... ..
21426      ZZHGF         ZhongAn Online P&C  _INSUP   
21427      ZZHGY         ZhongAn Online P&C  _INSUP   
21428       ZZLL      ZZLL Information Tech  _INTEC   
21429     ZZZ.TO       Sleep Country Canada  _SPECR   
21430      ZZZOF         Zinc One Resources  _OTHEI   
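
The regex only captures what getStocksLibraryArray() returns, so anything after that function in the file is left out without counting lines. Below is a minimal offline sketch of the same approach, run on a short sample copied from the question instead of the live URL. The helper name stocks_to_dataframe, the max_rows parameter, the column names, and the trailing somethingElse function in the sample are all assumptions added for illustration; the source only shows unnamed columns 0 to 3.

```python
import re
from ast import literal_eval

import pandas as pd

# Short sample in the same shape as the tail.js content quoted in the question.
# The trailing somethingElse() function is made up; it stands in for the
# unwanted content that follows the stock array in the real file.
SAMPLE_JS = (
    'function getStocksLibraryArray(){return[["0005.HK","HSBC HOLDINGS","",""],'
    '["0006.HK","Power Assets Holdings Ltd","",""],'
    '["000660.KS","SK hynix","",""],'
    '["0408.HK","YIP\'S CHEMICAL","",""]]}'
    'function somethingElse(){return 1}'
)

# Non-greedy match: capture everything between "return" and the first "}".
PATTERN = re.compile(r"getStocksLibraryArray\(\)\{return(.+?)\}", re.DOTALL)


def stocks_to_dataframe(js_text, max_rows=None):
    """Extract the array returned by getStocksLibraryArray() into a DataFrame."""
    match = PATTERN.search(js_text)
    if match is None:
        raise ValueError("getStocksLibraryArray() not found in the text")
    # The captured text is a list of lists of string literals.
    rows = literal_eval(match.group(1))
    # Column names are guesses; the source only shows positions 0-3.
    df = pd.DataFrame(rows, columns=["symbol", "name", "col_2", "col_3"])
    # Optionally keep only the first max_rows entries.
    return df if max_rows is None else df.head(max_rows)


df = stocks_to_dataframe(SAMPLE_JS, max_rows=3)
print(df)
```

With the live page, the same helper could be called as stocks_to_dataframe(response.text, max_rows=20000). Note that the non-greedy pattern stops at the first "}" it meets, so it would cut the array short if any company name contained that character.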