My issue is that I want to get a user's ID information from the chat.
The chat area I'm looking at looks like this:
<div id="chat_area" style="will-change: scroll-position;">
<dl user_id="asdf1234"><dt ><em ></em> :</dt><dd id="1">blah blah</dd></dl>
<a href="javascript:;" user_id="asdf1234" user_nick="asdf1234" userflag="65536" is_mobile="false" grade="user">asdf1234</a>
...
What I want to do is
get the part starting with <a href="javascript:;" user_id="asdf1234" ...
so that I can parse it and do some other things with it.
However, this is a page I'm currently using in my own browser, and it can't be driven through a proxy such as Selenium WebDriver.
How can I extract that data from the chat?
CodePudding user response:
It looks like you've got two separate problems here: fetching the HTML and parsing it. I'd use the requests and BeautifulSoup libraries together to accomplish this.
Open the network tab of your browser's developer tools, refresh the page, and look for the request whose response contains the HTML you want. Then use the requests library to reproduce that request exactly.
import requests
# Copy the headers from the request you found in the network tab.
headers = {"name": "value"}
# GET example.
response = requests.get("some_url", headers=headers)
# POST example; data holds the form fields sent in the request body.
data = {"key": "value"}
response = requests.post("some_url", headers=headers, data=data)
Web scraping is always finicky. If this doesn't work, you will most likely need a requests session so that cookies persist between requests. A one-time hacky alternative is to copy your cookies over from the browser.
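For what it's worth, here is a minimal sketch of the session approach with cookies copied from the browser. The cookie name, cookie value, User-Agent string, and URL are all hypothetical placeholders, not values from the question; substitute whatever your network tab shows. The request is only prepared, not sent, so you can inspect what would go over the wire.

```python
import requests

# Hypothetical values: copy the real cookie names/values and headers
# from the request shown in your browser's developer tools.
BROWSER_COOKIES = {"sessionid": "paste-value-here"}
BROWSER_HEADERS = {"User-Agent": "Mozilla/5.0"}


def build_session(cookies, headers):
    """Return a Session that sends the given cookies and headers with every request."""
    session = requests.Session()
    session.headers.update(headers)
    session.cookies.update(cookies)
    return session


session = build_session(BROWSER_COOKIES, BROWSER_HEADERS)

# Prepare (but do not send) a request to see the headers the session would attach.
# "https://example.com/chat" is a placeholder URL.
prepared = session.prepare_request(requests.Request("GET", "https://example.com/chat"))
print(prepared.headers.get("Cookie"))
```

Once the prepared headers look like the ones in your browser, session.get(...) and session.post(...) can replace the plain requests.get(...) and requests.post(...) calls.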
Once you have made the request you can use BeautifulSoup to scrape your user id very easily.
from bs4 import BeautifulSoup
# Create the BeautifulSoup parser (the "lxml" backend must be installed separately).
soup = BeautifulSoup(response.text, "lxml")
# Find all <a> elements that have a "user_id" attribute.
find_results = soup.find_all("a", {"user_id": True})
# Iterate over the results. You could also index if you want a single user_id.
for result in find_results:
    user_id = result["user_id"]
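To try the parsing step without making any request, here is a rough sketch that runs on the markup from the question. It assumes the chat area is closed with a </div> (the question elides the rest), uses the built-in "html.parser" backend so lxml isn't needed, and the helper name extract_chat_users is made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; the closing </div> is an assumption.
# In practice this string would be response.text.
SAMPLE_HTML = """
<div id="chat_area" style="will-change: scroll-position;">
<dl user_id="asdf1234"><dt><em></em> :</dt><dd id="1">blah blah</dd></dl>
<a href="javascript:;" user_id="asdf1234" user_nick="asdf1234" userflag="65536" is_mobile="false" grade="user">asdf1234</a>
</div>
"""


def extract_chat_users(html):
    """Return one dict of attributes per <a> tag in #chat_area that has a user_id."""
    soup = BeautifulSoup(html, "html.parser")
    chat_area = soup.find(id="chat_area")
    if chat_area is None:
        # The page did not contain the chat area at all.
        return []
    links = chat_area.find_all("a", attrs={"user_id": True})
    return [dict(link.attrs) for link in links]


users = extract_chat_users(SAMPLE_HTML)
for user in users:
    print(user["user_id"], user["user_nick"], user["grade"])
```

Returning the whole attribute dict rather than just user_id keeps user_nick, userflag, is_mobile, and grade available for whatever processing comes next.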
CodePudding user response:
To extract data from the chat area, you need a web scraping tool or library. Since you mentioned that you cannot drive the page through Selenium, consider a scraping library in a language such as Python or JavaScript instead.
For example, in Python you could use BeautifulSoup to parse the HTML of the page and extract the desired information. You could then use the user_id value to do any further processing that you need to do.
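If installing BeautifulSoup is not an option, the same idea can be sketched with Python's standard-library html.parser instead. This swaps in a different parser from the one named above, and the class name UserLinkParser is made up for illustration; the input line is the <a> tag from the question.

```python
from html.parser import HTMLParser


class UserLinkParser(HTMLParser):
    """Collect the attributes of every <a> tag that has a user_id attribute."""

    def __init__(self):
        super().__init__()
        self.users = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attributes = dict(attrs)
        if tag == "a" and "user_id" in attributes:
            self.users.append(attributes)


parser = UserLinkParser()
parser.feed(
    '<a href="javascript:;" user_id="asdf1234" user_nick="asdf1234" '
    'userflag="65536" is_mobile="false" grade="user">asdf1234</a>'
)
for user in parser.users:
    print(user["user_id"], user["grade"])
```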
Alternatively, if you have access to the server-side code for the page, you could modify it to include the user_id information in a more easily accessible way, such as in a data attribute on the chat area element itself. This would allow you to easily retrieve the user_id value using JavaScript without having to scrape the page.