import requests
import re

r = requests.get(url)  # url: the page to scan for links
data = r.text
# use a regular expression to find all href targets
link_list = re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')", data)
for url in link_list:
    print(url)
I use a regular expression to find all the links on a page and save them to a lists.txt or CSV file. Could you tell me how to then search those links for specific content? The program below doesn't seem to run; could someone please take a look?
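For the "save to lists.txt" step mentioned above, a minimal sketch (save_links is a hypothetical helper, not from the original post) that writes each extracted link on its own line so the file can be read back later:

```python
def save_links(link_list, path='lists.txt'):
    # Write one URL per line so a later script can read them back
    # with readlines() or a simple loop.
    with open(path, 'w') as f:
        for link in link_list:
            f.write(link + '\n')
```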
import requests
import re

f = open('lists.txt', 'r')
urlList = f.readlines()
for url in urlList:
    r = requests.get(url)
    data = r.text
    email = re.findall(r"[0-9a-zA-Z.]+@[0-9a-zA-Z.]+?com", data)
    print(email)
Each part of the program runs fine on its own: the first part reads and prints the URL list, and the second part, given a single URL, finds the email addresses on that page. But when I put the two parts together it doesn't work, and I don't know why.
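A likely cause, offered as a guess since the thread doesn't confirm it: f.readlines() keeps the trailing '\n' on every line, so the combined loop calls requests.get('http://...\n'), which fails on the invalid URL. Stripping each entry first, as in this minimal sketch (read_urls is a hypothetical helper, not from the original post), usually fixes the combined script:

```python
def read_urls(path='lists.txt'):
    # readlines() returns each line with its trailing newline attached;
    # strip it, and skip blank lines, before requesting the URL.
    with open(path, 'r') as f:
        return [line.strip() for line in f.readlines() if line.strip()]
```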
CodePudding user response:
Show a piece of the data.

CodePudding user response:
I suggest you don't do this with regular expressions directly. Here is an example using a library instead; the library uses regular expressions internally, and you can read its source on GitHub.
from simplified_scrapy import Spider, SimplifiedDoc, SimplifiedMain, utils

class MySpider(Spider):
    name = 'test_spider'
    start_urls = ['your entry link address']
    refresh_urls = True

    def extract(self, url, html, models, modelNames):
        doc = SimplifiedDoc(html)
        lstA = None
        if url.url in self.start_urls:
            # from the entry page, collect the links to the detail pages
            lstA = doc.selects('a')
        else:
            # extract the data you want here
            email = doc.getElement('a', attr='class', value='email')
            print(email)
        return {"Urls": lstA, "Data": None}

SimplifiedMain.startThread(MySpider())  # Start download