import requests
import re

r = requests.get(url)  # url: the page to scan for links
data = r.text
# use a regular expression to find all href targets
link_list = re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')", data)
for url in link_list:
    print(url)
I use a regular expression to find all the links on a page and save them to a lists.txt or CSV file. Could you tell me how to then search those links for specific content? The program below doesn't seem to run; could someone please take a look?
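For the "save to lists.txt" step mentioned above, a minimal sketch (save_links is a hypothetical helper, not from the original post) that writes each extracted link on its own line so the file can be read back later:

```python
def save_links(link_list, path='lists.txt'):
    # Write one URL per line so a later script can read them back
    # with readlines() or a simple loop.
    with open(path, 'w') as f:
        for link in link_list:
            f.write(link + '\n')
```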
import requests
import re

f = open('lists.txt', 'r')
urlList = f.readlines()
for url in urlList:
    r = requests.get(url)
    data = r.text
    email = re.findall(r"[0-9a-zA-Z.]+@[0-9a-zA-Z.]+?com", data)
    print(email)
Each part of the program runs fine on its own: the first part reads and prints the URL list, and the second part, given a single URL, finds the email addresses on that page. But when I put the two parts together it doesn't work, and I don't know why.
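A likely cause, offered as a guess since the thread doesn't confirm it: f.readlines() keeps the trailing '\n' on every line, so the combined loop calls requests.get('http://...\n'), which fails on the invalid URL. Stripping each entry first, as in this minimal sketch (read_urls is a hypothetical helper, not from the original post), usually fixes the combined script:

```python
def read_urls(path='lists.txt'):
    # readlines() returns each line with its trailing newline attached;
    # strip it, and skip blank lines, before requesting the URL.
    with open(path, 'r') as f:
        return [line.strip() for line in f.readlines() if line.strip()]
```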
CodePudding user response:
Show a piece of the data.

CodePudding user response:
I suggest you don't do this with regular expressions directly. Here is an example using a library instead; the library uses regular expressions internally, and you can read its source on GitHub.
from simplified_scrapy import Spider, SimplifiedDoc, SimplifiedMain, utils

class MySpider(Spider):
    name = 'test_spider'
    start_urls = ['your entry link address']
    refresh_urls = True

    def extract(self, url, html, models, modelNames):
        doc = SimplifiedDoc(html)
        lstA = None
        if url.url in self.start_urls:
            # from the entry page, collect the links to the detail pages
            lstA = doc.selects('a')
        else:
            # extract the data you want here
            email = doc.getElement('a', attr='class', value='email')
            print(email)
        return {"Urls": lstA, "Data": None}

SimplifiedMain.startThread(MySpider())  # Start download