The crawler's XPath is correct, so why do I get an empty list?

Time:10-13

The picture below is the test I wrote. I don't know why the XPath that extracts the content shown below returns an empty list.
As far as I can tell, the XPath itself is right.


Here is all the code; I hope the experts here can correct it.
import requests
import csv
from bs4 import BeautifulSoup
from lxml import etree


url = "http://www.cqjlpggzyzhjy.gov.cn/cqjl/jyxx/003001/003001002/MoreInfo.aspx?CategoryNum=003001002"
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}


def main():
    # NOTE: the upper bound of the page range is garbled in the source ("1 states"); 2 is a placeholder
    for i in range(1, 2):
        # For the POST request, only the form fields that change need to be sent, not the whole form
        playload = {'__EVENTARGUMENT': i}
        r = requests.post(url, headers=headers, data=playload, timeout=30)
        r.encoding = "utf-8"  # or r.content.decode('utf-8')
        html = etree.HTML(r.text)
        # Use XPath to extract every tr in the table; no text() at the end
        trs = html.xpath('//*[@id="MoreInfoList1_DataGrid1"]//tr')

        titles = []
        messages = []
        csv_headers = ['name', 'time']  # renamed so it does not shadow the HTTP headers
        keylists = []
        valuelists = []
        # Get each announcement's name and date; text() extracts the text inside a tag, and xpath() returns a list
        for tr in trs:
            # Take the first title name
            name = tr.xpath('.//td[2]//a/text()')[0]
            # Take the first date
            time = tr.xpath('.//td[3]/text()')[0].replace('\n', '').replace('\r', '')
            # Build the detail URL from the first href
            urls = "http://www.cqjlpggzyzhjy.gov.cn" + tr.xpath(".//td[2]//a/@href/text()")[0]
            r2 = requests.get(urls, headers=headers)
            r2.encoding = "utf-8"
            html2 = etree.HTML(r2.text)

            data = {'the bid-winning notice': name,
                    'time': time
                    }
            with open('the people announcement title', 'w', encoding="utf-8", newline='') as fp:
                writer = csv.DictWriter(fp, csv_headers)
                writer.writeheader()
                writer.writerows(titles)

            # Keys: the first cell of each row in the detail table
            keylist = html2.xpath("//body/table/@class/tbody/tr/td[1]//p//text()")
            for i in range(6):
                keylists.append(keylist[i])

            # Values: the second cell of each row in the detail table
            valuelist = html2.xpath("//body/table/@class/tbody/tr/td[2]//p//text()")
            for i in range(6):
                valuelists.append(valuelist[i])

            for i in range(6):
                key = keylists[i]
                value = valuelists[i]
                messages[0][key] = value[0]

            with open('slope announcement content of task 2', 'w', encoding="utf-8", newline='') as f:
                writer = csv.DictWriter(f, csv_headers)
                writer.writeheader()
                writer.writerows(messages)


main()



CodePudding user response:

Sigh, it has been a long time and nobody has replied. I am purely self-taught and have no one to guide me.

CodePudding user response:

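Could the tag structure change once etree has parsed the page? The tree that lxml builds from the raw HTML is not necessarily the same as the DOM shown in the browser, so an XPath that matches in the browser may not match in lxml.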
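
To illustrate that point, here is a minimal sketch. The HTML string is a made-up stand-in for the listing page (the real page source is not shown in this thread, and the href values are invented), so treat it as an assumption. It shows three ways an XPath that looks right can come back empty in lxml: a tbody step that exists only in the browser's DOM, a path that steps through an attribute such as @class, and text() applied to an attribute such as @href.

```python
from lxml import etree

# Hypothetical page source: the raw HTML contains no <tbody>.
# Browsers insert <tbody> into the DOM, but lxml's HTML parser does not.
RAW_HTML = """
<html><body>
<table id="MoreInfoList1_DataGrid1" class="grid">
  <tr><td>1</td><td><a href="/info/1.html">Notice A</a></td><td>2018-10-13</td></tr>
  <tr><td>2</td><td><a href="/info/2.html">Notice B</a></td><td>2018-10-12</td></tr>
</table>
</body></html>
"""

tree = etree.HTML(RAW_HTML)

# 1. A path copied from the browser often contains tbody; here it matches nothing.
with_tbody = tree.xpath('//table[@id="MoreInfoList1_DataGrid1"]/tbody/tr')
without_tbody = tree.xpath('//table[@id="MoreInfoList1_DataGrid1"]//tr')

# 2. An attribute node has no children, so any step after @class is empty.
through_attr = tree.xpath('//body/table/@class/tr/td[1]//text()')

# 3. @href already yields the attribute value; appending /text() yields nothing.
href_text = tree.xpath('//td[2]//a/@href/text()')
hrefs = tree.xpath('//td[2]//a/@href')
titles = tree.xpath('//td[2]//a/text()')

print(len(with_tbody), len(without_tbody))
print(through_attr, href_text)
print(hrefs, titles)

# To see what lxml actually parsed, dump the tree and compare it with the browser:
# print(etree.tostring(tree, pretty_print=True, encoding="unicode"))
```

If the same pattern applies to the real page, selecting the table by an attribute predicate (for example //table[@class="..."]) instead of stepping through @class, dropping tbody, and dropping the trailing /text() after @href would be the things to try first.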
