I'm just getting started with Scrapy, so forgive the basic question: why does view(response) in the Scrapy shell not show all of the HTML from the file it scraped?
I set the spider to crawl a page (Barry Bonds' page on Baseball Reference), using the same code as in the tutorial, only changing the name of the spider and the file name it saved as.
Once I had scraped the page, I opened the saved HTML in Safari (on a Mac) and the whole page showed up.
Then, back in the terminal, I ran these commands:
scrapy shell fileLocationOnComputer
view(response)
Safari opens, but a large majority of the page is missing.
Here are two screenshots to depict my issue
Thanks for any help y'all can provide!
CodePudding user response:
Scrapy shell view(response) not showing all HTML
As we all know, Scrapy can't render JavaScript.
That is why view(response) in the Scrapy shell can't see the part of the HTML that is loaded dynamically by JavaScript, and it is the only reason the Scrapy shell doesn't show all of the HTML. Safari, Chrome, or any other browser shows the complete HTML DOM whether it is dynamic or static. You will only see the difference between dynamic and static HTML in a browser if you disable JavaScript and refresh the opened URL; the dynamic HTML then never appears. That's why Scrapy's view(response) doesn't show all of the HTML.
To pull the static table with pandas:
import pandas as pd
df = pd.read_html('https://www.baseball-reference.com/players/b/bondsba01.shtml')[0]
print(df)
Output:
Year Age Tm ... IBB Pos Awards
0 1985 20 PIT-min ... 0 NaN PRW · CARL
1 1986 21 PIT-min ... 0 NaN HAW · PCL
2 1986 21 PIT ... 2 *8/H RoY-6
3 1987 22 PIT ... 3 *78H/9 NaN
4 1988 23 PIT ... 14 *7H/8 NaN
5 1989 24 PIT ... 22 *7/H NaN
6 1990 25 PIT ... 15 *7/H8 AS,MVP-1,GG,SS
7 1991 26 PIT ... 25 *7/H8 MVP-2,GG,SS
8 1992 27 PIT ... 32 *7/H AS,MVP-1,GG,SS
9 1993 28 SFG ... 43 *7/H AS,MVP-1,GG,SS
10 1994 29 SFG ... 18 *7/H AS,MVP-4,GG,SS
11 1995 30 SFG ... 22 *7/H AS,MVP-12
12 1996 31 SFG ... 30 *7/H8 AS,MVP-5,GG,SS
13 1997 32 SFG ... 34 *7 AS,MVP-5,GG,SS
14 1998 33 SFG ... 29 *7/H AS,MVP-8,GG
15 1999 34 SFG ... 9 7/DH MVP-24
16 2000 35 SFG ... 22 *7/H AS,MVP-2,SS
17 2001 36 SFG ... 35 *7/DH AS,MVP-1,SS
18 2002 37 SFG ... 68 *7/DH AS,MVP-1,SS
19 2003 38 SFG ... 61 *7/DH AS,MVP-1,SS
20 2004 39 SFG ... 120 *7/HD AS,MVP-1,SS
21 2005 40 SFG ... 3 7/H NaN
22 2006 41 SFG ... 38 *7H/D NaN
23 2007 42 SFG ... 43 *7H/D AS
24 22 Yrs 22 Yrs 22 Yrs ... 688 NaN NaN
25 162 Game Avg. 162 Game Avg. 162 Game Avg. ... 37 NaN NaN
26 NaN NaN NaN ... IBB Pos Awards
27 SFG (15 yrs) SFG (15 yrs) SFG (15 yrs) ... 575 NaN NaN
28 PIT (7 yrs) PIT (7 yrs) PIT (7 yrs) ... 113 NaN NaN
[29 rows x 30 columns]
CodePudding user response:
The tables aren't really dynamic. They are actually just inside HTML comments in the page source.
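To illustrate what "inside HTML comments" means for a parser, here is a minimal standard-library sketch. The SAMPLE_HTML page and the batting_standard id are made up for the demonstration (only batting_postseason comes from this thread), and the TableFinder and find_tables names are hypothetical helpers, not part of Scrapy or pandas.

```python
from html.parser import HTMLParser

# Hypothetical miniature page: one table in the regular markup and
# one table wrapped in an HTML comment.
SAMPLE_HTML = """
<div>
  <table id="batting_standard"><tr><td>1986</td></tr></table>
</div>
<div>
  <!--
  <table id="batting_postseason"><tr><td>1990</td></tr></table>
  -->
</div>
"""


class TableFinder(HTMLParser):
    """Collect the id of every <table> start tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.table_ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_ids.append(dict(attrs).get("id"))


def find_tables(html):
    """Return the ids of all tables that exist as real elements in html."""
    finder = TableFinder()
    finder.feed(html)
    finder.close()
    return finder.table_ids


# Parsed as-is, the commented-out table is not an element at all.
visible = find_tables(SAMPLE_HTML)

# After removing the comment delimiters, both tables are regular markup.
uncommented = find_tables(SAMPLE_HTML.replace("<!--", "").replace("-->", ""))

print(visible)      # ['batting_standard']
print(uncommented)  # ['batting_standard', 'batting_postseason']
```

The commented-out table is present in the downloaded file the whole time; a parser just treats it as comment text rather than as elements.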
Two ways you could pull those:
- Use BeautifulSoup to pull out the Comments, then parse them
- Simply remove the comment tags
This will get you all the tables. It is then just a matter of pulling out the one you want, either by a specific attribute or by its index position in df_list.
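from io import StringIO
import pandas as pd
import requests
response = requests.get('https://www.baseball-reference.com/players/b/bondsba01.shtml')
# Strip the comment delimiters so the commented-out tables become regular markup
html = response.text.replace("<!--","").replace("-->","")
# Recent pandas versions deprecate passing a raw HTML string, so wrap it in StringIO
df_list = pd.read_html(StringIO(html))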
To specify a table:
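from io import StringIO
import pandas as pd
import requests
response = requests.get('https://www.baseball-reference.com/players/b/bondsba01.shtml')
# Strip the comment delimiters so the commented-out tables become regular markup
html = response.text.replace("<!--","").replace("-->","")
# Select a single table by its id attribute
df = pd.read_html(StringIO(html), attrs={'id':'batting_postseason'})[0]
print(df)
Output: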
Year Age Tm ... IBB WPA cWPA
0 1990 25 PIT ... 0.0 -0.13 -0.2%
1 1991 26 PIT ... 0.0 -0.68 -14.7%
2 1992 27 PIT ... 1.0 0.08 1.0%
3 NaN NaN NaN ... NaN NaN NaN
4 1997 32 SFG ... 0.0 0.31 3.3%
5 NaN NaN NaN ... NaN NaN NaN
6 2000 35 SFG ... 1.0 -0.10 -1.6%
7 NaN NaN NaN ... NaN NaN NaN
8 2002 37 SFG ... 3.0 0.05 2.6%
9 2002 37 SFG ... 3.0 0.59 9.0%
10 2002 37 SFG ... 7.0 0.56 22.9%
11 2003 38 SFG ... 6.0 0.50 5.7%
12 7 Yrs (9 Series) 7 Yrs (9 Series) 7 Yrs (9 Series) ... 21.0 1.18 27.9%
13 4 NLDS 4 NLDS 4 NLDS ... 10.0 0.76 9.9%
14 4 NLCS 4 NLCS 4 NLCS ... 4.0 -0.14 -5.0%
15 1 WS 1 WS 1 WS ... 7.0 0.56 22.9%
[16 rows x 32 columns]
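The first option above (pulling the comments out with BeautifulSoup) could look roughly like the sketch below. It runs offline on a made-up SAMPLE_HTML string instead of the real page, and commented_tables is a hypothetical helper name; treat it as one possible approach under those assumptions.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical miniature page standing in for the downloaded HTML.
SAMPLE_HTML = """
<div>
  <!-- just a note, no table here -->
  <!--
  <table id="batting_postseason"><tr><td>1990</td></tr></table>
  -->
</div>
"""


def commented_tables(html):
    """Return the text of every HTML comment that contains a <table>."""
    soup = BeautifulSoup(html, "html.parser")
    # Comment nodes are a special kind of string in the parse tree.
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    return [str(c) for c in comments if "<table" in c]


hidden = commented_tables(SAMPLE_HTML)

# Re-parse each comment body as HTML to reach the table inside it.
hidden_ids = [
    BeautifulSoup(markup, "html.parser").find("table").get("id")
    for markup in hidden
]
print(hidden_ids)  # ['batting_postseason']
```

For the real page, the html argument would be the text of the requests response, and each string in hidden could be wrapped in io.StringIO and passed to pd.read_html to get DataFrames. Compared with deleting every comment delimiter, this leaves ordinary comments alone and only touches the ones that hold a table.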