Scrapy view(response) not showing all HTML that was scraped

Time:05-18

I'm just digging into Scrapy, so forgive the basic question: why doesn't view(response) in the Scrapy shell show all of the HTML from the file it scraped?

I set the spider to crawl a page (Barry Bonds' page on Baseball Reference), using the same code as in the tutorial, only changing the name of the spider and the file name it saved as.

Once I scraped the page, I opened the HTML up in Safari (on a Mac) and the whole page shows up.

Then, back in terminal, I use the commands:

scrapy shell fileLocationOnComputer
view(response)

Safari opens, but a large majority of the page is missing.

Here are two screenshots to depict my issue

Thanks for any help y'all can provide!

CodePudding user response:

            Scrapy shell view(response) not showing all HTML 
  • As we all know, Scrapy can't render JavaScript.

  • Because Scrapy can't render JavaScript, view(response) in the Scrapy shell can't see the part of the HTML that is loaded dynamically by JavaScript. That is the only reason the Scrapy shell doesn't show all of the HTML.

  • Safari, Chrome, or any other browser shows the complete HTML DOM, whether it is dynamic or static. You only see the difference between the dynamic and the static HTML when you disable JavaScript in the browser and refresh the opened URL; the dynamic HTML then no longer appears. That's why Scrapy's view(response) doesn't show all of the HTML.

To pull the static table with pandas:

import pandas as pd

# read_html returns a list of DataFrames, one per <table> it can parse; [0] takes the first
df = pd.read_html('https://www.baseball-reference.com/players/b/bondsba01.shtml')[0]
print(df)

Output:

            Year            Age             Tm  ...  IBB     Pos          Awards
0            1985             20        PIT-min  ...    0     NaN      PRW · CARL
1            1986             21        PIT-min  ...    0     NaN       HAW · PCL
2            1986             21            PIT  ...    2    *8/H           RoY-6
3            1987             22            PIT  ...    3  *78H/9             NaN
4            1988             23            PIT  ...   14   *7H/8             NaN
5            1989             24            PIT  ...   22    *7/H             NaN
6            1990             25            PIT  ...   15   *7/H8  AS,MVP-1,GG,SS
7            1991             26            PIT  ...   25   *7/H8     MVP-2,GG,SS
8            1992             27            PIT  ...   32    *7/H  AS,MVP-1,GG,SS
9            1993             28            SFG  ...   43    *7/H  AS,MVP-1,GG,SS
10           1994             29            SFG  ...   18    *7/H  AS,MVP-4,GG,SS
11           1995             30            SFG  ...   22    *7/H       AS,MVP-12
12           1996             31            SFG  ...   30   *7/H8  AS,MVP-5,GG,SS
13           1997             32            SFG  ...   34      *7  AS,MVP-5,GG,SS
14           1998             33            SFG  ...   29    *7/H     AS,MVP-8,GG
15           1999             34            SFG  ...    9    7/DH          MVP-24
16           2000             35            SFG  ...   22    *7/H     AS,MVP-2,SS
17           2001             36            SFG  ...   35   *7/DH     AS,MVP-1,SS
18           2002             37            SFG  ...   68   *7/DH     AS,MVP-1,SS
19           2003             38            SFG  ...   61   *7/DH     AS,MVP-1,SS
20           2004             39            SFG  ...  120   *7/HD     AS,MVP-1,SS
21           2005             40            SFG  ...    3     7/H             NaN
22           2006             41            SFG  ...   38   *7H/D             NaN
23           2007             42            SFG  ...   43   *7H/D              AS
24         22 Yrs         22 Yrs         22 Yrs  ...  688     NaN             NaN
25  162 Game Avg.  162 Game Avg.  162 Game Avg.  ...   37     NaN             NaN
26            NaN            NaN            NaN  ...  IBB     Pos          Awards
27   SFG (15 yrs)   SFG (15 yrs)   SFG (15 yrs)  ...  575     NaN             NaN
28    PIT (7 yrs)    PIT (7 yrs)    PIT (7 yrs)  ...  113     NaN             NaN

[29 rows x 30 columns]


 

CodePudding user response:

The tables aren't really dynamic. They are actually just inside HTML comments.
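One way to check this claim is to count the tables an HTML parser sees before and after the comment markers are stripped. The following is a minimal sketch using only the standard library; the SAMPLE string is a made-up stand-in for the downloaded page, and TableCounter and count_tables are hypothetical helper names, not part of Scrapy or pandas.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count <table> start tags that the parser sees as real elements."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        # Markup inside <!-- ... --> never reaches this method,
        # so commented-out tables are not counted.
        if tag == "table":
            self.tables += 1


def count_tables(html):
    counter = TableCounter()
    counter.feed(html)
    counter.close()
    return counter.tables


# Hypothetical sample: one live table and one table hidden in a comment.
SAMPLE = """
<div><table id="batting_standard"><tr><td>1986</td></tr></table></div>
<div><!--
<table id="batting_postseason"><tr><td>1990</td></tr></table>
--></div>
"""

visible = count_tables(SAMPLE)
uncommented = count_tables(SAMPLE.replace("<!--", "").replace("-->", ""))
print(visible, uncommented)
```

If the second number is larger than the first on the real page, the missing tables are sitting inside comments rather than being fetched later by JavaScript.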

Two ways you could pull those:

  1. Use BeautifulSoup to pull out the comments, then parse them
  2. Simply remove the comment tags

This will get you all the tables. It's then just a matter of pulling out the one you want, either by a specific attribute or by its index position in df_list.

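from io import StringIO

import pandas as pd
import requests

response = requests.get('https://www.baseball-reference.com/players/b/bondsba01.shtml')
# Strip the comment markers so the commented-out tables become regular HTML
html = response.text.replace("<!--", "").replace("-->", "")

# Recent pandas versions expect a file-like object rather than a raw HTML string
df_list = pd.read_html(StringIO(html))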
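To find out which id values are available before picking a table, the uncommented HTML can be scanned for table tags. This is a rough sketch with the standard library only; SAMPLE_HTML is an invented stand-in for the html string above, and TableIdCollector and list_table_ids are hypothetical names.

```python
from html.parser import HTMLParser


class TableIdCollector(HTMLParser):
    """Collect the id attribute of every <table> in document order."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # attrs is a list of (name, value) pairs; tables without an id give None
            self.ids.append(dict(attrs).get("id"))


def list_table_ids(html):
    collector = TableIdCollector()
    collector.feed(html)
    collector.close()
    return collector.ids


# Hypothetical sample standing in for the page after the comment markers are removed.
SAMPLE_HTML = """
<table id="batting_standard"><tr><td>1986</td></tr></table>
<table id="batting_postseason"><tr><td>1990</td></tr></table>
<table><tr><td>no id</td></tr></table>
"""

table_ids = list_table_ids(SAMPLE_HTML)
print(table_ids)
```

The positions in this list will often line up with the positions in df_list, but that is not guaranteed, since pandas may skip tables it cannot parse. Selecting by id is the safer option.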

To specify a table:

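from io import StringIO

import pandas as pd
import requests

response = requests.get('https://www.baseball-reference.com/players/b/bondsba01.shtml')
# Strip the comment markers so the commented-out tables become regular HTML
html = response.text.replace("<!--", "").replace("-->", "")

# attrs narrows the match to the table whose id is 'batting_postseason'
df = pd.read_html(StringIO(html), attrs={'id': 'batting_postseason'})[0]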

Output:

print(df)
                Year               Age                Tm  ...   IBB   WPA    cWPA
0               1990                25               PIT  ...   0.0 -0.13   -0.2%
1               1991                26               PIT  ...   0.0 -0.68  -14.7%
2               1992                27               PIT  ...   1.0  0.08    1.0%
3                NaN               NaN               NaN  ...   NaN   NaN     NaN
4               1997                32               SFG  ...   0.0  0.31    3.3%
5                NaN               NaN               NaN  ...   NaN   NaN     NaN
6               2000                35               SFG  ...   1.0 -0.10   -1.6%
7                NaN               NaN               NaN  ...   NaN   NaN     NaN
8               2002                37               SFG  ...   3.0  0.05    2.6%
9               2002                37               SFG  ...   3.0  0.59    9.0%
10              2002                37               SFG  ...   7.0  0.56   22.9%
11              2003                38               SFG  ...   6.0  0.50    5.7%
12  7 Yrs (9 Series)  7 Yrs (9 Series)  7 Yrs (9 Series)  ...  21.0  1.18   27.9%
13            4 NLDS            4 NLDS            4 NLDS  ...  10.0  0.76    9.9%
14            4 NLCS            4 NLCS            4 NLCS  ...   4.0 -0.14   -5.0%
15              1 WS              1 WS              1 WS  ...   7.0  0.56   22.9%

[16 rows x 32 columns]
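The first option in the list above (pulling out the comments and parsing them) is not shown in code. Here is a minimal sketch of the idea, using the standard library's HTMLParser as a stand-in for BeautifulSoup so that it runs without extra packages. The SAMPLE_PAGE string is invented, and CommentCollector and commented_tables are hypothetical names.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect the text of every HTML comment in the document."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data is the comment body without the <!-- and --> markers
        self.comments.append(data)


def commented_tables(html):
    """Return the comment bodies that contain a <table> element."""
    collector = CommentCollector()
    collector.feed(html)
    collector.close()
    return [c for c in collector.comments if "<table" in c]


# Hypothetical sample: one ordinary comment and one comment hiding a table.
SAMPLE_PAGE = """
<!-- just a note -->
<div><!--
<table id="batting_postseason"><tr><td>1990</td></tr></table>
--></div>
"""

fragments = commented_tables(SAMPLE_PAGE)
print(len(fragments))
```

Each fragment is plain HTML and can be handed to pd.read_html (wrapped in io.StringIO on recent pandas versions). With BeautifulSoup itself, the comments can be collected with soup.find_all(string=lambda text: isinstance(text, Comment)), where Comment is imported from bs4. Unlike the blanket replace, this approach leaves ordinary comments untouched.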