Ruby Rails Screen Scrape different results in Rails Console

Time:02-18

I'm confused about a difference I'm seeing in Nokogiri commands run from Rails Console and what I get from the same commands run in a Rails Helper.

In Rails Console, I am able to capture the data I want with these commands:

endpoint = "https://basketball-reference.com/leagues/BAA_1947_totals.html"
browser = Watir::Browser.new(:chrome)
browser.goto(endpoint) 
@doc_season = Nokogiri::HTML.parse(URI.open("https://basketball-reference.com/leagues/BAA_1947_totals.html")) 
player_season_table = @doc_season.css("tbody")
rows = player_season_table.css("tr")
rows.search('.thead').each(&:remove) #THIS WORKED
rows[0].at_css("td").try(:text) # Gets single player name
rows[0].at_css("a").attributes["href"].try(:value) # Gets that player page URL

However, my Rails helper, which is meant to fold those commands into methods, behaves differently:

module ScraperHelper
  def target_scrape(url)
    browser = Watir::Browser.new(:chrome)
    browser.goto(url)
    doc = Nokogiri::HTML.parse(browser.html)
  end
  def league_year_prefix(year, league = 'NBA')
    # aba_seasons = 1968..1976
    baa_seasons = 1947..1949
    baa_seasons.include?(year) ? "BAA_#{year}" : "#{league}_#{year}"
  end
  def players_total_of_season(year, league = 'NBA')
    # always the latter year of the season, first year is 1947 no quotes
    # ABA is 1968 to 1976
    league_year = league_year_prefix(year, league)
    @doc_season = target_scrape("http://basketball-reference.com/leagues/#{league_year}_totals.html")
  end
  def gather_players_from_season
    player_season_table = @doc_season.css("tbody")
    rows = player_season_table.css("tr")
    rows.search('.thead').each(&:remove)
    puts rows[0].at_css("td").try(:text)
    puts rows[0].at_css("a").attributes["href"].try(:value)
  end
end

In that module, I try to emulate the Rails console commands and break them into methods. To test it (since I don't have any other functionality or views built yet), I start Rails console, include this helper, and run the methods.

But I get wildly different results. In the gather_players_from_season method, I can see that

player_season_table = @doc_season.css("tbody")

is no longer grabbing the same data it grabbed when I ran the commands line by line. It also doesn't like the attributes method here:

puts rows[0].at_css("a").attributes["href"].try(:value)

My first thought is that there may be a difference in gems. Watir is launching the headless browser, and Nokogiri isn't raising errors as far as I can tell.

CodePudding user response:

Your first thought of comparing the gem versions is a great idea, but I notice a difference between the two versions of the code:

In the Rails Console

the code parses the HTML with URI.open: Nokogiri::HTML.parse(URI.open("some html"))

In the ScraperHelper code

the code does not call URI.open; it parses the HTML returned by the Watir browser instead: Nokogiri::HTML.parse(browser.html)

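That difference may be the cause. URI.open returns the raw server response, while browser.html returns the page as the browser currently holds it, so the two documents can differ, and the rest of ScraperHelper may return unexpected results. If the first row has no link, at_css("a") returns nil, which would also explain the error on attributes.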
