I'm confused about a difference I'm seeing in Nokogiri commands run from Rails Console and what I get from the same commands run in a Rails Helper.
In Rails Console, I am able to capture the data I want with these commands:
endpoint = "https://basketball-reference.com/leagues/BAA_1947_totals.html"
browser = Watir::Browser.new(:chrome)
browser.goto(endpoint)
@doc_season = Nokogiri::HTML.parse(URI.open("https://basketball-reference.com/leagues/BAA_1947_totals.html"))
player_season_table = @doc_season.css("tbody")
rows = player_season_table.css("tr")
rows.search('.thead').each(&:remove) #THIS WORKED
rows[0].at_css("td").try(:text) # Gets single player name
rows[0].at_css("a").attributes["href"].try(:value) # Gets that player page URL
However, my rails helper that is meant to take those commands and fold them into methods:
module ScraperHelper
def target_scrape(url)
browser = Watir::Browser.new(:chrome)
browser.goto(url)
doc = Nokogiri::HTML.parse(browser.html)
end
def league_year_prefix(year, league = 'NBA')
# aba_seasons = 1968..1976
baa_seasons = 1947..1949
baa_seasons.include?(year) ? league_year = "BAA_#{year}" : league_year = "#{league}_#{year}"
end
def players_total_of_season(year, league = 'NBA')
# always the latter year of the season, first year is 1947 no quotes
# ABA is 1968 to 1976
league_year = league_year_prefix(year, league)
@doc_season = target_scrape("http://basketball-reference.com/leagues/#{league_year}_totals.html")
end
def gather_players_from_season
player_season_table = @doc_season.css("tbody")
rows = player_season_table.css("tr")
rows.search('.thead').each(&:remove)
puts rows[0].at_css("td").try(:text)
puts rows[0].at_css("a").attributes["href"].try(:value)
end
end
On that module, I try to emulate the rails console commands and break them into modules. And to test it out (since I don't have any other functionality or views built yet), I run Rails console, include this helper and run the methods.
But I get wildly different results. in the gather_players_from_season method, I can see that
player_season_table = @doc_season.css("tbody")
Is no longer grabbing the same data it grabbed when run as a command line by line. It also doesn't like the attributes method here:
puts rows[0].at_css("a").attributes["href"].try(:value)
So my first thought is a difference in gems maybe? Watir is launching the headless browser. Nokogiri isn't causing errors as near as I can tell.
CodePudding user response:
Your first thought of comparing the Gem versions is a great idea, but I am noticing a difference between the two code solutions:
In the Rails Console
the code parses the HTML with URI.open: Nokogiri::HTML.parse(URI.open("some html"))
In the ScraperHelper code
the code does not call URI.open, Nokogiri::HTML.parse("some html")
Perhaps that difference will return different values and make the rest of the ScraperHelper return unexpected results.