I am scraping the website But I only need the year I have made a split but I get an error.
CodePudding user response:
You could use REGEX to get only the year after having the list.
Of course, if what you showing is the pattern. Will work. Years would be the only one with 4 straight digits.
Example:
17.01.2023, 17:40
this \b\d{4}\b
will result in 2023.
CodePudding user response:
Your issue is that if you look at the output from this block
re = links3.map do |lk3|
lk3.css('.name').children.text.strip.split("\n")[2]
end
You will see:
[" 07.08.2016, 13:47", nil, nil, nil, nil, " 06.08.2016, 9:24", nil, nil, nil, nil,...]
So you could solve your immediate issue by just adding .compact
to the end
That being said here is another way to solve your issue:
You can get just the year from that text on that page using the following:
require 'nokogiri'
require 'open-uri'
url = "https://www.bananatic.com/de/forum/games/"
doc = Nokogiri::HTML(URI.open(url))
doc
.xpath('//div[@]/text()[string-length(normalize-space(.)) > 0]')
.map {|node| node.to_s[/\d{4}/]}
#=> ["2016", "2016", "2022", "2022", "2022", "2021", "2022", "2017", "2022", "2021", "2019", "2016", "2021", "2021", "2021", "2021", "2020", "2021", "2017", "2021"]
The 2 parts are:
//div[@]/text()[string-length(normalize-space(.)) > 0]
- the XPath which finds all divs with the class "name" and then pulls the non zero length (trimmed of white space) text nodes..map {|node| node.to_s[/\d{4}/]}
- map these into an array by slicing the String based on a regex for 4 contiguous digits.
If you would like the XPath to be as specific as your post you can use:
'//div[@]/ul/li//div[@]/text()[string-length(normalize-space(.)) > 0]'