I am scraping https://books.toscrape.com using Haskell's Scalpel library. Here's my code so far:
import Text.HTML.Scalpel
import Data.List.Split (splitOn)
import Data.List (sortBy)
import Control.Monad (liftM2)

data Entry = Entry { entName  :: String
                   , entPrice :: Float
                   , entRate  :: Int
                   } deriving Eq

instance Show Entry where
  show (Entry n p r) = "Name: " ++ n ++ "\nPrice: " ++ show p ++ "\nRating: " ++ show r ++ "/5\n"

entries :: Maybe [Entry]
entries = Just []

scrapePage :: Int -> IO ()
scrapePage num = do
  items <- scrapeURL ("https://books.toscrape.com/catalogue/page-" ++ show num ++ ".html") allItems
  let sortedItems = items >>= Just . sortBy (\(Entry _ a _) (Entry _ b _) -> compare a b)
                          >>= Just . filter (\(Entry _ _ r) -> r == 5)
  maybe (return ()) (mapM_ print) sortedItems

allItems :: Scraper String [Entry]
allItems = chroots ("article" @: [hasClass "product_pod"]) $ do
  p <- text $ "p" @: [hasClass "price_color"]
  t <- attr "href" $ "a"
  star <- attr "class" $ "p" @: [hasClass "star-rating"]
  let fp = read $ flip (!!) 1 $ splitOn "£" p
  let fStar = drop 12 star
  return $ Entry t fp $ r fStar
  where
    r f = case f of
      "One" -> 1
      "Two" -> 2
      "Three" -> 3
      "Four" -> 4
      "Five" -> 5

main :: IO ()
main = mapM_ scrapePage [1..10]

Basically, allItems scrapes each book's title, price and rating, does some formatting on the price to get a float, and returns the result as a value of type Entry. scrapePage takes a number corresponding to the result page number, scrapes that page to get an IO (Maybe [Entry]), formats it (in this case, filtering for 5-star books and ordering by price) and prints each Entry. main performs scrapePage over pages 1 to 10.
The problem I've run into is that my code scrapes, filters and sorts each page, whereas I want to scrape all the pages then filter and sort.
What worked for two pages (in GHCi) was:
  i <- scrapeURL ("https://books.toscrape.com/catalogue/page-1.html") allItems
  j <- scrapeURL ("https://books.toscrape.com/catalogue/page-2.html") allItems
  liftM2 (++) i j
This returns a list composed of page 1 and 2's results that I could then print, but I don't know how to implement this for all 50 result pages. Help would be appreciated.
Answer:
Just return the entry list from scrapePage without any processing (or you can do the filtering at this stage):
import Data.Maybe (maybeToList)

-- No error handling: a failed scrape (Nothing) simply becomes an empty list.
scrapePage :: Int -> IO [Entry]
scrapePage num =
  concat . maybeToList <$> scrapeURL ("https://books.toscrape.com/catalogue/page-" ++ show num ++ ".html") allItems
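
Note that this combines pages differently from the liftM2 approach in the question when a page fails to scrape. A minimal sketch of the difference, using plain lists of numbers in place of scraped pages (pages, allOrNothing and keepSuccessful are names made up for this illustration, not part of Scalpel):

```haskell
import Data.Maybe (maybeToList)

-- Hypothetical stand-ins for the results of scraping three pages;
-- Nothing models a page whose scrape failed.
pages :: [Maybe [Int]]
pages = [Just [1, 2], Nothing, Just [3]]

-- Generalises combining two Maybe lists with liftM2 to any number
-- of pages: a single failed page makes the whole result Nothing.
allOrNothing :: [Maybe [a]] -> Maybe [a]
allOrNothing = fmap concat . sequence

-- What concat . maybeToList does for one page, applied to every page:
-- failed pages are silently skipped.
keepSuccessful :: [Maybe [a]] -> [a]
keepSuccessful = concatMap (concat . maybeToList)

main :: IO ()
main = do
  print (allOrNothing pages)                           -- Nothing
  print (keepSuccessful pages)                         -- [1,2,3]
  print (allOrNothing [Just [1, 2], Just [3 :: Int]])  -- Just [1,2,3]
```

Which behaviour is preferable depends on whether a partial result is acceptable; the version above quietly drops any page that could not be scraped.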
Then you can process all the pages together afterwards:
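import Data.List (sortOn)

-- Sort every entry by price, then keep only the 5-star books.
process :: [Entry] -> [Entry]
process = filter (\e -> entRate e == 5) . sortOn entPrice

main = do
  entries <- concat <$> mapM scrapePage [1 .. 10]
  print $ process entries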
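
Because process is a pure function, it can be tried out without touching the network. A small sketch, assuming a few made-up entries in place of real scrape results (sampleEntries and the book names are invented for illustration, and Show is derived here only to keep the example short):

```haskell
import Data.List (sortOn)

-- Same fields as the Entry type in the question.
data Entry = Entry
  { entName  :: String
  , entPrice :: Float
  , entRate  :: Int
  } deriving (Eq, Show)

-- Sort by price, then keep only the five-star books.
process :: [Entry] -> [Entry]
process = filter (\e -> entRate e == 5) . sortOn entPrice

-- Made-up entries standing in for the concatenated scrape results.
sampleEntries :: [Entry]
sampleEntries =
  [ Entry "Book C" 30.0 5
  , Entry "Book A" 10.5 3
  , Entry "Book B" 12.0 5
  ]

main :: IO ()
main = mapM_ (putStrLn . entName) (process sampleEntries)
-- prints "Book B" then "Book C"
```

Filtering before sorting would give the same result here while sorting fewer elements.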
Moreover, you can easily make the scraping concurrent with mapConcurrently from the async package (module Control.Concurrent.Async):
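import Control.Concurrent.Async (mapConcurrently)

-- Same as before, but the pages are fetched concurrently.
main = do
  entries <- concat <$> mapConcurrently scrapePage [1 .. 20]
  print $ process entries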