Concatenating scrapeURL results from multiple scrapes into one list

I am scraping https://books.toscrape.com using Haskell's Scalpel library. Here's my code so far:

import Text.HTML.Scalpel
import Data.List.Split (splitOn)
import Data.List (sortBy)
import Control.Monad (liftM2)

data Entry = Entry {entName :: String
                   , entPrice :: Float
                   , entRate :: Int
                   } deriving Eq

instance Show Entry where
  show (Entry n p r) = "Name: " ++ n ++ "\nPrice: " ++ show p ++ "\nRating: " ++ show r ++ "/5\n"

entries :: Maybe [Entry]
entries = Just []

scrapePage :: Int -> IO ()
scrapePage num = do
  items <- scrapeURL ("https://books.toscrape.com/catalogue/page-" ++ show num ++ ".html") allItems
  let sortedItems = items >>= Just . sortBy (\(Entry _ a _) (Entry _ b _) -> compare a b)
                          >>= Just . filter (\(Entry _ _ r) -> r == 5)
  maybe (return ()) (mapM_ print) sortedItems

allItems :: Scraper String [Entry]
allItems = chroots ("article" @: [hasClass "product_pod"]) $ do
    p <- text $ "p" @: [hasClass "price_color"]
    t <- attr "href" $ "a"
    star <- attr "class" $ "p" @: [hasClass "star-rating"]
    let fp = read $ flip (!!) 1 $ splitOn "£" p
    let fStar = drop 12 star
    return $ Entry t fp $ r fStar
      where
        r f = case f of
          "One" -> 1
          "Two" -> 2
          "Three" -> 3
          "Four" -> 4
          "Five" -> 5

main :: IO ()
main = mapM_ scrapePage [1..10]

Basically, allItems scrapes each book's title, price and rating, parses the price into a Float, and returns the result as an Entry. scrapePage takes a result page number, scrapes that page to get an IO (Maybe [Entry]), processes it (in this case, filtering for 5-star books and ordering by price) and prints each Entry. main runs scrapePage over pages 1 to 10.

The problem I've run into is that my code scrapes, filters and sorts each page, whereas I want to scrape all the pages then filter and sort.

What worked for two pages (in GHCi) was:

i <- scrapeURL ("https://books.toscrape.com/catalogue/page-1.html") allItems
j <- scrapeURL ("https://books.toscrape.com/catalogue/page-2.html") allItems
liftM2 (++) i j

This returns a list composed of page 1 and 2's results that I could then print, but I don't know how to implement this for all 50 result pages. Help would be appreciated.

Answer:

Just return the list of entries from each page without any processing (or you can do the filtering at this stage).

import Data.Maybe (maybeToList)

-- No error handling: a page that fails to scrape contributes an empty list.
scrapePage :: Int -> IO [Entry]
scrapePage num =
  concat . maybeToList <$> scrapeURL ("https://books.toscrape.com/catalogue/page-" ++ show num ++ ".html") allItems

Then you can process all the entries together afterwards:

-- sortOn is from Data.List
process :: [Entry] -> [Entry]
process = filter (\e -> entRate e == 5) . sortOn entPrice
main = do
  entries <- concat <$> mapM scrapePage [1 .. 10]
  print $ process entries
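
If it matters whether a page failed, there are two ways to combine the per-page Maybe [Entry] values. The sketch below is only an illustration: fakeScrape is a hypothetical stand-in for scrapeURL so that it runs without network access, and the names lenient and strict are made up for this example. lenient drops failed pages (the same effect as concat . maybeToList per page), while strict generalises liftM2 (++) to any number of pages and gives Nothing if any page failed.

```haskell
import Data.List (sortOn)
import Data.Maybe (catMaybes)

-- Same shape as the Entry type in the question (Show derived for brevity).
data Entry = Entry
  { entName  :: String
  , entPrice :: Float
  , entRate  :: Int
  } deriving (Eq, Show)

-- Hypothetical stand-in for `scrapeURL url allItems`; page 2 "fails".
fakeScrape :: Int -> IO (Maybe [Entry])
fakeScrape 1 = pure (Just [Entry "a" 30 5, Entry "b" 10 3])
fakeScrape 3 = pure (Just [Entry "c" 20 5])
fakeScrape _ = pure Nothing

-- Skip pages that failed and keep whatever was scraped.
lenient :: [Maybe [Entry]] -> [Entry]
lenient = concat . catMaybes

-- All-or-nothing: Nothing if any single page failed.
strict :: [Maybe [Entry]] -> Maybe [Entry]
strict = fmap concat . sequence

-- Keep 5-star books, cheapest first.
process :: [Entry] -> [Entry]
process = sortOn entPrice . filter ((== 5) . entRate)

main :: IO ()
main = do
  pages <- mapM fakeScrape [1 .. 3]
  mapM_ print (process (lenient pages))
  print (strict pages)
```

Filtering before sorting gives the same result as sorting first, but sorts fewer elements.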

Moreover, you can easily make your code concurrent with mapConcurrently from Control.Concurrent.Async (in the async package):

main = do
  entries <- concat <$> mapConcurrently scrapePage [1 .. 20]
  print $ process entries
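
mapConcurrently starts every request at once, which may be more load than you want to put on the site. One possible way to cap the number of simultaneous requests is a semaphore from base, as in the sketch below. This is an assumption about what you might want, not part of the original answer: boundedMap is a hypothetical helper name, and scrapePage here is a stub returning numbers so the example runs offline.

```haskell
import Control.Concurrent.Async (mapConcurrently)
import Control.Concurrent.QSem (newQSem, signalQSem, waitQSem)
import Control.Exception (bracket_)

-- Hypothetical stub standing in for the real scrapePage.
scrapePage :: Int -> IO [Int]
scrapePage n = pure [n * 10, n * 10 + 1]

-- Run an action over every input concurrently, with at most `limit`
-- actions in flight at a time. Results keep the order of the inputs.
boundedMap :: Int -> (a -> IO b) -> [a] -> IO [b]
boundedMap limit act xs = do
  sem <- newQSem limit
  -- bracket_ releases the semaphore slot even if the action throws.
  mapConcurrently (bracket_ (waitQSem sem) (signalQSem sem) . act) xs

main :: IO ()
main = do
  results <- concat <$> boundedMap 4 scrapePage [1 .. 20]
  print (length results)
```

Note that if any action throws an exception, mapConcurrently cancels the others and rethrows it.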