Concatenating scrapeURL results from multiple scrapes into one list


I am scraping https://books.toscrape.com using Haskell's Scalpel library. Here's my code so far:

import Text.HTML.Scalpel
import Data.List.Split (splitOn)
import Data.List (sortBy)
import Control.Monad (liftM2)

data Entry = Entry {entName :: String
                   , entPrice :: Float
                   , entRate :: Int
                   } deriving Eq

instance Show Entry where
  show (Entry n p r) = "Name: " ++ n ++ "\nPrice: " ++ show p ++ "\nRating: " ++ show r ++ "/5\n"

entries :: Maybe [Entry]
entries = Just []

scrapePage :: Int -> IO ()
scrapePage num = do
  items <- scrapeURL ("https://books.toscrape.com/catalogue/page-" ++ show num ++ ".html") allItems
  let sortedItems = items >>= Just . sortBy (\(Entry _ a _) (Entry _ b _) -> compare a b)
                          >>= Just . filter (\(Entry _ _ r) -> r == 5)
  maybe (return ()) (mapM_ print) sortedItems

allItems :: Scraper String [Entry]
allItems = chroots ("article" @: [hasClass "product_pod"]) $ do
    p <- text $ "p" @: [hasClass "price_color"]
    t <- attr "href" $ "a"
    star <- attr "class" $ "p" @: [hasClass "star-rating"]
    let fp = read $ flip (!!) 1 $ splitOn "£" p
    let fStar = drop 12 star
    return $ Entry t fp $ r fStar
      where
        r f = case f of
          "One" -> 1
          "Two" -> 2
          "Three" -> 3
          "Four" -> 4
          "Five" -> 5

main :: IO ()
main = mapM_ scrapePage [1..10]

Basically, allItems scrapes each book's title, price and rating, does some formatting on the price to get a Float, and returns an Entry. scrapePage takes a number corresponding to the result page number, scrapes that page to get an IO (Maybe [Entry]), formats it (in this case, filtering for 5-star books and sorting by price) and prints each Entry. main runs scrapePage over pages 1 to 10.

The problem I've run into is that my code scrapes, filters and sorts each page, whereas I want to scrape all the pages then filter and sort.

What worked for two pages (in GHCi) was:

i <- scrapeURL ("https://books.toscrape.com/catalogue/page-1.html") allItems
j <- scrapeURL ("https://books.toscrape.com/catalogue/page-2.html") allItems
liftM2 (++) i j

This returns a list composed of page 1 and 2's results that I could then print, but I don't know how to implement this for all 50 result pages. Help would be appreciated.
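The pairwise `liftM2 (++)` trick generalizes: a list of `Maybe` results can be collapsed with `sequence` (all pages must succeed) and then `concat`. A minimal sketch of just that combining step, with plain `Int` lists standing in for `[Entry]` and no network involved:

```haskell
-- Sketch: combining many Maybe-wrapped lists into one.
-- sequence :: [Maybe [a]] -> Maybe [[a]] turns a list of results into
-- one result, yielding Nothing if ANY page failed; concat flattens it.
combinePages :: [Maybe [Int]] -> Maybe [Int]
combinePages = fmap concat . sequence

main :: IO ()
main = do
  print (combinePages [Just [1, 2], Just [3]])  -- Just [1,2,3]
  print (combinePages [Just [1], Nothing])      -- Nothing
```

Note the all-or-nothing semantics: one failed page discards everything, which is why the answer below instead drops the `Maybe` per page with `maybeToList`.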

CodePudding user response:

Just return the entry list without any processing (or you can do filtering in this stage)

-- no error handling; needs: import Data.Maybe (maybeToList)
scrapePage :: Int -> IO [Entry]
scrapePage num =
  concat . maybeToList <$> scrapeURL ("https://books.toscrape.com/catalogue/page-" ++ show num ++ ".html") allItems

Then you can process them later together

-- needs: import Data.List (sortOn)
process = filter (\e -> entRate e == 5) . sortOn entPrice

main = do
  entries <- concat <$> mapM scrapePage [1 .. 10]
  print $ process entries
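To see what process does in isolation, here is a self-contained sketch run on made-up in-memory Entry values (no scraping; deriving Show is used in place of the question's custom instance):

```haskell
import Data.List (sortOn)

data Entry = Entry { entName :: String, entPrice :: Float, entRate :: Int }
  deriving (Eq, Show)

-- Sort all entries by price, then keep only the 5-star books.
process :: [Entry] -> [Entry]
process = filter (\e -> entRate e == 5) . sortOn entPrice

main :: IO ()
main = mapM_ print (process sample)
  where
    -- hypothetical sample data, not real scrape results
    sample = [ Entry "A" 30.0 5, Entry "B" 10.0 3, Entry "C" 12.5 5 ]
```

Running this prints the two 5-star entries, cheapest first ("C" before "A"); "B" is dropped by the rating filter.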

Moreover, you can easily make your code concurrent with mapConcurrently from Control.Concurrent.Async (in the async package):

main = do
 entries <- concat <$> mapConcurrently scrapePage [1 .. 20]
 print $ process entries