Home > Software engineering >  Optimizing multicriteria filtering for data with Pandas
Optimizing multicriteria filtering for data with Pandas

Time:04-10

I'm trying to filter data with Pandas using a list of values which are a couple of str book_tittle and int book_price :

import pandas as pd
import requests
from bs4 import BeautifulSoup


# settings_#############################################################################

isbn = {'9782756002484', '9782756025117', '9782756072449'}
url = 'https://www.abebooks.fr/servlet/SearchResults?sts=t&cm_sp=SearchF-_-NullResults-_-Results&isbn={}'
book_title = ["Mondes", "X-Wing"]
book_price = [100, 10]

#######################################################################################

### creation de lien à partir des codes ISBN#
def url_isbn(isbn):
    merged = []
    for link in isbn:
        link_isbn = url.format(link)
        merged.append(link_isbn)
    return merged

### scraping each url from url_isbn
def get_data():
    data = []
    for i in url_isbn(isbn):
        r = requests.get(i)
        soup = BeautifulSoup(r.text, 'html.parser')
        item = soup.find_all('div', {'class': 'result-data col-xs-9 cf'})
        for x in item:
            title = x.find('h2', {'class': 'title'}).text.replace('\n', '')
            price = x.find('p', {'class': 'item-price'}).text.replace('EUR ', '').replace(',', '.')
            url = 'https://www.abebooks.fr' x.find('a', {'itemprop': 'url'})['href']

            products = title, int(float(price)), url
            data.append(products)
    return data
###creating the dataframe
df = pd.DataFrame(get_data(), columns=["Titre", "Prix", "URL"])

###Filter data into the dataframe
for filtered in df:
    df_final_to_email = filtered[(df['Titre'].str.contains(book_title) & (df.Prix < book_price))]
    print(df_final_to_email)

I'm getting an error : TypeError : unhashable type : 'list'

I assume I cannot use a list for filtering in cause of the mix of data type, I tested with Tuple and dict, I get the same kind of error

I also try with df.query but it gives empty data frame

The filter will allow me to filter all the books which has "Mondes" in the title for a price < 100 but also all the books which contains "X-Wing" below a price of < 10, I'll also add more item to find with a price related.

Titre Prix
Mondes Infernaux 95,10
Star Wars, Mondes Infernaux 75,50
X-Wing Rogue 9,50
X-Wing Rogue Squadron 7,50

Nothing about the filtering, but do you know how I could figure the following :products = title, int(float(price)), url ? I had to use float as I'm not able to convert like int(price) as int, I'm a bit annoyed of having rounded down numbers in the dataframe. (if any moderator can tell me if I have to do another post for this specific need ? thank you)

Thank you for your kind help

CodePudding user response:

If you need all matching rows in the dataframe there's no need to use a for loop.

Maybe try something like this:

def find_book(str, price):
    return df[ (df['Titre'].str.contains(str)) & (df['Prix']<price) ]

# find all books containing the substring 'Wing' in the title with price <7
find_book('Wing', 7)

CodePudding user response:

The error resides in your filtering code:

df_final_to_email = filtered[(df['Titre'].str.contains(book_title) & (df.Prix < book_price))]

book_title is a list. .str.contains does not work with list. It works with a single string or a regex pattern.


If your intention is to find books with "Mondes" in the title and price under 100 or "X-Wing" in the title and price under 10, you can use the following filtering code:

###Filter data into the dataframe
cond = pd.Series([False] * len(df), index=df.index)
for title, price in zip(book_title, book_price):
    cond |= df["Titre"].str.contains(title) & df["Prix"].lt(price)

print(df[cond])
  • Related