How to compare cells with string formats in a .csv file and return the top five highest (Python)

I am scraping a web page and getting a list of authors with their ratings. I saved the data in a .csv file and would now like to process it and create a list of the five most-rated authors.

Here is what the .csv file looks like:

Here is what I have done so far:

import csv


with open ('goodreads-book.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)

    next(csv_reader)

    with open("TopFiveRatedAuthors.csv", 'w') as new_file:

        for line in csv_reader:
            rate = line[1]
            rate = rate[19:-8]
            # print(rate)
            if (rate) > ('100,000'):
                # print (rate)
                t =line
                csv_writer = csv.writer(new_file)
                csv_writer.writerow(line)

My question is about this line:

if (rate) > ('100,000'):

Right now it returns seemingly random rows. I would like to compare the cells dynamically and return only the five most-rated authors. I am quite new to this topic and would really appreciate any help.

CodePudding user response：

Python compares strings character by character, so comparing numeric strings does not always give the numeric result. For example, '10000' > '900' returns False because '1' sorts before '9'. To compare the values as numbers, strip the commas and convert them to integers with something like:

rate = rate[19:-8]
rate = int(rate.replace(',', ''))  # get rid of commas before conversion
if rate > 100000:  # compare integers
    csv_writer.writerow(line)
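To see the difference concretely, here is a small sketch. The helper name to_int is hypothetical and only used for illustration:

```python
def to_int(text):
    """Hypothetical helper: turn a string such as '615,027' into the integer 615027."""
    return int(text.replace(',', ''))


# Strings are compared character by character, so '1' versus '9' decides the result.
print('10000' > '900')                  # False

# After conversion the comparison is numeric.
print(to_int('10000') > to_int('900'))  # True
print(to_int('615,027') > 100000)       # True
```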

CodePudding user response：

You could probably split the Rate column on spaces. The fifth part should be the number of ratings. Then remove the commas and convert it to an integer. For example:

import csv

with open('goodreads-book.csv', 'r') as f_input, open('TopFiveRatedAuthors.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    header = next(csv_input)
    csv_output = csv.writer(f_output)
    csv_output.writerow(header)     # copy header to output

    for row in csv_input:
        rating = int(row[1].split(' ')[4].replace(',', ''))
        
        if rating > 100000:
            csv_output.writerow(row)

To allow this to be tested, please add a textual sample of your CSV file to the question. If there is a problem, add print(row) to see which row it fails on, and also print(row[1].split(' ')) to check that the cell is being split correctly.

For example, the cell 4.13 avg rating -- 615,027 ratings should be split into the list:

['4.13', 'avg', 'rating', '--', '615,027', 'ratings']

So index [4] is needed to get the number of ratings.
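Note that filtering on a fixed threshold such as 100000 keeps every row above it, not specifically the top five. One possible way to pick the five most-rated rows is heapq.nlargest from the standard library. The following is only a sketch: it assumes the rating text is in column index 1 and always has the form shown above, and the function names rating_count, top_rated and write_top_authors are hypothetical.

```python
import csv
import heapq


def rating_count(row):
    """Return the number of ratings from a cell such as
    '4.13 avg rating -- 615,027 ratings' (assumed format, column index 1)."""
    return int(row[1].split(' ')[4].replace(',', ''))


def top_rated(rows, n=5):
    """Return the n rows with the highest rating counts, largest first."""
    return heapq.nlargest(n, rows, key=rating_count)


def write_top_authors(src='goodreads-book.csv', dst='TopFiveRatedAuthors.csv', n=5):
    """Copy the header and the n most-rated rows from src to dst."""
    with open(src, 'r', newline='') as f_input, open(dst, 'w', newline='') as f_output:
        csv_input = csv.reader(f_input)
        csv_output = csv.writer(f_output)
        csv_output.writerow(next(csv_input))           # copy header to output
        csv_output.writerows(top_rated(csv_input, n))  # remaining rows are data
```

Calling write_top_authors() with the file names from the question would write the header plus the five most-rated rows. If a cell does not match the assumed format, rating_count raises an IndexError or ValueError, which points to the row that needs a closer look.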