Home > Software engineering >  Sort a CSV in Python by a column with a numerical value (and a header)
Sort a CSV in Python by a column with a numerical value (and a header)

Time:11-23

I have a csv file called "CleanReport.csv" with some sample data, shown here:

Name,Color,Age
Mark,Red,9999
Bob,Red,712
Alice,Green,1
Lisa,Pink,99
Jacob,Yellow,33
Corey,Orange,44

And using Python I'm trying to sort by the column 'Age' (at index 2) without loosing the header column's place at the beginning. Here is my code:

import csv
import operator

with open('CleanReport.csv', 'r') as sortrow:
  reader = csv.reader(sortrow.readlines(), delimiter=',')
  sortedlist = sorted(reader, key=operator.itemgetter(int(2)), reverse=True)

  for row in sortedlist:
    print(row)

I am very close. However, this is my result:

['Name', 'Color', 'Age']
['Mark', 'Red', '9999']
['Lisa', 'Pink', '99']
['Bob', 'Red', '712']
['Corey', 'Orange', '44']
['Jacob', 'Yellow', '333']
['Alice', 'Green', '1']

Where my desired result would be this:

['Name', 'Color', 'Age']
['Mark', 'Red', '9999']
['Bob', 'Red', '712']
['Jacob', 'Yellow', '333']
['Lisa', 'Pink', '99']
['Corey', 'Orange', '44']
['Alice', 'Green', '1']

As you can see it treats the contents of the third column as if they were a string, even after I cast it to int, and sorts them as if they were strings.

I also tried this code #sortedlist = sorted(reader, key=lambda x: int(x[2]), reverse=True), which sorts numerically as I want it to. However that only works if I remove the header row from the test data.

Is there an easier fix to this problem without having to

  1. take out the header row somehow and store it in a variable
  2. sort the data using #sortedlist = sorted(reader, key=lambda x: int(x[2]), reverse=True)
  3. create a new csv with the header row appended to it
  4. append each row in sortedlist to the same new csv one by one using a loop?

CodePudding user response:

One way of solving it is reading it as a dict with DictReader

import csv
from pprint import pp #not needed, but prints the dict better

with open('CleanReport.csv', 'r') as f:
  reader = csv.DictReader(f.read().splitlines(), delimiter=',')
  sorted_dict = sorted(reader, key=lambda x: int(x["Age"]), reverse=True)

  pp(sorted_dict)

output

[{'Name': 'Mark', 'Color': 'Red', 'Age': '9999'},
 {'Name': 'Bob', 'Color': 'Red', 'Age': '712'},
 {'Name': 'Lisa', 'Color': 'Pink', 'Age': '99'},
 {'Name': 'Corey', 'Color': 'Orange', 'Age': '44'},
 {'Name': 'Jacob', 'Color': 'Yellow', 'Age': '33'},
 {'Name': 'Alice', 'Color': 'Green', 'Age': '1'}]

CodePudding user response:

There's a general way to do this with just the CSV reader, and nothing else fancy, that'll get you:

  1. a header
  2. the "data" sorted like you want
  3. keep variables to a minimum, just one, sorted_list

Your code looks goods, but your use of the CSV reader is a little off. I'd definitely give the documentation for csv.reader a read and make note of what is going on in the example, and especially just giving the reader your file, and not calling readlines(). That example is also a good template for default values you don't need to specify (like, 'r', and delimter=',').

import csv

sorted_list = []
with open('CleanReport.csv', newline='') as f:
    reader = csv.reader(f)
    sorted_list.append(next(reader))  # add header as first row

    sorted_list.extend(
        sorted(reader, key=lambda row: int(row[2]), reverse=True)
    )

for row in sorted_list:
    print(row)

When I run that against your sample CSV, I get:

['Name', 'Color', 'Age']
['Mark', 'Red', '9999']
['Bob', 'Red', '712']
['Lisa', 'Pink', '99']
['Corey', 'Orange', '44']
['Jacob', 'Yellow', '33']
['Alice', 'Green', '1']

You were looking for an "easier fix" than just about the bare minimum number of steps it takes to solve this problem... it doesn't (cannot) get any easier than:

  1. read/store header
  2. read all "data" rows
  3. sort all "data" rows
  4. prepend header to sorted data rows

Also, if you all need is an automated way to sort a CSV file, I'm very fond of CSV command-line tools like csvkit's csvsort, and GoCSV's sort. Both can perform basic type inference and sort your second column as the integers they are. I'd written dozens and dozes of one-off Python scripts to handle mundane CSV processes, and there were usually small, stupid bugs. These will always get the job done right: just one command, or pipeline of commands; no more lines and lines of Python.

  • Related