Removing columns from csv when cut and csvfilter both stop before finishing


I have a large CSV file (800,000 rows, 160 columns). I'm trying to remove select columns while keeping all rows. I've tried two different methods--the standard cut command and csvfilter--but neither returns all the rows. In fact, they each return a different number of rows, with cut returning a dozen or so more than csvfilter, but both a little over 4,000.

I've looked at the original CSV around the rows where they stop, but I can't see anything that might make them choke: no quote marks, no special characters.

Can anyone suggest a reliable method to remove columns from a CSV, or a way to troubleshoot csvfilter and/or cut more effectively? I'm mostly working on a Mac, but can work on Windows as well.

CodePudding user response:

I recommend GoCSV's select command. It's already built for macOS/darwin, so go straight to the latest release and download the binary of your choice.
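The general shape is gocsv select -c <columns>, demoed below. Since you want to drop a handful of columns out of 160, note that select also takes an --exclude flag, so you can name the columns to remove rather than the ~150 to keep (the file and column names here are placeholders; double-check the flags against the GoCSV docs):

# keep only the listed columns (1-based indices or header names)
gocsv select -c 1,160 input.csv > output.csv
# drop the listed columns, keep everything else
gocsv select -c BadCol1,BadCol2 --exclude input.csv > output.csv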

I'm not sure why csvfilter would truncate your file, and I'm especially skeptical that cut would drop any lines, but I haven't tried it on 800K lines before.

Testing cut; comparing GoCSV

Here's a Python script to generate a CSV, large.csv, that is 800_000 rows by 160 columns:

with open('large.csv', 'w') as f:
    # Write header
    cols = ['Count']
    cols += [f'H{k+1}' for k in range(159)]
    f.write(','.join(cols) + '\n')

    # Write data
    for i in range(800_000):
        cols = [str(i+1)]
        cols += [f'C{k+1}' for k in range(159)]
        f.write(','.join(cols) + '\n')

Ensure large.csv has 800K lines (plus its header):

wc -l large.csv
  800001 large.csv

And with GoCSV's dims (dimensions) command:

gocsv dims large.csv 
Dimensions:
  Rows: 800000
  Columns: 160

(GoCSV always counts the first row/line as the "header" and doesn't include it in the row count, which has no effect on cutting/selecting columns.)

And now cutting the columns:

time cut -d ',' -f1,160 large.csv > cut.csv
cut -d, -f1,160 large.csv > cut.csv  8.10s user 0.38s system 99% cpu 8.483 total
time gocsv select -c 1,160 large.csv > gocsv_select.csv
gocsv select -c 1,160 large.csv > gocsv_select.csv  5.25s user 2.55s system 106% cpu 7.322 total

Compare the two methods:

cmp gocsv_select.csv cut.csv

and since cmp prints nothing, the two files are byte-for-byte identical, so looking at the head and tail of one counts for both:

head -n2 cut.csv 
Count,H159
1,C159

tail -n2 cut.csv 
799999,C159
800000,C159
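head and tail only prove the endpoints are intact, so for an explicit check that nothing in the middle was dropped, count the lines of the output; it should match the input's 800,001:

wc -l cut.csv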

So both did what looks like the right thing; specifically, cut didn't filter/drop any lines/rows. And GoCSV actually did it faster.

I'm curious what your cut command looks like, but I think the bigger point to stress is to use a CSV-aware tool whenever you can (always).
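If you'd rather not install anything, Python's standard csv module is CSV-aware too. Here's a minimal sketch (file names and kept-column indices are placeholders) that drops columns while correctly handling quoted fields, embedded commas, and embedded newlines--exactly the things that trip up cut:

import csv

# 0-based indices of the columns to keep; adjust for your file
KEEP = [0, 159]

with open('large.csv', newline='') as src, \
        open('kept.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        # csv.reader has already handled quoting, so a comma or
        # newline inside a quoted field won't split a row
        writer.writerow([row[i] for i in KEEP])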
