I have a large csv file (800,000 rows, 160 columns), and I'm trying to remove select columns while keeping all rows. I've tried two different methods--the standard cut command, and csvfilter--but neither of them will return all rows. In fact, they both return different numbers of rows, with cut returning a dozen or so more than csvfilter, but both a little over 4000.
I've looked at the original csv to try to see what might be making it choke, but I can't see anything: no quote marks in the row, no special characters.
Can anyone suggest a reliable method to remove columns from a csv or a way to more effectively troubleshoot csvfilter and/or cut? I'm mostly working on a Mac, but can work on Windows as well.
CodePudding user response:
I recommend GoCSV's select command. It's already built for macOS/darwin, so you can go straight to the latest release and download the binary of your choosing.
I'm not sure why csvfilter would truncate your file. I'm especially skeptical that cut would eliminate any line, but I haven't tried 800K lines before.
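One way line counts can diverge even with no visible special characters is an embedded newline inside a quoted field: line-oriented tools like cut and wc -l count physical lines, while a CSV parser counts logical records. A quick sketch (the sample data here is made up):

```python
import csv
import io

# A CSV with 2 data records; the second has a newline inside a quoted field.
data = 'id,notes\n1,plain\n2,"line one\nline two"\n'

# Line-oriented view (what cut and wc -l see): 4 physical lines.
print(len(data.splitlines()))  # 4

# CSV-aware view: 1 header record + 2 data records.
print(len(list(csv.reader(io.StringIO(data)))))  # 3
```

If your file has fields like that, cut will also slice the "wrong" lines apart, which could explain rows that appear to go missing or get mangled.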
Testing cut; comparing GoCSV
Here's a Python script to generate a CSV, large.csv, that is 800_000 rows by 160 columns:
with open('large.csv', 'w') as f:
    # Write header
    cols = ['Count']
    cols += [f'H{k + 1}' for k in range(159)]
    f.write(','.join(cols) + '\n')
    # Write data
    for i in range(800_000):
        cols = [str(i + 1)]
        cols += [f'C{k + 1}' for k in range(159)]
        f.write(','.join(cols) + '\n')
Ensure large.csv has all 800K lines (plus the header):
wc -l large.csv
800001 large.csv
And with GoCSV's dims (dimensions) command:
gocsv dims large.csv
Dimensions:
Rows: 800000
Columns: 160
(GoCSV always counts the first row/line as the "header", which doesn't have any effect for cutting/selecting columns)
And now cutting the columns:
time cut -d ',' -f1,160 large.csv > cut.csv
cut -d, -f1,160 large.csv > cut.csv 8.10s user 0.38s system 99% cpu 8.483 total
time gocsv select -c 1,160 large.csv > gocsv_select.csv
gocsv select -c 1,160 large.csv > gocsv_select.csv 5.25s user 2.55s system 106% cpu 7.322 total
Compare the two methods:
cmp gocsv_select.csv cut.csv
Since cmp prints nothing, the files are identical, so looking at the head and tail of one counts for both:
head -n2 cut.csv
Count,H159
1,C159
tail -n2 cut.csv
799999,C159
800000,C159
So both did what looks like the right thing; in particular, cut didn't filter/drop any lines/rows. And GoCSV actually did it faster.
I'm curious what your cut command looks like, but I think the bigger point to stress is to use a CSV-aware tool whenever you can (always).
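If installing another tool isn't an option, Python's standard-library csv module is also CSV-aware (it handles quoted fields and embedded newlines). A minimal sketch of keeping only selected columns by header name--the function name and column names are just for illustration:

```python
import csv

def select_columns(in_path, out_path, keep):
    """Copy in_path to out_path, keeping only the columns named in `keep`."""
    with open(in_path, newline='') as fin, \
         open(out_path, 'w', newline='') as fout:
        reader = csv.reader(fin)
        header = next(reader)
        # Map each wanted column name to its position in the header.
        idx = [header.index(name) for name in keep]
        writer = csv.writer(fout)
        writer.writerow(keep)
        for row in reader:
            writer.writerow([row[i] for i in idx])

# e.g. select_columns('large.csv', 'subset.csv', ['Count', 'H159'])
```

It won't be as fast as cut or GoCSV on 800K rows, but it will never split a record at the wrong place.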