I have 67 csv files with over 1gb of data combined. Each CSV contains 20 columns, but I need only 3 of those columns per csv file. I can successfully combine all of them into a single csv file, but this creates a mammoth file that I don't need. Instead, I want only 3 of those columns from all 67 files combined into a single file (of course, with only three columns).
My current error message is the index out of range error, but I don't know how to fix that within this code. Here is the code that I am using:
import glob
import pandas as pd
path = "C:/Correct/Path"
file_list = glob.glob(path "/*.csv")
csv_list = []
for file in file_list:
csv_list.append(pd.read_csv(file, delim_whitespace=1, usecols=[9], engine='python'))
csv_merged = pd.DataFrame()
for csv_file in csv_list:
csv_merged = csv_merged.append(
csv_file, ignore_index=True)
csv_merged.to_csv('all_counties.csv', index=False)
CodePudding user response:
You can easily write a little script with standard the library modules csv and pathlib
import csv
from pathlib import Path
p = Path('C:/Correct/Path')
file_list = p.glob("*.csv")
desired_columns = ['foo', 'bar', 'baz']
desired_rows = []
for csv_file in file_list:
with open(csv_file, 'r') as f:
reader = csv.DictReader(f)
for row in reader:
desired_rows.append({c: row[c] for c in desired_columns})
with open('output.csv', 'w', newline='') as f:
writer = csv.DictWriter(f, fieldnames=desired_columns)
writer.writeheader()
writer.writerows(desired_rows)
input of two different csvs:
foo,bar,baz,spam,eggs
1,2,3,4,5
1,2,3,4,5
1,2,3,4,5
1,2,3,4,5
1,2,3,4,5
1,2,3,4,5
foo,baz,bar,spam,eggs,unused
1,2,3,4,5,6
1,2,3,4,5,6
1,2,3,4,5,6
1,2,3,4,5,6
1,2,3,4,5,6
output:
foo,bar,baz
1,2,3
1,2,3
1,2,3
1,2,3
1,2,3
1,2,3
1,2,3
1,2,3
1,2,3
1,2,3
1,2,3
1,2,3
1,3,2
1,3,2
1,3,2
1,3,2
1,3,2
1,3,2
1,3,2
1,3,2
1,3,2
1,3,2
From there you've got your cleaned data and are free to use pandas as you see fit.
E:
To remove duplicate dicts from desired_rows
you have a few options, depending on the dicts themselves. The general gist of it would be
new_list = []
for d in desired_rows:
if d not in new_list:
new_list.append(d)