Read from one csv file and write to different multiple csv files depending on content-CodePudding

I have a large csv file which I am reading in. Depending on the first two values of the string in column1 I want to output the content to different csv files.

A file could look like this:

Column1;Column2
01;BE
02;ED
12;FD
14;DS
03;ED
04;DF

My code is as follows:

import csv
output_path=r'C:\myfolder\large_file.csv'

with open(os.path.join(os.path.dirname(output_path),"column1_01.csv"), "w", encoding="utf-8", newline='') as \
out_01, open(os.path.join(os.path.dirname(output_path),"column1_02.csv"), "w", encoding="utf-8", newline='') as \
out_02, open(os.path.join(os.path.dirname(output_path),"column1_03.csv"), "w", encoding="utf-8", newline='') as \
out_03, open(os.path.join(os.path.dirname(output_path),"column1_04.csv"), "w", encoding="utf-8", newline='') as \
out_04:
    
    cw01 = csv.writer(out_01, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    cw02 = csv.writer(out_02, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    cw03 = csv.writer(out_03, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    cw04 = csv.writer(out_04, delimiter=";", quoting=csv.QUOTE_MINIMAL)

    with open(output_path, encoding="utf-8") as in_f:
        cr = csv.reader(in_f, delimiter=";")
        cw01.writerow(next(cr))
        cw02.writerow(next(cr))
        cw03.writerow(next(cr))
        cw04.writerow(next(cr))

        for line in cr:
            if (line[0][:2] =="01"): cw01.writerow(line)  
            if (line[0][:2] =="02"): cw02.writerow(line)  
            if (line[0][:2] =="03"): cw03.writerow(line)  
            if (line[0][:2] =="04"): cw04.writerow(line)

My problem now is that when I add the next line, the output for "05" I get an error stopiteration (I checked it with sample data and there isn't much data rows in it, so I guess it has to do something with it). Furthermore the main problem is when checking the files I can see that it did not work. It does not ouput the records into the correct files properly. Where is my mistake?

I think the problem might be all the next(cr). What I want to do is just to use the header from the original large csv and have it for each csv. The header is the same. I need all files at the end with the header.

I need a pure csv solution. No other packages.

I would like to find the precise error in my code. Why is this not working and where is the mistake? I do not want any further changes introducing exception handling like try, functions like def or any other. I do not need a generic solution. I want to find the specific error in my code.

The original large file has many columns, so the header is quite long. Therefore I would like a solution where I do not have to manually type in all the columuns to add it to an header.

CodePudding user response：

Python built-in function next Retrieve the next item from the iterator by calling its __next__() method. Thus, the iterator move fowards to the next element. Hence you call next to many times to get the header.

[...]
    with open(output_path, 'r', encoding="utf-8") as in_f:
        cr = csv.reader(in_f, delimiter=";")
        header = next(cr)
        cw01.writerow(header)
        cw02.writerow(header)
        cw03.writerow(header)
        cw04.writerow(header)

        for line in cr:
[...]

Hoping that helps.

CodePudding user response：

What happens when you do this? I shuffled your code around to make it generic, so that you can give any list of cols and it will make 1 output file per col, and look up the col in a dictionary of csv writers.

import csv
output_path = r'C:\myfolder\large_file.csv'


def get_fhandle(col_name):
  return open(
    os.path.join(os.path.dirname(output_path), f"column1_{col_name}.csv"),
    "w", encoding="utf-8", newline=''
  )

cols = ["01", "02", "03", "04", "05"]
files = {col: get_fhandle(col) for col in cols}
try:
  writers = {
    col: csv.writer(file, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    for col, file in files.items()
  }
  with open(output_path, encoding="utf-8") as in_f:
    cr = csv.reader(in_f, delimiter=";")
    header = next(cr)
    for writer in writers.values():
      writer.writerow(header)
    for line in cr:
      col = line[0][:2]
      if col not in writers:
        raise ValueError(
          f"'{col}' is not a known column. Should be: {cols.join(', ')}"
        )
      writers[col].writerow(line)
finally:
  for file in files.values():
    file.close()

CodePudding user response：

The use of contextlib.ExitStack() allows one to clean things up substantially, particularly if we set up a little dictionary to map what keys in our input map to what output file we want to write to.

import contextlib
import csv

files_out = {
    "01": "column1_01.csv",
    "02": "column1_02.csv",
    "03": "column1_03.csv",
    "04": "column1_04.csv",
}

with contextlib.ExitStack() as stack:
    ## ---------------------
    ## Set up a bunch of csv writers
    ## ---------------------
    writers = {
        key: csv.writer(stack.enter_context(open(value, "w", newline="")))
        for key, value
        in files_out.items()
    }
    ## ---------------------

    file_in = stack.enter_context(open("small.csv"))
    rows = csv.reader(file_in, delimiter=";")

    ## ---------------------
    ## write a header to each output file
    ## ---------------------
    headers = next(rows)
    for writer in writers.values():
        writer.writerow(headers)
    ## ---------------------

    ## ---------------------
    ## write rows to the appropriate file
    ## ---------------------
    for row in rows:
        key = row[0]
        if not key in writers:
            print(f"no file defined for key {key}")
            continue
        writers[key].writerow(row)
    ## ---------------------

To answer your specific question of why your code does not currently work, let's look at:

with open(output_path, encoding="utf-8") as in_f:
    cr = csv.reader(in_f, delimiter=";")
    cw01.writerow(next(cr))
    cw02.writerow(next(cr))
    cw03.writerow(next(cr))
    cw04.writerow(next(cr))

    for line in cr:
        if (line[0][:2] =="01"): cw01.writerow(line)  
        if (line[0][:2] =="02"): cw02.writerow(line)  
        if (line[0][:2] =="03"): cw03.writerow(line)  
        if (line[0][:2] =="04"): cw04.writerow(line)

here:

cw01.writerow(next(cr))

essentially writes the first row (the header) of your input file to the first output file. We then repeat this writing successive rows of input to the outputs.

After writing the first 4 lines of input to one line in each of the four outputs we then read the remaining lines in the input file (the last two in the example) and write them to files 3 and 4 as appropriate

So, your specific solution would be to use next() once to get the header from your input file and then write that to each of your output files:

with open(output_path, encoding="utf-8") as in_f:
    cr = csv.reader(in_f, delimiter=";")
    headers = next(cr)

    cw01.writerow(headers)
    cw02.writerow(headers)
    cw03.writerow(headers)
    cw04.writerow(headers)

    for line in cr:
        if (line[0][:2] =="01"): cw01.writerow(line)  
        if (line[0][:2] =="02"): cw02.writerow(line)  
        if (line[0][:2] =="03"): cw03.writerow(line)  
        if (line[0][:2] =="04"): cw04.writerow(line)