insert line break in python file after a fix number of elements to delimit columns in a csv file-CodePudding

I´ve been struggling a bit to find a way in python to force this file to create a jump to a new line after some number of elements (equal to the number of columns I will need to add which is 12) the CSV currently looks like this.

the text of the first row looks like this .

D276",31386,10610,12122021 00:00:47840 85,0.00 842646,M000395708109323,ACTIVE CARD CHECK,844-6593879,NY,59655,840 6511011091718056,D276,31386,10610,12122021 00:00:59840Y00,5.36-842647,M527021000201360,Etsy.com - TheCraftyCa Brooklyn,NY,56995,840 6511011091718056,D276,86495,29807,12122021 00:08:22840N51,11.99-842648,M248747000103177,GOOGLE *YouTubePremium g.co/helppay# CA,78295,840 6511016547548056,D276,29969,10038,12122021 00:27:19840 57,11.30-842649,M000445474354997,SPOTIFY,NEW YORK,NY,48995,840 6511010952148056,D276,62521,21152,12122021 00:28:54840N51,5.40-842650,M527021000211443,Google Play,Mountain View CA,58175,840 6511014173278056,D276,802,701,12122021 00:30:38840Y00,49.67-842651,M235251000762203,AMZN Mktp US,Amzn.com/bill WA,59425,840 6511010003058056,D276,114710,41280,12122021 00:31:22840Z00,21.92-842652,M000445488848992,DD *DOORDASH MCDONALDS SAN FRANCISCO CA,58125,840 6511019296778056,D276,125175,45529,12122021 00:31:50840Y05,0.00 842653,M145376000144509,PLAYSTATION NETWORK,800-345-7669 CA,58165,840 6511020299078056,D276,125175,45529,12122021 00:32:07840Y57,21.44-842654,M145376000144509,PLAYSTATION NETWORK,800-345-7669 CA,58165,840 6511020299078056,D276,125175,45529,12122021 00:32:08840Y57,21.44-842299,M527021000222747,PlaystationNetwork,San Mateo,CA,58185,840 6511020299078056,D276,125175,45529,12122021 00:32:09840Y57,21.44-842300,M527021000222747,PlaystationNetwork,San Mateo,CA,58185,840 6511020299078056,D276,125175,45529,12122021 00:32:09840 57,0.00 842655,MCARD ACCPT IDC,Sony - Playstation N.. St. Louis,USA,59695,840 6511020299078056,D276,125175,45529,12122021 00:32:27840Y57,21.44-842301,M145376000144509,PLAYSTATION NETWORK,800-345-7669 CA,58165,840 6511020299078056,D276,125175,45529,12122021 00:32:28840Y57,21.44-842657,M527021000222747,PlaystationNetwork,San Mateo,CA,58185,840 6511020299078056,D276,125175,45529,12122021 00:32:28840Y57,21.44-842656,M527021000222747,PlaystationNetwork,San Mateo,CA,58185,840 6511020299078056,D276,125175,45529,12122021 00:32:29840 57,0.00 842658,MCARD ACCPT IDC,Sony - Playstation N.. St. Louis,USA,59695,840 6511020299078056,D276,112802,40216,12122021 00:32:30840Y00,6.49-842659,M784959000762203,Amazon.com,Amzn.com/bill WA,59425,840 6511019112388056,D276,120407,44199,12122021 00:35:24840 05,3.12-67433,P536385810103481,MILLS FOOD CENTER,OAKLAND,CA,54115,840 6511019841028056,D276,120407,44199,12122021 00:35:48840 05,2.29-67434,P536385810103481,MILLS FOOD CENTER,OAKLAND,CA,54115,840 6511019841028056,D276,129143,47047,12122021

and I would like it to look something like this

Continuing until it completed all the registers in the original file.

CodePudding user response：

The first thing I would try is simply split the line on commas, and write records using csv.writer, calling .writerow() with twelve elements at a time. I notice you have a double quote at the beginning, but not later, so this approach might be good enough, you would just have to remove that double quote. Of course, if any field in your file has commas within its text, my suggestion will fall appart, but it's a place to start, since you seem to be trying to fix one specific file, rather than solving a general problem.

Here's my implementation of that suggestion:

import csv

out_f = open("fixed-csv.txt", mode="w")
writer = csv.writer(out_f)
with open("bad-csv.txt") as in_f:
    for line in in_f:
        fields = line.strip("\n\r").split(",")
        for position in range(0, len(fields), 12):
            writer.writerow(fields[position:position 12])

Now, I noticed running that code that you don't actually have exactly 12 columns per row, it's more like 10 or 11, and it's not constant.

Here's a variant that looks for D276 and makes it the first column of each row:

import csv

out_f = open("fixed-csv-2.txt", mode="w")
writer = csv.writer(out_f)
with open("bad-csv.txt") as in_f:
    for line in in_f:
        fields = line.strip("\n\r").split(",")
        d276_positions = [
            i
            for i, value in enumerate(fields)
            if i == 0 or value == "D276"
        ]
        d276_positions.append(len(fields))
        for start, end in zip(d276_positions, d276_positions[1:]):
            writer.writerow(fields[start:end])

I don't imagine all your data will have D276 as the first value in the row, so you might have to change if i == 0 or value == "D276" to something that more generally locates the field that flags a new row, but this code should set you up with that you need to solve your problem, assuming, as I said at the beginning, that you don't have commas inside any fields in your whole data file.

If you do have commas in some of your fields, I would manually edit the output file with a text editor and patch the problems by hand. If there aren't too many of them, it shouldn't be a lot of work.

CodePudding user response：

Here is a pandas numpy approach.

import io
import numpy as np
import pandas as pd
data =""""
D276",31386,10610,12122021 00:00:47840 85,...
"""

df = pd.read_csv(io.StringIO(data), delimiter=",", quoting=3, header=None)

# resize array to multiple of cols
cols = 11
remainder = cols - df.shape[1] % cols
values = np.append(df.to_numpy(), np.empty((1,remainder)))

df_reshaped = pd.DataFrame(values.reshape((-1,cols)))

Output:

0   1   2   3   4   5   6   7   8   9   10
0   D276"   31386   10610   12122021 00:00:47840 85 0.00 842646 M000395708109323    ACTIVE CARD CHECK   844-6593879 NY  59655   840 6511011091718056
1   D276    31386   10610   12122021 00:00:59840Y00 5.36-842647 M527021000201360    Etsy.com - TheCraftyCa Brooklyn NY  56995   840 6511011091718056    D276
2   86495   29807   12122021 00:08:22840N51 11.99-842648    M248747000103177    GOOGLE *YouTubePremium g.co/helppay# CA 78295   840 6511016547548056    D276    29969   10038

Please notice, breaking after cols columns works only for the first two rows. You might want to consider another criteria e. g. 'D276', as in joanis answer.