I´ve been struggling a bit to find a way in python to force this file to create a jump to a new line after some number of elements (equal to the number of columns I will need to add which is 12) the CSV currently looks like this.
the text of the first row looks like this .
D276",31386,10610,12122021 00:00:47840 85,0.00 842646,M000395708109323,ACTIVE CARD CHECK,844-6593879,NY,59655,840 6511011091718056,D276,31386,10610,12122021 00:00:59840Y00,5.36-842647,M527021000201360,Etsy.com - TheCraftyCa Brooklyn,NY,56995,840 6511011091718056,D276,86495,29807,12122021 00:08:22840N51,11.99-842648,M248747000103177,GOOGLE *YouTubePremium g.co/helppay# CA,78295,840 6511016547548056,D276,29969,10038,12122021 00:27:19840 57,11.30-842649,M000445474354997,SPOTIFY,NEW YORK,NY,48995,840 6511010952148056,D276,62521,21152,12122021 00:28:54840N51,5.40-842650,M527021000211443,Google Play,Mountain View CA,58175,840 6511014173278056,D276,802,701,12122021 00:30:38840Y00,49.67-842651,M235251000762203,AMZN Mktp US,Amzn.com/bill WA,59425,840 6511010003058056,D276,114710,41280,12122021 00:31:22840Z00,21.92-842652,M000445488848992,DD *DOORDASH MCDONALDS SAN FRANCISCO CA,58125,840 6511019296778056,D276,125175,45529,12122021 00:31:50840Y05,0.00 842653,M145376000144509,PLAYSTATION NETWORK,800-345-7669 CA,58165,840 6511020299078056,D276,125175,45529,12122021 00:32:07840Y57,21.44-842654,M145376000144509,PLAYSTATION NETWORK,800-345-7669 CA,58165,840 6511020299078056,D276,125175,45529,12122021 00:32:08840Y57,21.44-842299,M527021000222747,PlaystationNetwork,San Mateo,CA,58185,840 6511020299078056,D276,125175,45529,12122021 00:32:09840Y57,21.44-842300,M527021000222747,PlaystationNetwork,San Mateo,CA,58185,840 6511020299078056,D276,125175,45529,12122021 00:32:09840 57,0.00 842655,MCARD ACCPT IDC,Sony - Playstation N.. St. Louis,USA,59695,840 6511020299078056,D276,125175,45529,12122021 00:32:27840Y57,21.44-842301,M145376000144509,PLAYSTATION NETWORK,800-345-7669 CA,58165,840 6511020299078056,D276,125175,45529,12122021 00:32:28840Y57,21.44-842657,M527021000222747,PlaystationNetwork,San Mateo,CA,58185,840 6511020299078056,D276,125175,45529,12122021 00:32:28840Y57,21.44-842656,M527021000222747,PlaystationNetwork,San Mateo,CA,58185,840 6511020299078056,D276,125175,45529,12122021 00:32:29840 57,0.00 842658,MCARD ACCPT IDC,Sony - Playstation N.. St. Louis,USA,59695,840 6511020299078056,D276,112802,40216,12122021 00:32:30840Y00,6.49-842659,M784959000762203,Amazon.com,Amzn.com/bill WA,59425,840 6511019112388056,D276,120407,44199,12122021 00:35:24840 05,3.12-67433,P536385810103481,MILLS FOOD CENTER,OAKLAND,CA,54115,840 6511019841028056,D276,120407,44199,12122021 00:35:48840 05,2.29-67434,P536385810103481,MILLS FOOD CENTER,OAKLAND,CA,54115,840 6511019841028056,D276,129143,47047,12122021
and I would like it to look something like this
Continuing until it completed all the registers in the original file.
CodePudding user response:
The first thing I would try is simply split the line on commas, and write records using csv.writer
, calling .writerow()
with twelve elements at a time. I notice you have a double quote at the beginning, but not later, so this approach might be good enough, you would just have to remove that double quote. Of course, if any field in your file has commas within its text, my suggestion will fall appart, but it's a place to start, since you seem to be trying to fix one specific file, rather than solving a general problem.
Here's my implementation of that suggestion:
import csv
out_f = open("fixed-csv.txt", mode="w")
writer = csv.writer(out_f)
with open("bad-csv.txt") as in_f:
for line in in_f:
fields = line.strip("\n\r").split(",")
for position in range(0, len(fields), 12):
writer.writerow(fields[position:position 12])
Now, I noticed running that code that you don't actually have exactly 12 columns per row, it's more like 10 or 11, and it's not constant.
Here's a variant that looks for D276
and makes it the first column of each row:
import csv
out_f = open("fixed-csv-2.txt", mode="w")
writer = csv.writer(out_f)
with open("bad-csv.txt") as in_f:
for line in in_f:
fields = line.strip("\n\r").split(",")
d276_positions = [
i
for i, value in enumerate(fields)
if i == 0 or value == "D276"
]
d276_positions.append(len(fields))
for start, end in zip(d276_positions, d276_positions[1:]):
writer.writerow(fields[start:end])
I don't imagine all your data will have D276
as the first value in the row, so you might have to change if i == 0 or value == "D276"
to something that more generally locates the field that flags a new row, but this code should set you up with that you need to solve your problem, assuming, as I said at the beginning, that you don't have commas inside any fields in your whole data file.
If you do have commas in some of your fields, I would manually edit the output file with a text editor and patch the problems by hand. If there aren't too many of them, it shouldn't be a lot of work.
CodePudding user response:
Here is a pandas
numpy
approach.
import io
import numpy as np
import pandas as pd
data =""""
D276",31386,10610,12122021 00:00:47840 85,...
"""
df = pd.read_csv(io.StringIO(data), delimiter=",", quoting=3, header=None)
# resize array to multiple of cols
cols = 11
remainder = cols - df.shape[1] % cols
values = np.append(df.to_numpy(), np.empty((1,remainder)))
df_reshaped = pd.DataFrame(values.reshape((-1,cols)))
Output:
0 1 2 3 4 5 6 7 8 9 10
0 D276" 31386 10610 12122021 00:00:47840 85 0.00 842646 M000395708109323 ACTIVE CARD CHECK 844-6593879 NY 59655 840 6511011091718056
1 D276 31386 10610 12122021 00:00:59840Y00 5.36-842647 M527021000201360 Etsy.com - TheCraftyCa Brooklyn NY 56995 840 6511011091718056 D276
2 86495 29807 12122021 00:08:22840N51 11.99-842648 M248747000103177 GOOGLE *YouTubePremium g.co/helppay# CA 78295 840 6511016547548056 D276 29969 10038
Please notice, breaking after cols
columns works only for the first two rows. You might want to consider another criteria e. g. 'D276'
, as in joanis answer.