I'm working on a Python program that needs to read CSV files produced on a Windows 2012 server. The aim of the Python code is to report the min/max of each column in the data, for data quality reasons, so I cannot change the source data.
The delimiter of the data file is |-|. Yes, it's a multi-character delimiter. The data is encoded as latin-1. And recently I found out some of the columns can contain carriage returns, which I also have to deal with in my program.
Here is an example of the data:
123456781|-|!|-|04/08/2022 00:00:00|-|04/22/2022 00:00:00|-|04/08/2022 00:00:00|-|2|-|987654|-|12|-|12345|-|D|-|J|-|N|-|12.3400|-|0.0000|-|0.0000|-|1.2300|-|12345.6700|-|J|-|N|-|J|-|N|-||-|23.4500|-|J|-|04/06/2022 11:00:00|-|ABC001 |-||-||-|0.0000
123456782|-|"|-|04/19/2022 00:00:00|-|05/02/2022 00:00:00|-|04/19/2022 00:00:00|-|2|-|456789|-|12|-|
98765|-|D|-|J|-|N|-|6.7800|-|0.0000|-|0.0000|-|6.7800|-|12345.6700|-|J|-| |-|J|-| |-||-|6.7800|-|J|-|04/19/2022 10:00:00|-|JOHNDOE|-||-||-|0.0000
123456785|-|"|-|04/17/2022 00:00:00|-|05/01/2022 00:00:00|-|04/17/2022 00:00:00|-|11|-|0|-|0|-||-|D|-|N|-|N|-|0.0000|-|0.0000|-|0.0000|-|8.7600|-|54321.1200|-|N|-|N|-|J|-|N|-||-|0.0000|-|N|-|04/15/2022 13:45:00|-|ABC001 |-||-||-|0.0000
123456787|-|'|-|04/20/2022 00:00:00|-|05/03/2022 00:00:00|-|04/19/2022 00:00:00|-|2|-|345678|-|12|-|87654|-|D|-|J|-|N|-|4.7800|-|0.0000|-|0.0000|-|4.7800|-|34567.8900|-|J|-| |-|J|-| |-||-|4.7800|-|J|-|04/20/2022 10:00:00|-|DEF003|-||-||-|0.0000
You cannot see it here, but in the line that starts with 123456782 there is a carriage return after |-|12|-| and before 98765|-|D. In Notepad with "Show End of Line" symbols it shows a CR between these values, and all the lines end with CRLF.
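To double-check this outside Notepad, a quick byte-level count confirms the stray CRs (just a sketch, using the same file path as in the code below):
import os  # not needed, only open() and bytes methods are used

# Sketch: count CRLF line endings vs. bare CRs in the raw bytes
with open(r"D:\Data\file_20220412.dat", "rb") as f:
    raw = f.read()
crlf = raw.count(b"\r\n")
bare_cr = raw.count(b"\r") - crlf
print(crlf, "CRLF line endings,", bare_cr, "stray CRs inside fields")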
When I read this data with Pandas read_csv, it regards the second half of the second line as a new line. I've tried different settings of the lineterminator argument. No dice. lineterminator="\n", lineterminator="\r" and lineterminator="\r\n" all produce the same results.
Example Pandas read_csv statement I've used:
import pandas as pd
import re

csvfile = r"D:\Data\file_20220412.dat"  # raw string, otherwise \f in the path becomes a form feed
separator = "|-|"
separator = re.escape(separator)  # a multi-char sep is treated as a regex, so escape the pipes
df = pd.read_csv(csvfile, sep=separator, header="infer", encoding="latin-1", lineterminator="\n")
So colleagues have suggested using csv.reader instead. First problem with that: no multi-character delimiters allowed! But we can work around that with a replace (the file has to be opened first; iterating over the file name string itself would yield single characters):
import csv

with open(csvfile, encoding="latin-1") as f:
    # Replace the multi-char delimiter on the fly, then parse as ordinary CSV
    rows = list(csv.reader((line.replace("|-|", ",") for line in f), delimiter=","))
Then weird stuff happens. It reads some of the data correctly into 29 columns, but some rows are collapsed into 2 columns. As you can see, some of the rows contain a double quote in the second column. When you read the data with csv.reader and then write it to an output file, it turns out extra double quotes are added, which makes the rest of the code regard it as 2 columns.
"123456785|-|""|-|04/17/2022 00:00:00|-|05/01/2022 00:00:00|-|04/17/2022 00:00:00|-|11|-|0|-|0|-||-|D|-|N|-|N|-|0.0000|-|0.0000|-|0.0000|-|8.7600|-|54321.1200|-|N|-|N|-|J|-|N|-||-|0.0000|-|N|-|04/15/2022 13:45:00|-|ABC001 |-||-||-|0.0000"
I really hope I don't need some kind of regex solution to undo the extra double quotes csv.reader introduces.
I'm looking for two possible solutions:
- (Which I prefer) Make Pandas read_csv stop interpreting the carriage return as a line break.
- Make csv.reader stop adding extra double quotes.
Any ideas?
CodePudding user response:
I've found a possible solution. I managed to open the file in Python without any CSV treatment and replace the carriage returns myself. The newline="\n" made it possible to pick up the \r characters next to delimiters.
separator="|-|"
data = open(csvfile, "r", encoding="latin-1", newline="\n")
data = data.read()
data = re.sub(r"\|\-\|[\r]", separator, data)
data = re.sub('\|\-\|', ',', data)
print(data)
Now I can feed this string to read_csv and it should be treated as a regular CSV file, something like the sketch below.
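A minimal sketch (assuming the file's first line is a header row; otherwise pass header=None):
import io
import pandas as pd

# Feed the cleaned string to read_csv through an in-memory buffer
df = pd.read_csv(io.StringIO(data), header=0)

# The actual goal: min/max per column for the data quality check
print(df.min())
print(df.max())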
Maybe I don't even have to scan for carriage returns next to delimiters (because now I'm realizing they won't always be there in that column).
I'll try more next week.
CodePudding user response:
You could try using the newline='' mode to keep the newline characters; these can then be removed easily. Secondly, use quoting=csv.QUOTE_NONE to disable quote processing. Lastly, remove any columns seen with just -.
import pandas as pd
import io
import csv

rows = []

with open('e_carriagereturn_20220430.dat', encoding='latin-1', newline='') as f_input:
    # Remove every CR: the stray ones inside fields and the ones in the CRLF endings
    data = f_input.read().replace('\r', '')

# Disable quote handling so the bare " and ' fields come through untouched
csv_input = csv.reader(io.StringIO(data), delimiter='|', quoting=csv.QUOTE_NONE)
for row in csv_input:
    # Splitting '|-|' on '|' leaves '-' fillers between real values; drop them
    rows.append([value for value in row if value != '-'])

df = pd.DataFrame(rows)
print(df)
Giving:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
0 752296019 " 04/15/2022 00:00:00 04/28/2022 00:00:00 04/15/2022 00:00:00 13 0 0 A J J 0.0000 0.0000 0.0000 1.2300 123456.2700 J J -23.4500 N 04/19/2022 12:00:41 AEINSTEIN1 0.0000
1 752296020 " 03/31/2022 00:00:00 04/13/2022 00:00:00 03/31/2022 00:00:00 1 359542 12 318047.01 A J J 543.2100 0.0000 0.0000 32.1000 244680.4400 J J 543.2100 J 04/01/2022 12:44:42 PKDICK1 0.0000
2 752296032 ! 04/08/2022 00:00:00 04/22/2022 00:00:00 04/08/2022 00:00:00 2 222856 12 54321 A J N 26.8700 0.0000 0.0000 1.2800 38068.8800 J N J N 26.8700 J 04/06/2022 12:00:32 ABC003 0.0000
3 752296044 " 04/19/2022 00:00:00 05/02/2022 00:00:00 04/19/2022 00:00:00 2 222857 12 34877 D J N 6.7800 0.0000 0.0000 6.7800 122345.3500 J J 6.7800 J 04/19/2022 12:00:49 WGIBSON 0.0000
4 752296098 ! 04/17/2022 00:00:00 05/01/2022 00:00:00 04/17/2022 00:00:00 13 0 0 D N N 0.0000 0.0000 0.0000 8.7000 79689.4800 N N J N 0.0000 N 04/15/2022 12:24:58 ABC003 0.0000
5 431807560 " 04/12/2022 00:00:00 04/21/2022 00:00:00 04/12/2022 00:00:00 5 0 0 D J N 16.9600 0.0000 0.0000 0.8500 10919.6900 J J 16.7800 N 04/13/2022 14:49:44 FHERBERT 0.0000
6 431807563 ! 04/17/2022 00:00:00 05/01/2022 00:00:00 04/17/2022 00:00:00 11 0 0 D N N 0.0000 0.0000 0.0000 2.6700 31790.1600 N N J N 0.0000 N 04/15/2022 12:44:56 ABC003 0.0000
7 431807594 " 03/28/2022 00:00:00 04/11/2022 00:00:00 03/28/2022 00:00:00 1 580807 12 12345AB12345AB D J J 193.8200 0.0000 0.0000 19.3800 276921.4800 J J 193.8200 J 03/29/2022 12:00:38 WGIBSON 0.0000
8 431807597 " 04/19/2022 00:00:00 05/02/2022 00:00:00 04/19/2022 00:00:00 1 107348 12 12.45671/AB D J J 6.7800 0.0000 0.0000 6.7800 87133.8200 J J 6.7800 J 04/15/2022 12:22:35 UKLEGUIN 0.0000
9 679785779 " 03/18/2022 00:00:00 04/01/2022 00:00:00 03/18/2022 00:00:00 13 0 0 B N N 0.0000 0.0000 0.0000 9.3300 142940.7700 N N J N 0.0000 N 04/20/2022 08:04:02 AHUXLEY 0.0000
10 679785789 ! 04/15/2022 00:00:00 04/29/2022 00:00:00 04/15/2022 00:00:00 2 4876321 12 488250/CD D J N 876.5800 0.0000 0.0000 16.7800 200604.8900 J N J N 876.5400 J 04/13/2022 12:28:49 ABC003 0.0000
11 665661904 ! 04/15/2022 00:00:00 04/29/2022 00:00:00 04/15/2022 00:00:00 2 394132 12 46409 EF D J N 567.9800 0.0000 0.0000 9.1600 513561.4600 J N J N 567.8700 J 04/13/2022 12:24:37 ABC003 0.0000
12 665661909 " 03/25/2022 00:00:00 04/01/2022 00:00:00 03/25/2022 00:00:00 14 216308 12 97745894XY D J J 0.0000 0.0000 0.0000 11.4500 208666.1300 J J 0.0000 J 03/25/2022 12:25:03 FHERBERT 0.0000
13 665661934 ! 04/19/2022 00:00:00 05/02/2022 00:00:00 04/19/2022 00:00:00 2 627911 12 abc/21.4177 D J N 54.3200 0.0000 0.0000 23.4500 333689.0000 J N J N 54.3200 J 04/14/2022 23:15:20 ABC003 0.0000
14 665661945 ! 03/25/2022 00:00:00 04/07/2022 00:00:00 03/25/2022 00:00:00 1 3074312 12 923088/ABC D J J 199.2600 0.0000 0.0000 14.5600 850785.1500 J N J N 189.0120 J 03/25/2022 11:48:55 ABC003 0.0000
15 665661965 ! 04/22/2022 00:00:00 05/06/2022 00:00:00 04/22/2022 00:00:00 1 627921 12 27160 D J J 567.3400 0.0000 0.0000 45.6800 2252133.2900 J N J N 567.3400 J 04/20/2022 12:43:09 ABC003 0.0000
16 665661976 ! 04/22/2022 00:00:00 05/06/2022 00:00:00 04/22/2022 00:00:00 2 627942 12 1734793zy D J N 223.4800 0.0000 0.0000 23.4500 416715.9100 J J 234.5600 J 04/21/2022 12:04:19 ABC003 0.0000
17 665661978 ! 04/29/2022 00:00:00 05/13/2022 00:00:00 04/29/2022 00:00:00 2 627998 12 44524 fg D J N 226.3000 0.0000 0.0000 5.3700 162912.0800 J N J N 234.2000 J 04/21/2022 12:12:44 ABC003 0.0000
18 665661987 " 04/07/2022 00:00:00 04/19/2022 00:00:00 04/07/2022 00:00:00 14 0 0 D J J 78.6500 0.0000 0.0000 1.3400 56249.8400 N J 78.6500 N 04/08/2022 12:32:28 PKDICK1 0.0000
This seems to be a fairly good starting point. The rows all have a consistent number of values. You could also combine it with your regular expression to process the weird delimiters.
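For instance, a sketch of that combination, splitting directly on the full |-| delimiter instead of filtering out the '-' leftovers (this would also survive fields that happen to contain a lone | or -):
import re
import pandas as pd

with open('e_carriagereturn_20220430.dat', encoding='latin-1', newline='') as f_input:
    # Remove every CR, whether stray or part of a CRLF line ending
    data = f_input.read().replace('\r', '')

# Split each remaining line on the full multi-character delimiter
rows = [re.split(r'\|-\|', line) for line in data.split('\n') if line]

df = pd.DataFrame(rows)
print(df)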