I'm working on a Python program that needs to read CSV files produced on a Windows 2012 server. The aim of the Python code is to report the min/max of each column in the data, for data quality reasons, so I cannot change the source data.
The delimiter of the data file is |-|. Yes, it's a multi-character delimiter. The data is encoded as latin-1. And recently I found out some of the columns can contain carriage returns, which I also have to deal with in my program.
Here is an example of the data:
123456781|-|!|-|04/08/2022 00:00:00|-|04/22/2022 00:00:00|-|04/08/2022 00:00:00|-|2|-|987654|-|12|-|12345|-|D|-|J|-|N|-|12.3400|-|0.0000|-|0.0000|-|1.2300|-|12345.6700|-|J|-|N|-|J|-|N|-||-|23.4500|-|J|-|04/06/2022 11:00:00|-|ABC001 |-||-||-|0.0000
123456782|-|"|-|04/19/2022 00:00:00|-|05/02/2022 00:00:00|-|04/19/2022 00:00:00|-|2|-|456789|-|12|-|
98765|-|D|-|J|-|N|-|6.7800|-|0.0000|-|0.0000|-|6.7800|-|12345.6700|-|J|-| |-|J|-| |-||-|6.7800|-|J|-|04/19/2022 10:00:00|-|JOHNDOE|-||-||-|0.0000
123456785|-|"|-|04/17/2022 00:00:00|-|05/01/2022 00:00:00|-|04/17/2022 00:00:00|-|11|-|0|-|0|-||-|D|-|N|-|N|-|0.0000|-|0.0000|-|0.0000|-|8.7600|-|54321.1200|-|N|-|N|-|J|-|N|-||-|0.0000|-|N|-|04/15/2022 13:45:00|-|ABC001 |-||-||-|0.0000
123456787|-|'|-|04/20/2022 00:00:00|-|05/03/2022 00:00:00|-|04/19/2022 00:00:00|-|2|-|345678|-|12|-|87654|-|D|-|J|-|N|-|4.7800|-|0.0000|-|0.0000|-|4.7800|-|34567.8900|-|J|-| |-|J|-| |-||-|4.7800|-|J|-|04/20/2022 10:00:00|-|DEF003|-||-||-|0.0000
You cannot see it here, but in the line that starts with 123456782 there is a carriage return after |-|12|-| and before 98765|-|D. In Notepad with "Show End of Line" symbols it shows a CR between these values, and all the lines end with CRLF.
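To double-check this outside Notepad, a quick byte-level count confirms the stray CRs (just a sketch, using the same file path as in the code below):
import os  # not needed, only open() and bytes methods are used

# Sketch: count CRLF line endings vs. bare CRs in the raw bytes
with open(r"D:\Data\file_20220412.dat", "rb") as f:
    raw = f.read()
crlf = raw.count(b"\r\n")
bare_cr = raw.count(b"\r") - crlf
print(crlf, "CRLF line endings,", bare_cr, "stray CRs inside fields")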
When I read this data with Pandas read_csv, it regards the second half of the second line as a new line. I've tried different settings of the lineterminator argument. No dice. lineterminator="\n", lineterminator="\r" and lineterminator="\r\n" all produce the same results.
Example Pandas read_csv statement I've used:
import pandas as pd
import re

csvfile = r"D:\Data\file_20220412.dat"  # raw string, otherwise \f in the path becomes a form feed
separator = "|-|"
separator = re.escape(separator)  # a multi-char sep is treated as a regex, so escape the pipes
df = pd.read_csv(csvfile, sep=separator, header="infer", encoding="latin-1", lineterminator="\n")
So colleagues have suggested using csv.reader instead. First problem with that: no multi-character delimiters allowed! But we can work around that with a replace (the file has to be opened first; iterating over the file name string itself would yield single characters):
import csv

with open(csvfile, encoding="latin-1") as f:
    # Replace the multi-char delimiter on the fly, then parse as ordinary CSV
    rows = list(csv.reader((line.replace("|-|", ",") for line in f), delimiter=","))
Then weird stuff happens. It reads some of the data correctly into 29 columns, but some rows are collapsed into 2 columns. As you can see, some of the rows contain a double quote in the second column. When you read the data with csv.reader and then write it to an output file, it turns out extra double quotes are added, which makes the rest of the code regard it as 2 columns.
"123456785|-|""|-|04/17/2022 00:00:00|-|05/01/2022 00:00:00|-|04/17/2022 00:00:00|-|11|-|0|-|0|-||-|D|-|N|-|N|-|0.0000|-|0.0000|-|0.0000|-|8.7600|-|54321.1200|-|N|-|N|-|J|-|N|-||-|0.0000|-|N|-|04/15/2022 13:45:00|-|ABC001 |-||-||-|0.0000"
I really hope I don't need some kind of regex solution to undo the extra double quotes csv.reader introduces.
I'm looking for two possible solutions:
- (Which I prefer) Make Pandas read_csv stop interpreting the carriage return as a line break.
- Make csv.reader stop adding extra double quotes.
Any ideas?
CodePudding user response:
I've found a possible solution. I managed to open the file in Python without any CSV treatment and replace the carriage returns myself. The newline="\n" made it possible to pick up the \r characters next to delimiters.
separator="|-|"
data = open(csvfile, "r", encoding="latin-1", newline="\n")
data = data.read()
data = re.sub(r"\|\-\|[\r]", separator, data)
data = re.sub('\|\-\|', ',', data)
print(data)
Now I can feed this string to read_csv and it should be treated as a regular CSV file, something like the sketch below.
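A minimal sketch (assuming the file's first line is a header row; otherwise pass header=None):
import io
import pandas as pd

# Feed the cleaned string to read_csv through an in-memory buffer
df = pd.read_csv(io.StringIO(data), header=0)

# The actual goal: min/max per column for the data quality check
print(df.min())
print(df.max())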
Maybe I don't even have to scan for carriage returns next to delimiters (because now I'm realizing they won't always be there in that column).
I'll try more next week.
CodePudding user response:
You could try using the newline='' mode to keep the newline characters; these can then be removed easily. Secondly, use quoting=csv.QUOTE_NONE to disable quote processing. Lastly, remove any columns seen with just -.
import pandas as pd
import io
import csv

rows = []

with open('e_carriagereturn_20220430.dat', encoding='latin-1', newline='') as f_input:
    # Remove every CR: the stray ones inside fields and the ones in the CRLF endings
    data = f_input.read().replace('\r', '')

# Disable quote handling so the bare " and ' fields come through untouched
csv_input = csv.reader(io.StringIO(data), delimiter='|', quoting=csv.QUOTE_NONE)
for row in csv_input:
    # Splitting '|-|' on '|' leaves '-' fillers between real values; drop them
    rows.append([value for value in row if value != '-'])

df = pd.DataFrame(rows)
print(df)
Giving:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
0 752296019 " 04/15/2022 00:00:00 04/28/2022 00:00:00 04/15/2022 00:00:00 13 0 0 A J J 0.0000 0.0000 0.0000 1.2300 123456.2700 J J -23.4500 N 04/19/2022 12:00:41 AEINSTEIN1 0.0000
1 752296020 " 03/31/2022 00:00:00 04/13/2022 00:00:00 03/31/2022 00:00:00 1 359542 12 318047.01 A J J 543.2100 0.0000 0.0000 32.1000 244680.4400 J J 543.2100 J 04/01/2022 12:44:42 PKDICK1 0.0000
2 752296032 ! 04/08/2022 00:00:00 04/22/2022 00:00:00 04/08/2022 00:00:00 2 222856 12 54321 A J N 26.8700 0.0000 0.0000 1.2800 38068.8800 J N J N 26.8700 J 04/06/2022 12:00:32 ABC003 0.0000
3 752296044 " 04/19/2022 00:00:00 05/02/2022 00:00:00 04/19/2022 00:00:00 2 222857 12 34877 D J N 6.7800 0.0000 0.0000 6.7800 122345.3500 J J 6.7800 J 04/19/2022 12:00:49 WGIBSON 0.0000
4 752296098 ! 04/17/2022 00:00:00 05/01/2022 00:00:00 04/17/2022 00:00:00 13 0 0 D N N 0.0000 0.0000 0.0000 8.7000 79689.4800 N N J N 0.0000 N 04/15/2022 12:24:58 ABC003 0.0000
5 431807560 " 04/12/2022 00:00:00 04/21/2022 00:00:00 04/12/2022 00:00:00 5 0 0 D J N 16.9600 0.0000 0.0000 0.8500 10919.6900 J J 16.7800 N 04/13/2022 14:49:44 FHERBERT 0.0000
6 431807563 ! 04/17/2022 00:00:00 05/01/2022 00:00:00 04/17/2022 00:00:00 11 0 0 D N N 0.0000 0.0000 0.0000 2.6700 31790.1600 N N J N 0.0000 N 04/15/2022 12:44:56 ABC003 0.0000
7 431807594 " 03/28/2022 00:00:00 04/11/2022 00:00:00 03/28/2022 00:00:00 1 580807 12 12345AB12345AB D J J 193.8200 0.0000 0.0000 19.3800 276921.4800 J J 193.8200 J 03/29/2022 12:00:38 WGIBSON 0.0000
8 431807597 " 04/19/2022 00:00:00 05/02/2022 00:00:00 04/19/2022 00:00:00 1 107348 12 12.45671/AB D J J 6.7800 0.0000 0.0000 6.7800 87133.8200 J J 6.7800 J 04/15/2022 12:22:35 UKLEGUIN 0.0000
9 679785779 " 03/18/2022 00:00:00 04/01/2022 00:00:00 03/18/2022 00:00:00 13 0 0 B N N 0.0000 0.0000 0.0000 9.3300 142940.7700 N N J N 0.0000 N 04/20/2022 08:04:02 AHUXLEY 0.0000
10 679785789 ! 04/15/2022 00:00:00 04/29/2022 00:00:00 04/15/2022 00:00:00 2 4876321 12 488250/CD D J N 876.5800 0.0000 0.0000 16.7800 200604.8900 J N J N 876.5400 J 04/13/2022 12:28:49 ABC003 0.0000
11 665661904 ! 04/15/2022 00:00:00 04/29/2022 00:00:00 04/15/2022 00:00:00 2 394132 12 46409 EF D J N 567.9800 0.0000 0.0000 9.1600 513561.4600 J N J N 567.8700 J 04/13/2022 12:24:37 ABC003 0.0000
12 665661909 " 03/25/2022 00:00:00 04/01/2022 00:00:00 03/25/2022 00:00:00 14 216308 12 97745894XY D J J 0.0000 0.0000 0.0000 11.4500 208666.1300 J J 0.0000 J 03/25/2022 12:25:03 FHERBERT 0.0000
13 665661934 ! 04/19/2022 00:00:00 05/02/2022 00:00:00 04/19/2022 00:00:00 2 627911 12 abc/21.4177 D J N 54.3200 0.0000 0.0000 23.4500 333689.0000 J N J N 54.3200 J 04/14/2022 23:15:20 ABC003 0.0000
14 665661945 ! 03/25/2022 00:00:00 04/07/2022 00:00:00 03/25/2022 00:00:00 1 3074312 12 923088/ABC D J J 199.2600 0.0000 0.0000 14.5600 850785.1500 J N J N 189.0120 J 03/25/2022 11:48:55 ABC003 0.0000
15 665661965 ! 04/22/2022 00:00:00 05/06/2022 00:00:00 04/22/2022 00:00:00 1 627921 12 27160 D J J 567.3400 0.0000 0.0000 45.6800 2252133.2900 J N J N 567.3400 J 04/20/2022 12:43:09 ABC003 0.0000
16 665661976 ! 04/22/2022 00:00:00 05/06/2022 00:00:00 04/22/2022 00:00:00 2 627942 12 1734793zy D J N 223.4800 0.0000 0.0000 23.4500 416715.9100 J J 234.5600 J 04/21/2022 12:04:19 ABC003 0.0000
17 665661978 ! 04/29/2022 00:00:00 05/13/2022 00:00:00 04/29/2022 00:00:00 2 627998 12 44524 fg D J N 226.3000 0.0000 0.0000 5.3700 162912.0800 J N J N 234.2000 J 04/21/2022 12:12:44 ABC003 0.0000
18 665661987 " 04/07/2022 00:00:00 04/19/2022 00:00:00 04/07/2022 00:00:00 14 0 0 D J J 78.6500 0.0000 0.0000 1.3400 56249.8400 N J 78.6500 N 04/08/2022 12:32:28 PKDICK1 0.0000
This seems to be a fairly good starting point. The rows all have a consistent number of values. You could also combine it with your regular expression to process the weird delimiters.
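For instance, a sketch of that combination, splitting directly on the full |-| delimiter instead of filtering out the '-' leftovers (this would also survive fields that happen to contain a lone | or -):
import re
import pandas as pd

with open('e_carriagereturn_20220430.dat', encoding='latin-1', newline='') as f_input:
    # Remove every CR, whether stray or part of a CRLF line ending
    data = f_input.read().replace('\r', '')

# Split each remaining line on the full multi-character delimiter
rows = [re.split(r'\|-\|', line) for line in data.split('\n') if line]

df = pd.DataFrame(rows)
print(df)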