How to use 'Shift In' from text in csv file to split columns-CodePudding

I'm trying to import csv style data from a software designed in Europe into a df for analysis. The data uses two characters to delimit the data in the files, 'DC4' and 'SI' ("Shift In" I believe). I'm currently concatenating the files and delimiting them by the 'DC4' character using read_csv into a df. Then I use a regex line to replace all the 'SI' characters into ';' in the df. I skip every other line in the code to remove the identifiers I don't need next. If I open the data at this point everything is split by the 'DC4' and all 'SI' are converted to ;.

What would you suggest to further split the df by the ; character now? I've tried to split the df by series.string but got type errors. I've exported to csv and reimported it using ; as the delimiter, but it doesn't split the existing columns that were already split with the first import for some reason? I also get parser errors on some rows way down the df so I think there are dirty rows (this is just information I've found. If not helpful please ignore it). I can ignore these lines without affecting the data I need.

The size of the df is around 60-70 columns and usually less than 75K rows when I pull a full report. I'm using PyCharm and Python 3.8. Thank you all for any help on this, I very much appreciate it. Here is my code so far:

path = file directory location

df = pd.concat([pd.read_csv(f, sep='', comment=" ", na_values='Nothing', header=None, index_col=False)
                   for f in glob.glob(path   ".file extension")], ignore_index=True)

df = df.replace('', ';', regex=True)

df = df.iloc[::2]

df.to_csv(r'new_file_location', index=False, encoding='utf-8-sig')

CodePudding user response：

So you have a CSV (technically not a CSV I guess) that's separated by two different values (DC4 and SI) and you want to read it into a dataframe?

You can do so directly with pandas, the read_csv function allows you to specify regex delimiters, so you could use "\x0e|\x14" and use either DC4 or SI as selarator: pd.read_csv(path, sep="\x0e|\x14")

An example with readable characters:

The csv contains:

col1,col2;col3
val1,val2,val3
val4;val5;val6

Which can be read as follows:

import pandas as pd

df = pd.read_csv(path, sep=",|;")

which results in df being:

   col1  col2  col3
0  val1  val2  val3
1  val4  val5  val6