I'm trying to import csv style data from a software designed in Europe into a df for analysis. The data uses two characters to delimit the data in the files, 'DC4' and 'SI' ("Shift In" I believe). I'm currently concatenating the files and delimiting them by the 'DC4' character using read_csv into a df. Then I use a regex line to replace all the 'SI' characters into ';' in the df. I skip every other line in the code to remove the identifiers I don't need next. If I open the data at this point everything is split by the 'DC4' and all 'SI' are converted to ;.
What would you suggest to further split the df by the ; character now? I've tried to split the df by series.string but got type errors. I've exported to csv and reimported it using ; as the delimiter, but it doesn't split the existing columns that were already split with the first import for some reason? I also get parser errors on some rows way down the df so I think there are dirty rows (this is just information I've found. If not helpful please ignore it). I can ignore these lines without affecting the data I need.
The size of the df is around 60-70 columns and usually less than 75K rows when I pull a full report. I'm using PyCharm and Python 3.8. Thank you all for any help on this, I very much appreciate it. Here is my code so far:
path = file directory location
df = pd.concat([pd.read_csv(f, sep='', comment=" ", na_values='Nothing', header=None, index_col=False)
for f in glob.glob(path ".file extension")], ignore_index=True)
df = df.replace('', ';', regex=True)
df = df.iloc[::2]
df.to_csv(r'new_file_location', index=False, encoding='utf-8-sig')
CodePudding user response:
So you have a CSV (technically not a CSV I guess) that's separated by two different values (DC4 and SI) and you want to read it into a dataframe?
You can do so directly with pandas, the read_csv
function allows you to specify regex delimiters, so you could use "\x0e|\x14"
and use either DC4 or SI as selarator: pd.read_csv(path, sep="\x0e|\x14")
An example with readable characters:
The csv contains:
col1,col2;col3
val1,val2,val3
val4;val5;val6
Which can be read as follows:
import pandas as pd
df = pd.read_csv(path, sep=",|;")
which results in df
being:
col1 col2 col3
0 val1 val2 val3
1 val4 val5 val6