Below is an example of a bunch of csv files / datasets I have. They follow the format below.
FILE-START
COL=yes
DELIMITER=|
.....
.....
.....
START-DATA
header1 | header2 | header3 | header4 | header5
data1 | data2 | data3 | data4 | data5
......
......
I need to skip headers until the string START-DATA
because each file has a different number of rows before the data starts.
To load this file I can do pd.read_csv(filename, skiprows=50, delimiter = '|')
, but if I want to do a bulk load of all the files this won't work as the data starts at a different row each time.
How can I make this code dynamic so that it starts when the START-DATA
string appears and takes the following row as a header and uses |
as a delimiter.
I have tried the following code so far,
df = pd.read_csv(downloaded_file, header=None)
df_2 = df.iloc[(df.loc[df[0]=='START-OF-DATA'].index[0] 1):, :].reset_index(drop = True)
I start with a dataframe that looks like this,
0
0 header1 | header2 | header3 | header4 | header5
1 data1 | data2 | data3 | data4 | data5
Where all the data is squeezed into one column. How do I use |
as the delimiter next?
What's the correct and most efficient way to do this dynamically for each file?
CodePudding user response:
I figured it out thanks to the comments, I will post this answer here until someone can figure out a different or more efficient way.
I have added comments into each line of code.
I could not figure out a way to do it during the csv read and currently have to use 2 dataframes achieve what I need to do.
df = pd.read_csv(filename, header=None) #read all data
df_2 = df.iloc[(df.loc[df[0]=='START-DATA'].index[0] 1):, :].reset_index(drop = True) # make the row after found string as the first row
df_2 = df_2[0].str.split('|', expand=True) # split delimiter using string split.
new_header = df_2.iloc[0] # grab the first row for the header
df_2 = df_2[1:] # take the data less the header row
df_2.columns = new_header #rename columns as the previous first row
CodePudding user response:
This might also work. It reads the df and skips a nr of rows
df = pd.read_csv(filename, skip_blank_lines = False)
start = df.loc[df.FILE-START == 'START-DATA'].index[0]
df = pd.read_csv(filename, skiprows = start 2)
CodePudding user response:
You can parse the file with open() search for "START-DATA" and then pass the file-object to pandas.read_csv:
import pandas as pd
with open("FILE-START.txt", "r", encoding="utf-8") as f:
while line := f.readline():
if line.strip() == "START-DATA":
break
df = pd.read_csv(f, delimiter="|")