Home > Back-end >  How to read csv file into pandas, skipping rows until a certain string, then selecting first row aft
How to read csv file into pandas, skipping rows until a certain string, then selecting first row aft

Time:12-21

Below is an example of a bunch of csv files / datasets I have. They follow the format below.

FILE-START
COL=yes
DELIMITER=|
.....
.....
.....

START-DATA
header1 | header2 | header3 | header4 | header5
data1   | data2   | data3   | data4   | data5
......
......

I need to skip headers until the string START-DATA because each file has a different number of rows before the data starts.

To load this file I can do pd.read_csv(filename, skiprows=50, delimiter = '|'), but if I want to do a bulk load of all the files this won't work as the data starts at a different row each time.

How can I make this code dynamic so that it starts when the START-DATA string appears and takes the following row as a header and uses | as a delimiter.

I have tried the following code so far,

df = pd.read_csv(downloaded_file, header=None)
df_2 = df.iloc[(df.loc[df[0]=='START-OF-DATA'].index[0] 1):, :].reset_index(drop = True)

I start with a dataframe that looks like this,

                                             0
0 header1 | header2 | header3 | header4 | header5
1 data1   | data2   | data3   | data4   | data5

Where all the data is squeezed into one column. How do I use | as the delimiter next?

What's the correct and most efficient way to do this dynamically for each file?

CodePudding user response:

I figured it out thanks to the comments, I will post this answer here until someone can figure out a different or more efficient way.

I have added comments into each line of code.

I could not figure out a way to do it during the csv read and currently have to use 2 dataframes achieve what I need to do.

df = pd.read_csv(filename, header=None) #read all data
df_2 = df.iloc[(df.loc[df[0]=='START-DATA'].index[0] 1):, :].reset_index(drop = True) # make the row after found string as the first row
df_2 = df_2[0].str.split('|', expand=True) # split delimiter using string split.
new_header = df_2.iloc[0] # grab the first row for the header
df_2 = df_2[1:] # take the data less the header row
df_2.columns = new_header #rename columns as the previous first row

CodePudding user response:

This might also work. It reads the df and skips a nr of rows

df = pd.read_csv(filename, skip_blank_lines = False)
start =  df.loc[df.FILE-START == 'START-DATA'].index[0]
df = pd.read_csv(filename, skiprows = start   2)

CodePudding user response:

You can parse the file with open() search for "START-DATA" and then pass the file-object to pandas.read_csv:

import pandas as pd

with open("FILE-START.txt", "r", encoding="utf-8") as f:
    while line := f.readline():
        if line.strip() == "START-DATA":
            break
    
    df = pd.read_csv(f, delimiter="|")
  • Related