Reading csv file into pandas dataframe from certain string to certain string


I have a .csv file with data from different runs. So, my file looks like:

execution on Monday
name1, name2, name3
1,2,3
4,5,6
end Monday

execution on Tuesday
name1, name2, name3
7,8,9
10,11,12
end Tuesday

The number of rows per run is not known in advance.

I am looking for an efficient way to read only the rows from a specific run into a pandas DataFrame. Right now I read the file, compare each line with the search string, and record the start and end line numbers. Then I call pandas.read_csv with skiprows to start from the found start line, and afterwards delete everything below the end line.

Edit: The output I wish for would be a DataFrame that looks like:

name1 name2 name3
7  8  9
10 11 12

My current code is:

import pandas as pd

filename="test.csv"
durchgang="Tuesday"

def findstart(filename, durchgang):
    start = 0
    with open(filename, "r") as file:
        for line in file:
            start += 1
            if line[len("execution on "):len("execution on ") + len(durchgang)] == durchgang:
                return line, start

def findend(filename, durchgang):
    end = 0
    with open(filename, "r") as file:
        for line in file:
            end += 1
            if line[0:len("end ") + len(durchgang)] == "end " + durchgang:
                return line, end

start = findstart(filename, durchgang)[1]
end = findend(filename, durchgang)[1]

# skiprows=start skips up to and including the "execution on ..." line,
# so the header row is the first remaining line
df = pd.read_csv(filename, header=0, skiprows=start)
# the "end ..." marker sits at dataframe position end - start - 2
df = df.drop(index=range(end - start - 2, len(df.index)))

Both deleting the unused rows and searching through the file first seem very inelegant. Is there a better solution that does not go through the file twice? I saw people using a query condition to filter, but that requires an index.

CodePudding user response:

Not really sure if this is what you are looking for, but you could consider adding a MultiIndex to the DataFrame you are storing, with index levels like ['run_id', 'usual_id']. For each run you would generate a unique index, and when you wish to store your results you append the current run's DataFrame to the existing csv file following this post. This gives you a pretty convenient way of looking through your runs, since you can load them all into a single DataFrame during analysis.
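A minimal sketch of that idea (the level names 'run_id' and 'usual_id' and the in-memory buffer are illustrative; in practice you would append to the csv file on disk):

```python
import io
import pandas as pd

# Hypothetical per-run results, keyed by a unique run identifier
runs = {
    "Monday": pd.DataFrame({"name1": [1, 4], "name2": [2, 5], "name3": [3, 6]}),
    "Tuesday": pd.DataFrame({"name1": [7, 10], "name2": [8, 11], "name3": [9, 12]}),
}

buffer = io.StringIO()  # stands in for the growing csv file on disk
for i, (run_id, frame) in enumerate(runs.items()):
    # tag every row with its run id and keep the original row number
    tagged = frame.rename_axis("usual_id").reset_index().assign(run_id=run_id)
    tagged.to_csv(buffer, index=False, header=(i == 0))  # header only on first append

buffer.seek(0)
# load everything back with a ['run_id', 'usual_id'] MultiIndex
all_runs = pd.read_csv(buffer).set_index(["run_id", "usual_id"])
print(all_runs.loc["Tuesday"])  # select a single run
```

Selecting `all_runs.loc["Tuesday"]` then gives exactly the per-run DataFrame from the question, without re-parsing the file for each run.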

CodePudding user response:

It seems you have an empty line between runs, so you can use that fact.

You can also use the fact that pandas can read from a file-like object.

Read the whole file as a single string and split it on empty lines:

runs = text.split("\n\n")

then use the selected part with the io module to create a file-like object in memory

file_like_object = io.StringIO(selected_text)

and read from this object

pd.read_csv(file_like_object, ...)

Minimal working code

I put the example text directly in the code (so everyone can copy and test it), but you should use open() and text = f.read().

text = '''execution on Monday
name1, name2, name3
1,2,3
4,5,6
end Monday

execution on Tuesday
name1, name2, name3
7,8,9
10,11,12
end Tuesday

execution on Tuesday
name1, name2, name3
107,108,109
110,111,112
end Tuesday'''

import io
import pandas as pd

# read all as single string
#with open("input.csv") as f:
#    text = f.read()

# split on empty lines
runs = text.split('\n\n')

# get selected `run` 
selected = runs[1]

# create file-like-object (file in memory)
file_object = io.StringIO(selected)

# read from file-like-object
# skipfooter is only supported by the python parser engine
df = pd.read_csv(file_object, skiprows=1, skipfooter=1, engine="python")

print(df)