Pandas read_csv() from text file where data begin/end are marked by specific strings


I'm reading hundreds of model outputs from text files, where the first N rows are irrelevant text about the model run (note: N varies from file to file). However, each file also contains comma-separated data that I want to import. In every file this data sits between the lines "BREAK THROUGH @ WT, ITERATION" and "END BREAK THROUGH @ WT" (see below). My current approach of using nrows and skiprows in read_csv() doesn't work because these values vary from file to file. Any thoughts on how I can import the CSV data from these text files using string "markers"? Thanks!
The model output/input files I want to read look like:
text
text 0.314347435514229
text text text text text text text
BREAK THROUGH @ WT, ITERATION
1 0.0
3 0.0
6 0.0
END BREAK THROUGH @ WT
The extracted data in the dataframe would look like
1 0.0
3 0.0
6 0.0
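One way to keep using skiprows/nrows is to locate the two marker lines first and compute both values per file before calling read_csv(). A rough sketch of that idea, assuming the data rows are whitespace-separated as in the sample; the helper name and the "iteration"/"value" column names are made up for illustration:

import pandas as pd

BEGIN = "BREAK THROUGH @ WT, ITERATION"
END = "END BREAK THROUGH @ WT"

def read_breakthrough(path):
    # Find the line numbers of the two markers so skiprows/nrows can be
    # computed per file instead of hard-coded.
    with open(path) as fh:
        lines = fh.readlines()
    start = next(i for i, line in enumerate(lines) if line.startswith(BEGIN))
    stop = next(i for i, line in enumerate(lines) if line.startswith(END))
    # The data rows sit strictly between the two marker lines.
    return pd.read_csv(
        path,
        skiprows=start + 1,
        nrows=stop - start - 1,
        header=None,
        sep=r"\s+",  # sample rows look whitespace-separated; use "," if not
        names=["iteration", "value"],  # hypothetical column names
    )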

CodePudding user response:

Using fake data with a column named "your_column":

words = ["BREAK THROUGH @ WT, ITERATION", "END BREAK THROUGH @ WT"]

df = pd.read_csv(...)
df = df.loc[df["your_column"].isin(words).cumsum() & ~df["your_column"].isin(words)].reset_index(drop=True)
print(df)
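One possible way to wire this up against a raw model output file, loading each line into the single "your_column" column first; the file name and the final str.split step are assumptions, not part of the original answer:

import pandas as pd

words = ["BREAK THROUGH @ WT, ITERATION", "END BREAK THROUGH @ WT"]

# hypothetical file name; put every raw line of the model output into one column
with open("model_output.txt") as fh:
    df = pd.DataFrame({"your_column": fh.read().splitlines()})

marker = df["your_column"].isin(words)
between = df.loc[(marker.cumsum() % 2 == 1) & ~marker, "your_column"]

# split each kept line into its own columns (whitespace-separated in the sample)
result = between.str.split(expand=True).reset_index(drop=True)
print(result)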

CodePudding user response:

It seems I was able to find a solution without regex, but I'm still curious how regex could have simplified my life.

beg_id = "BREAK THROUGH @ WT, ITERATION =     1\n"
end_id = "END BREAK THROUGH @ WT"
# for f in cmtp_fnames:
f = 'data/cmtp/PFOS_Dry_LS_1m_AD R.OUT'
with open(f) as fname:
    data = fname.read()
    data = data[data.find(beg_id):]
    data = data[data.find(beg_id) len(beg_id):data.find(end_id)]
    data=data.splitlines(False)
    data=pd.DataFrame(sub.split(",") for sub in data).drop(labels=2,axis=1)
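As for the regex curiosity: a rough sketch of how the same extraction might look with re.search and an in-memory buffer. The pattern, the reuse of the file path above, and the default comma separator are assumptions, so adjust them to the real model output format:

import io
import re

import pandas as pd

beg_id = "BREAK THROUGH @ WT, ITERATION"
end_id = "END BREAK THROUGH @ WT"

with open('data/cmtp/PFOS_Dry_LS_1m_AD R.OUT') as fh:
    text = fh.read()

# re.DOTALL lets ".*?" cross newlines; the non-greedy groups stop at the first
# end marker, so only the block between the two markers is captured
pattern = re.escape(beg_id) + r".*?\n(.*?)" + re.escape(end_id)
match = re.search(pattern, text, re.DOTALL)

if match:
    # feed the captured block straight to read_csv; adjust sep= to the real delimiter
    df = pd.read_csv(io.StringIO(match.group(1)), header=None)
    print(df)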