I have a .txt file that looks like this-
Patient ID: 5c
Recording Date: 3/21
Events Included:
SLEEP-MT
SLEEP-REM
SLEEP-S0
SLEEP-S1
SLEEP-S2
SLEEP-S3
Scoring Session:
Scorer Name: DZ
Sleep Stage Time [hh:mm:ss] Event Duration[s]
SLEEP-S0 23:27:14 SLEEP-S0 30
SLEEP-S0 23:27:44 SLEEP-S0 30
SLEEP-MT 23:28:14 SLEEP-MT 30
SLEEP-S0 23:28:44 SLEEP-S0 30
SLEEP-S0 23:29:14 SLEEP-S0 30
SLEEP-S0 23:29:44 SLEEP-S0 30
SLEEP-S0 23:30:14 SLEEP-S0 30
SLEEP-S0 23:30:44 SLEEP-S0 30
How do I just read/store the columns in dataframes? I only want the 4 columns:
Sleep Stage Time[hh:mm:ss] Event Duration[s]
SLEEP-S0 23:27:14 SLEEP-S0 30
SLEEP-S0 23:27:44 SLEEP-S0 30
SLEEP-MT 23:28:14 SLEEP-MT 30
SLEEP-S0 23:28:44 SLEEP-S0 30
SLEEP-S0 23:29:14 SLEEP-S0 30
SLEEP-S0 23:29:44 SLEEP-S0 30
SLEEP-S0 23:30:14 SLEEP-S0 30
SLEEP-S0 23:30:44 SLEEP-S0 30
If I can read/store the above with the Patient ID(first line), that would be nice too. Whatever I have tried so far is simply merging all the column values together.
CodePudding user response:
# read the csv and skip the first 13 rows
# using python as engine so we can use regex and combine multiple spaces into single break
df=pd.read_csv(r'c:\text.txt' ,skiprows=13 , sep='\s ', engine='python')
# due to the spaces in the names of 'Sleep Stage' and 'Time[hh:mm:ss]', these splits into separate columns
# take the first 4 columns
df=df.iloc[:,:4]
# rename these
df.columns=[['Sleep Stage','Time[hh:mm:ss]', 'Event','Duration[s]']]
df
Sleep Stage Time[hh:mm:ss] Event Duration[s]
0 SLEEP-S0 23:27:14 SLEEP-S0 30
1 SLEEP-S0 23:27:44 SLEEP-S0 30
2 SLEEP-MT 23:28:14 SLEEP-MT 30
3 SLEEP-S0 23:28:44 SLEEP-S0 30
4 SLEEP-S0 23:29:14 SLEEP-S0 30
5 SLEEP-S0 23:29:44 SLEEP-S0 30
6 SLEEP-S0 23:30:14 SLEEP-S0 30
7 SLEEP-S0 23:30:44 SLEEP-S0 30
alternate solution to capture patientid from line #1
# read in the text file
df2=pd.read_csv(r'c:\text.txt' , header=None)
# drop the middle lines - not needed per OP needs
df2=df2.drop(index=range(1,12))
# capture the patient id from row 1
df2['patientid'] = df2.iloc[0].str.split(' ',expand=True)[2]
#define the column names
cols= ['Sleep Stage','Time[hh:mm:ss]', 'Event','Duration[s]']
# split the data with whitspaces (\s )
df2[cols]=df2[0].str.split('\s ', expand=True)
# makes column values null in row where we have patient id
df2[cols]=df2[cols].mask(df2['patientid'].notna())
# downfill patient id
df2=df2.drop(columns=0).ffill()
df2
patientid Sleep Stage Time[hh:mm:ss] Event Duration[s]
0 5c NaN NaN NaN NaN
12 5c SLEEP-S0 23:27:14 SLEEP-S0 30
13 5c SLEEP-S0 23:27:44 SLEEP-S0 30
14 5c SLEEP-MT 23:28:14 SLEEP-MT 30
15 5c SLEEP-S0 23:28:44 SLEEP-S0 30
16 5c SLEEP-S0 23:29:14 SLEEP-S0 30
17 5c SLEEP-S0 23:29:44 SLEEP-S0 30
18 5c SLEEP-S0 23:30:14 SLEEP-S0 30
19 5c SLEEP-S0 23:30:44 SLEEP-S0 30
CodePudding user response:
# Read Text Files with Pandas using read_csv()
# importing pandas
import pandas as pd
# read text file into pandas DataFrame
df = pd.read_csv("gfg.txt", sep=" ")
# display DataFrame
print(df)