Splitting columns from a .txt file using Pandas Dataframes-CodePudding

I have a .txt file that looks like this-

Patient ID: 5c
Recording Date: 3/21

Events Included:
SLEEP-MT
SLEEP-REM
SLEEP-S0
SLEEP-S1
SLEEP-S2
SLEEP-S3

Scoring Session:
Scorer Name: DZ


Sleep Stage Time [hh:mm:ss] Event   Duration[s]
SLEEP-S0    23:27:14    SLEEP-S0    30
SLEEP-S0    23:27:44    SLEEP-S0    30
SLEEP-MT    23:28:14    SLEEP-MT    30
SLEEP-S0    23:28:44    SLEEP-S0    30
SLEEP-S0    23:29:14    SLEEP-S0    30
SLEEP-S0    23:29:44    SLEEP-S0    30
SLEEP-S0    23:30:14    SLEEP-S0    30
SLEEP-S0    23:30:44    SLEEP-S0    30

How do I just read/store the columns in dataframes? I only want the 4 columns:

Sleep Stage Time[hh:mm:ss] Event   Duration[s]
SLEEP-S0    23:27:14    SLEEP-S0    30
SLEEP-S0    23:27:44    SLEEP-S0    30
SLEEP-MT    23:28:14    SLEEP-MT    30
SLEEP-S0    23:28:44    SLEEP-S0    30
SLEEP-S0    23:29:14    SLEEP-S0    30
SLEEP-S0    23:29:44    SLEEP-S0    30
SLEEP-S0    23:30:14    SLEEP-S0    30
SLEEP-S0    23:30:44    SLEEP-S0    30

If I can read/store the above with the Patient ID(first line), that would be nice too. Whatever I have tried so far is simply merging all the column values together.

CodePudding user response：

# read the csv and skip the first 13 rows
# using python as engine so we can use regex and combine multiple spaces into single break
df=pd.read_csv(r'c:\text.txt' ,skiprows=13 , sep='\s ', engine='python')

# due to the spaces in the names of 'Sleep Stage' and 'Time[hh:mm:ss]', these splits into separate columns

# take the first 4 columns
df=df.iloc[:,:4]

# rename these
df.columns=[['Sleep Stage','Time[hh:mm:ss]', 'Event','Duration[s]']]
df

Sleep Stage     Time[hh:mm:ss]  Event   Duration[s]
0   SLEEP-S0    23:27:14    SLEEP-S0    30
1   SLEEP-S0    23:27:44    SLEEP-S0    30
2   SLEEP-MT    23:28:14    SLEEP-MT    30
3   SLEEP-S0    23:28:44    SLEEP-S0    30
4   SLEEP-S0    23:29:14    SLEEP-S0    30
5   SLEEP-S0    23:29:44    SLEEP-S0    30
6   SLEEP-S0    23:30:14    SLEEP-S0    30
7   SLEEP-S0    23:30:44    SLEEP-S0    30

alternate solution to capture patientid from line #1

# read in the text file
df2=pd.read_csv(r'c:\text.txt' , header=None) 

# drop the middle lines - not needed per OP needs
df2=df2.drop(index=range(1,12))

# capture the patient id from row 1
df2['patientid'] = df2.iloc[0].str.split(' ',expand=True)[2]


#define the column names
cols= ['Sleep Stage','Time[hh:mm:ss]', 'Event','Duration[s]']

# split the data with whitspaces (\s )
df2[cols]=df2[0].str.split('\s ', expand=True)

# makes column values null in row where we have patient id
df2[cols]=df2[cols].mask(df2['patientid'].notna())

# downfill patient id
df2=df2.drop(columns=0).ffill()
df2

     patientid  Sleep Stage     Time[hh:mm:ss]  Event   Duration[s]
0           5c          NaN                NaN  NaN         NaN
12          5c     SLEEP-S0           23:27:14  SLEEP-S0    30
13          5c     SLEEP-S0           23:27:44  SLEEP-S0    30
14          5c     SLEEP-MT           23:28:14  SLEEP-MT    30
15          5c     SLEEP-S0           23:28:44  SLEEP-S0    30
16          5c     SLEEP-S0           23:29:14  SLEEP-S0    30
17          5c     SLEEP-S0           23:29:44  SLEEP-S0    30
18          5c     SLEEP-S0           23:30:14  SLEEP-S0    30
19          5c     SLEEP-S0           23:30:44  SLEEP-S0    30

CodePudding user response：

# Read Text Files with Pandas using read_csv()

# importing pandas
import pandas as pd

# read text file into pandas DataFrame
df = pd.read_csv("gfg.txt", sep=" ")

# display DataFrame
print(df)