I have a large text file which contains information as follows:
0 / END OF ONE DATA, BEGIN SECOND DATA
361,315,0,'1 ',1,1,1,0,0,2,'NAT1 ',1,1115,1,0,0,0,0,0,0
0.0055501,0.12595,100
1,69,0,100,100,100,1,36,1.1,0.9,1.04283,1.001283,33,0,0,0, /*[name1 ]*/
0.975,138
481,417,0,'1 ',1,1,1,0,0,2,'KAT1 ',1,115,1,0,0,0,0,0,0
0.00762817,0.14163,60
1,69,0,60,60,60,1,48,1.1,0.9,1.011735,0.917735,33,0,0,0, /*[name2 ]*/
0 / END OF SECOND DATA, BEGIN THIRD DATA
I want to get the following in a dataframe:
name1
name2
I tried the following:
import os, pandas as pd
from io import StringIO
fn = r'C:\Users\asdert\Downloads\Network.RAW'
file = open(fn)
line = file.read() # .replace("\n", "$$$$$")
file.close()
start = line.find('END OF ONE DATA, BEGIN SECOND DATA') + 1
end = line.find('END OF SECOND DATA, BEGIN THIRD DATA')
branchData = line[start:end]
df = pd.read_csv(StringIO(branchData), sep=r'\n')
I am not sure how to approach this. Basically, I have to parse the text between /* and */ and ignore lines which don't have /* and */.
CodePudding user response:
You can do away with regex if you have a single name per line:
import pandas as pd

names = []
filepath = "<PATH_TO_YOUR_FILE>"
with open(filepath, 'r') as f:  # open the file for reading
    for line in f:  # read line by line
        start = line.find('/*[')  # index of the '/*[' substring, -1 if absent
        end = line.find(']*/', start + 3)  # index of ']*/', searching from just past '/*['
        if start >= 0 and end >= 0:  # both delimiters were found
            names.append(line[start + 3:end].strip())  # keep the text in between, without padding
df = pd.DataFrame({'names': names})  # build the dataframe from the collected names
>>> df
# => names
# 0 name1
# 1 name2
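If a line can hold more than one name, or you only want the names from one section of the file, a regular expression is probably the simpler route. Below is a minimal sketch, assuming the names always appear as /*[ ... ]*/ and that the section markers are the literal strings from the question. `NAME_PATTERN`, `extract_names`, `begin_marker` and `end_marker` are hypothetical names chosen for illustration.

```python
import re

# Matches /*[ ... ]*/ and captures whatever sits between the brackets.
NAME_PATTERN = re.compile(r"/\*\[(.*?)\]\*/")


def extract_names(text, begin_marker=None, end_marker=None):
    """Return every /*[...]*/ name in text, optionally only between two markers."""
    start = 0
    end = len(text)
    if begin_marker is not None:
        pos = text.find(begin_marker)
        if pos >= 0:  # marker found: start right after it
            start = pos + len(begin_marker)
    if end_marker is not None:
        pos = text.find(end_marker, start)
        if pos >= 0:  # marker found: stop right before it
            end = pos
    # findall returns the captured group for each match; strip the padding
    return [name.strip() for name in NAME_PATTERN.findall(text[start:end])]


# Sample data shortened from the question.
sample = """0 / END OF ONE DATA, BEGIN SECOND DATA
1,69,0,100,100,100,1,36,1.1,0.9,1.04283,1.001283,33,0,0,0, /*[name1 ]*/
0.975,138
1,69,0,60,60,60,1,48,1.1,0.9,1.011735,0.917735,33,0,0,0, /*[name2 ]*/
0 / END OF SECOND DATA, BEGIN THIRD DATA
"""

names = extract_names(
    sample,
    begin_marker="END OF ONE DATA, BEGIN SECOND DATA",
    end_marker="END OF SECOND DATA, BEGIN THIRD DATA",
)
print(names)
```

The resulting list can be passed to pd.DataFrame({'names': names}) in the same way as in the loop version. Note that this sketch reads the whole text into memory, so for a very large file the line-by-line loop may still be the better fit.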