The problem
I have several text files with text on the following format. The number of lines beginning with "CONTOUR" is different for each "OBJECT" block. I've uploaded a sample file to my Github page (LINK). For each section that begins with "OBJECT" I need to extract each line beginning with "CONTOUR" and import these into a pandas dataframe. After processing I aim to have a dataframe with three columns with Object, Points, and Length as headers.
Example code:
OBJECT 1
NAME: MT1(SP1)
3 contours
object uses open contours.
color (red, green, blue) = (0, 1, 0)
CONTOUR #1,1,0 5 points length = 3.07e 006 pm
CONTOUR #2,1,0 6 points length = 3.51e 006 pm
CONTOUR #3,1,0 5 points length = 3.50e 006 pm
OBJECT 2
NAME: MT2(SP3)
4 contours
object uses open contours.
color (red, green, blue) = (0, 1, 1)
CONTOUR #1,2,0 4 points length = 1.86e 006 pm
CONTOUR #2,2,0 4 points length = 2.29e 006 pm
CONTOUR #3,2,0 5 points length = 2.47e 006 pm
CONTOUR #3,2,0 5 points length = 2.47e 006 pm
OBJECT 3
NAME: MT3(SP2)
1 contours
object uses open contours.
color (red, green, blue) = (1, 0, 1)
CONTOUR #1,3,0 6 points length = 2.74e 006 pm
Example result:
Object | Points | Length
1 | 5 | 3.07e 006
1 | 6 | 3.51e 006
1 | 5 | 3.50e 006
2 | 4 | 1.86e 006
2 | 4 | 2.29e 006
2 | 5 | 2.47e 006
2 | 5 | 2.47e 006
3 | 6 | 2.74e 006
What I've tried
Through strip() I've managed to extract one block between "OBJECT 1" and "OBJECT 2". I am now experimenting with using some kind of counter to extrapolate this to the following block (2-3, 3-4 etc.) but am unsure on how to proceed. This approach also carries the problem that the last block will not be read (as it lacks a final "OBJECT").
with open('textfile.txt', 'r') as input, open('new_textfile.txt', 'w') as output:
for line in input:
if line.strip() == "OBJECT 1":
copy = True
continue
elif line.strip() == "OBJECT 2":
copy = False
continue
elif copy:
output.write(line)
input.close()
output.close()
I've also played around using regular expression, but run into the same problem on how to identify and import each block individually into a pandas dataframe.
pattern = 'OBJECT\s\d.*OBJECT\s\d'
match = re.findall(pattern, text, re.DOTALL)
Any help or pointers on how to proceed with this problem is greatly appreciated! Please ask for clarifications if needed.
CodePudding user response:
You can use this code:
with open('test_sample.txt') as datfile:
data = []
for line in datfile:
line = line.strip()
if line.startswith('OBJECT'):
obj_id = int(line.split()[1])
elif line.startswith('CONTOUR'):
args = line.split()
points = int(args[2])
length = float(args[6])
data.append({'Object': obj_id,
'Points': points,
'Length': length})
df = pd.DataFrame(data)
Output (tested with your github sample):
>>> df
Object Points Length
0 1 5 3074020.0
1 1 6 3511060.0
2 1 5 3509020.0
3 1 5 3505450.0
4 1 5 3423030.0
.. ... ... ...
166 16 4 1461300.0
167 16 2 1372990.0
168 16 3 1471150.0
169 16 3 1392340.0
170 16 4 1381150.0
[171 rows x 3 columns]
CodePudding user response:
Try this:
df = pd.DataFrame(columns=["Object", "Points", "Length"])
with open('textfile.txt', 'r') as input, open('new_textfile.txt', 'w') as output:
for line in input:
if "OBJECT" in line.strip():
object_num = int(line.strip()[7:])
if "CONTOUR" in line.strip():
index_points = line.split(" ").index("points")-1
index_length = line.split(" ").index("length") 2
df.loc[len(df)] = {"Object": object_num, "Points": int(line.split(" ")[index_points]),
"Length":"{:.2E}".format(float(line.split(" ")[index_length]))}
input.close()
output.close()
print(df)
Output:
Object Points Length
0 1 5 3.07E 06
1 1 6 3.51E 06
2 1 5 3.50E 06
3 2 4 1.86E 06
4 2 4 2.29E 06
5 2 5 2.47E 06
6 2 5 2.47E 06
7 3 6 2.74E 06