Home > Blockchain >  Extracting specific patterns from text file and importing into dataframe
Extracting specific patterns from text file and importing into dataframe

Time:11-30

The problem

I have several text files with text on the following format. The number of lines beginning with "CONTOUR" is different for each "OBJECT" block. I've uploaded a sample file to my Github page (LINK). For each section that begins with "OBJECT" I need to extract each line beginning with "CONTOUR" and import these into a pandas dataframe. After processing I aim to have a dataframe with three columns with Object, Points, and Length as headers.

Example code:

OBJECT 1
NAME:  MT1(SP1)
       3 contours
       object uses open contours.
       color (red, green, blue) = (0, 1, 0)

    CONTOUR #1,1,0  5 points    length = 3.07e 006 pm
    CONTOUR #2,1,0  6 points    length = 3.51e 006 pm
    CONTOUR #3,1,0  5 points    length = 3.50e 006 pm

OBJECT 2
NAME:  MT2(SP3)
       4 contours
       object uses open contours.
       color (red, green, blue) = (0, 1, 1)

    CONTOUR #1,2,0  4 points    length = 1.86e 006 pm
    CONTOUR #2,2,0  4 points    length = 2.29e 006 pm
    CONTOUR #3,2,0  5 points    length = 2.47e 006 pm
    CONTOUR #3,2,0  5 points    length = 2.47e 006 pm

OBJECT 3
NAME:  MT3(SP2)
       1 contours
       object uses open contours.
       color (red, green, blue) = (1, 0, 1)

    CONTOUR #1,3,0  6 points    length = 2.74e 006 pm

Example result:

Object | Points | Length
1 | 5 | 3.07e 006   
1 | 6 | 3.51e 006
1 | 5 | 3.50e 006
2 | 4 | 1.86e 006
2 | 4 | 2.29e 006
2 | 5 | 2.47e 006
2 | 5 | 2.47e 006
3 | 6 | 2.74e 006

What I've tried

Through strip() I've managed to extract one block between "OBJECT 1" and "OBJECT 2". I am now experimenting with using some kind of counter to extrapolate this to the following block (2-3, 3-4 etc.) but am unsure on how to proceed. This approach also carries the problem that the last block will not be read (as it lacks a final "OBJECT").

with open('textfile.txt', 'r') as input, open('new_textfile.txt', 'w') as output:
  for line in input:
        if line.strip() == "OBJECT 1":
            copy = True
            continue
        elif line.strip() == "OBJECT 2":
            copy = False
            continue
        elif copy:
            output.write(line)
            
input.close()
output.close()

I've also played around using regular expression, but run into the same problem on how to identify and import each block individually into a pandas dataframe.

pattern = 'OBJECT\s\d.*OBJECT\s\d'
match = re.findall(pattern, text, re.DOTALL)

Any help or pointers on how to proceed with this problem is greatly appreciated! Please ask for clarifications if needed.

CodePudding user response:

You can use this code:

with open('test_sample.txt') as datfile:
    data = []
    for line in datfile:
        line = line.strip()

        if line.startswith('OBJECT'):
            obj_id = int(line.split()[1])

        elif line.startswith('CONTOUR'):
            args = line.split()
            points = int(args[2])
            length = float(args[6])
            data.append({'Object': obj_id,
                         'Points': points,
                         'Length': length})

df = pd.DataFrame(data)

Output (tested with your github sample):

>>> df
     Object  Points     Length
0         1       5  3074020.0
1         1       6  3511060.0
2         1       5  3509020.0
3         1       5  3505450.0
4         1       5  3423030.0
..      ...     ...        ...
166      16       4  1461300.0
167      16       2  1372990.0
168      16       3  1471150.0
169      16       3  1392340.0
170      16       4  1381150.0

[171 rows x 3 columns]

CodePudding user response:

Try this:

df = pd.DataFrame(columns=["Object", "Points", "Length"])
with open('textfile.txt', 'r') as input, open('new_textfile.txt', 'w') as output:
    for line in input:
        if "OBJECT" in line.strip():
            object_num = int(line.strip()[7:])

        if "CONTOUR" in line.strip():
            index_points = line.split(" ").index("points")-1
            index_length = line.split(" ").index("length") 2
            df.loc[len(df)] = {"Object": object_num, "Points": int(line.split(" ")[index_points]), 
                            "Length":"{:.2E}".format(float(line.split(" ")[index_length]))}
            
input.close()
output.close()

print(df)

Output:

  Object Points    Length
0      1      5  3.07E 06
1      1      6  3.51E 06
2      1      5  3.50E 06
3      2      4  1.86E 06
4      2      4  2.29E 06
5      2      5  2.47E 06
6      2      5  2.47E 06
7      3      6  2.74E 06
  • Related