regex for data preparation and processing afterwards in python

Time:10-27

I have quite a big file of data that is not in a good state for further processing. I want to extract the useful parts with a regex and then process the data in pandas for further analysis.

The Data-Information segment repeats itself within the file and contains the necessary information.

My approach so far was a regex that gets some header information out of it. What I'm still missing are the three sections of data points. I only need everything from the Points header line to the last data point. How could I capture these sections in one or multiple groups?

^(?:Data-Information.*)
(?:\nName:\t+)(?P<Name>.+)
(?:\nSample:\t+)(?P<Sample>.+)
((?:\r?\n.+)+)
(?:\nSystem:\t+)(?P<System>.+)
(?:\r?\n(?!Data-Information).*)*

Sample file

Data-Information
Name:           Polymer A
Sample:     Sunday till Monday
User:           SUD
Count Segments:         5
Application:            RHEOSTAR
Tool:           CP
Date/Time:          24.10.2021; 13:37
System:         CP25

Constants:
- Csr [min/s]:          2,5421
- Css [Pa/mNm]:         2,54679

Section:            1
Number measuring points:            0

Time limit:         2 measuring points, drop
            Duration 30 s
Measurement profile:
  Temperature           T[-1] = 25 °C

Section:            2
Number measuring points:            30

Time limit:         30 measuring points
            Duration 2 s

Points  Time    Viscosity   Shear rate  Shear stress    Momentum    Status
    [s] [Pa·s]  [1/s]   [Pa]    [mNm]   []
1   62  10,93   100 1.090   4,45    TGC,Dy_
2   64  11,05   100 1.100   4,5 TGC,Dy_
3   66  11,07   100 1.110   4,51    TGC,Dy_
4   68  11,05   100 1.100   4,5 TGC,Dy_
5   70  10,99   100 1.100   4,47    TGC,Dy_
6   72  10,92   100 1.090   4,44    TGC,Dy_


Section:            3
Number measuring points:            0

Time limit:         2 measuring points, drop
            Duration 60 s

Section:            4
Number measuring points:            30

Time limit:         30 measuring points
            Duration 2 s

Points  Time    Viscosity   Shear rate  Shear stress    Momentum    Status
    [s] [Pa·s]  [1/s]   [Pa]    [mNm]   []
*** 1 ***   242 -6,334E+6   -0,0000115  72,7    0,296   TGC,Dy_
2   244 63,94   10,3    661 2,69    TGC,Dy_
3   246 35,56   20,7    736 2,99    TGC,Dy_
4   248 25,25   31  784 3,19    TGC,Dy_
5   250 19,82   41,4    820 3,34    TGC,Dy_


Section:            5
Number measuring points:            300

Time limit:         300 measuring points
            Duration 1 s

Points  Time    Viscosity   Shear rate  Shear stress    Momentum    Status
    [s] [Pa·s]  [1/s]   [Pa]    [mNm]   []
1   301 4,142   300 1.240   5,06    TGC,Dy_
2   302 4,139   300 1.240   5,05    TGC,Dy_
3   303 4,138   300 1.240   5,05    TGC,Dy_
4   304 4,141   300 1.240   5,06    TGC,Dy_
5   305 4,156   300 1.250   5,07    TGC,Dy_
6   306 4,153   300 1.250   5,07    TGC,Dy_


Data-Information
Name:           Polymer B
Sample:     Monday till Tuesday
User:           SUD
Count Segments:         5
Application:            RHEOSTAR
Tool:           CP
Date/Time:          24.10.2021; 13:37
System:         CP25

Constants:
- Csr [min/s]:          2,5421
- Css [Pa/mNm]:         2,54679

Section:            1
Number measuring points:            0

Time limit:         2 measuring points, drop
            Duration 30 s
Measurement profile:
  Temperature           T[-1] = 25 °C

Section:            2
Number measuring points:            30

Time limit:         30 measuring points
            Duration 2 s

Points  Time    Viscosity   Shear rate  Shear stress    Momentum    Status
    [s] [Pa·s]  [1/s]   [Pa]    [mNm]   []
1   62  10,93   100 1.090   4,45    TGC,Dy_
2   64  11,05   100 1.100   4,5 TGC,Dy_
3   66  11,07   100 1.110   4,51    TGC,Dy_
4   68  11,05   100 1.100   4,5 TGC,Dy_
5   70  10,99   100 1.100   4,47    TGC,Dy_
6   72  10,92   100 1.090   4,44    TGC,Dy_


Section:            3
Number measuring points:            0

Time limit:         2 measuring points, drop
            Duration 60 s

Section:            4
Number measuring points:            30

Time limit:         30 measuring points
            Duration 2 s

Points  Time    Viscosity   Shear rate  Shear stress    Momentum    Status
    [s] [Pa·s]  [1/s]   [Pa]    [mNm]   []
*** 1 ***   242 -6,334E+6   -0,0000115  72,7    0,296   TGC,Dy_
2   244 63,94   10,3    661 2,69    TGC,Dy_
3   246 35,56   20,7    736 2,99    TGC,Dy_
4   248 25,25   31  784 3,19    TGC,Dy_
5   250 19,82   41,4    820 3,34    TGC,Dy_


Section:            5
Number measuring points:            300

Time limit:         300 measuring points
            Duration 1 s

Points  Time    Viscosity   Shear rate  Shear stress    Momentum    Status
    [s] [Pa·s]  [1/s]   [Pa]    [mNm]   []
1   301 4,142   300 1.240   5,06    TGC,Dy_
2   302 4,139   300 1.240   5,05    TGC,Dy_
3   303 4,138   300 1.240   5,05    TGC,Dy_
4   304 4,141   300 1.240   5,06    TGC,Dy_
5   305 4,156   300 1.250   5,07    TGC,Dy_
6   306 4,153   300 1.250   5,07    TGC,Dy_

CodePudding user response:

One option is to do it in 2 steps.

First get all the Data-Information parts using a pattern that starts with Data-Information and matches all following lines that do not start with Data-Information.

^Data-Information(?:\n(?!Data-Information$).*)*

Regex demo for Data-Information
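
As a rough illustration of this first step in Python, the sketch below splits a text into its Data-Information parts with re.finditer. The text variable is a shortened, hypothetical stand-in for the real file contents, and it assumes tab-separated fields. Note that re.MULTILINE is needed so that ^ and $ match at line boundaries rather than only at the start and end of the whole string.

```python
import re

# Hypothetical, shortened stand-in for the real file contents.
text = (
    "Data-Information\n"
    "Name:\t\tPolymer A\n"
    "Sample:\tSunday till Monday\n"
    "System:\t\tCP25\n"
    "\n"
    "Data-Information\n"
    "Name:\t\tPolymer B\n"
    "Sample:\tMonday till Tuesday\n"
    "System:\t\tCP25\n"
)

# A line "Data-Information", followed by every line that is not
# exactly "Data-Information".
block_re = re.compile(
    r"^Data-Information(?:\n(?!Data-Information$).*)*",
    re.MULTILINE,
)

blocks = [m.group(0) for m in block_re.finditer(text)]

for block in blocks:
    # Show the second line (the Name line) of every part.
    print(block.split("\n")[1])
```

Each element of blocks then holds one complete record, which can be searched on its own in the second step.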

Then, for every part, you can match the line that starts with Points, followed by all subsequent lines that contain at least one character (no empty lines).

^Points\b.*(?:\n.+)+

Regex demo for Points
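
A minimal sketch of this second step, under a few assumptions: the columns are separated by tabs, the line after the Points header holds the units, and the numbers use "." as thousands separator and "," as decimal separator (inferred from the sample, where a shear stress of 1.090 matches viscosity times shear rate). The names block, parse_points and to_float are made up for this example.

```python
import re

# Hypothetical excerpt of one Data-Information part (tab-separated).
block = "\n".join([
    "Data-Information",
    "Name:\t\tPolymer A",
    "Section:\t\t2",
    "",
    "Points\tTime\tViscosity\tShear rate\tShear stress\tMomentum\tStatus",
    "\t[s]\t[Pa·s]\t[1/s]\t[Pa]\t[mNm]\t[]",
    "1\t62\t10,93\t100\t1.090\t4,45\tTGC,Dy_",
    "2\t64\t11,05\t100\t1.100\t4,5\tTGC,Dy_",
    "",
    "Section:\t\t5",
    "",
    "Points\tTime\tViscosity\tShear rate\tShear stress\tMomentum\tStatus",
    "\t[s]\t[Pa·s]\t[1/s]\t[Pa]\t[mNm]\t[]",
    "1\t301\t4,142\t300\t1.240\t5,06\tTGC,Dy_",
])

# The Points line plus all directly following non-empty lines.
points_re = re.compile(r"^Points\b.*(?:\n.+)+", re.MULTILINE)


def to_float(value):
    """Convert '1.090' to 1090.0 and '10,93' to 10.93 (assumed number format)."""
    return float(value.replace(".", "").replace(",", "."))


def parse_points(part):
    """Return one list of row dicts per Points table found in the part."""
    tables = []
    for match in points_re.finditer(part):
        lines = match.group(0).split("\n")
        header = lines[0].split("\t")
        # lines[1] is the units row; the data rows start at index 2.
        rows = [dict(zip(header, line.split("\t"))) for line in lines[2:]]
        tables.append(rows)
    return tables


tables = parse_points(block)
print(len(tables), "tables found")
print(to_float(tables[0][0]["Viscosity"]))
```

Each list of row dicts could then be passed to pandas.DataFrame, with to_float applied to the numeric columns. Rows whose point number is flagged, such as "*** 1 ***", would need extra cleaning before any numeric conversion.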
