I have quite a big data file that is not in a good state for further processing. I want to use a regex to extract the relevant parts and then load them into pandas for further analysis.
The Data-Information segment repeats itself within the file and contains the necessary information.
My approach so far was to use a regex to get some header information out of it. What I'm still missing are the three sections of data points. I only need the part from the Points header line to the last data point. How could I capture these sections into one or multiple groups?
^(?:Data-Information.*)
(?:\nName:\t+)(?P<Name>.+)
(?:\nSample:\t+)(?P<Sample>.+)
((?:\r?\n.+)+)
(?:\nSystem:\t+)(?P<System>.+)
(?:\r?\n(?!Data-Information).*)*
Sample file
Data-Information
Name: Polymer A
Sample: Sunday till Monday
User: SUD
Count Segments: 5
Application: RHEOSTAR
Tool: CP
Date/Time: 24.10.2021; 13:37
System: CP25
Constants:
- Csr [min/s]: 2,5421
- Css [Pa/mNm]: 2,54679
Section: 1
Number measuring points: 0
Time limit: 2 measuring points, drop
Duration 30 s
Measurement profile:
Temperature T[-1] = 25 °C
Section: 2
Number measuring points: 30
Time limit: 30 measuring points
Duration 2 s
Points Time Viscosity Shear rate Shear stress Momentum Status
[s] [Pa·s] [1/s] [Pa] [mNm] []
1 62 10,93 100 1.090 4,45 TGC,Dy_
2 64 11,05 100 1.100 4,5 TGC,Dy_
3 66 11,07 100 1.110 4,51 TGC,Dy_
4 68 11,05 100 1.100 4,5 TGC,Dy_
5 70 10,99 100 1.100 4,47 TGC,Dy_
6 72 10,92 100 1.090 4,44 TGC,Dy_
Section: 3
Number measuring points: 0
Time limit: 2 measuring points, drop
Duration 60 s
Section: 4
Number measuring points: 30
Time limit: 30 measuring points
Duration 2 s
Points Time Viscosity Shear rate Shear stress Momentum Status
[s] [Pa·s] [1/s] [Pa] [mNm] []
*** 1 *** 242 -6,334E 6 -0,0000115 72,7 0,296 TGC,Dy_
2 244 63,94 10,3 661 2,69 TGC,Dy_
3 246 35,56 20,7 736 2,99 TGC,Dy_
4 248 25,25 31 784 3,19 TGC,Dy_
5 250 19,82 41,4 820 3,34 TGC,Dy_
Section: 5
Number measuring points: 300
Time limit: 300 measuring points
Duration 1 s
Points Time Viscosity Shear rate Shear stress Momentum Status
[s] [Pa·s] [1/s] [Pa] [mNm] []
1 301 4,142 300 1.240 5,06 TGC,Dy_
2 302 4,139 300 1.240 5,05 TGC,Dy_
3 303 4,138 300 1.240 5,05 TGC,Dy_
4 304 4,141 300 1.240 5,06 TGC,Dy_
5 305 4,156 300 1.250 5,07 TGC,Dy_
6 306 4,153 300 1.250 5,07 TGC,Dy_
Data-Information
Name: Polymer B
Sample: Monday till Tuesday
User: SUD
Count Segments: 5
Application: RHEOSTAR
Tool: CP
Date/Time: 24.10.2021; 13:37
System: CP25
Constants:
- Csr [min/s]: 2,5421
- Css [Pa/mNm]: 2,54679
Section: 1
Number measuring points: 0
Time limit: 2 measuring points, drop
Duration 30 s
Measurement profile:
Temperature T[-1] = 25 °C
Section: 2
Number measuring points: 30
Time limit: 30 measuring points
Duration 2 s
Points Time Viscosity Shear rate Shear stress Momentum Status
[s] [Pa·s] [1/s] [Pa] [mNm] []
1 62 10,93 100 1.090 4,45 TGC,Dy_
2 64 11,05 100 1.100 4,5 TGC,Dy_
3 66 11,07 100 1.110 4,51 TGC,Dy_
4 68 11,05 100 1.100 4,5 TGC,Dy_
5 70 10,99 100 1.100 4,47 TGC,Dy_
6 72 10,92 100 1.090 4,44 TGC,Dy_
Section: 3
Number measuring points: 0
Time limit: 2 measuring points, drop
Duration 60 s
Section: 4
Number measuring points: 30
Time limit: 30 measuring points
Duration 2 s
Points Time Viscosity Shear rate Shear stress Momentum Status
[s] [Pa·s] [1/s] [Pa] [mNm] []
*** 1 *** 242 -6,334E 6 -0,0000115 72,7 0,296 TGC,Dy_
2 244 63,94 10,3 661 2,69 TGC,Dy_
3 246 35,56 20,7 736 2,99 TGC,Dy_
4 248 25,25 31 784 3,19 TGC,Dy_
5 250 19,82 41,4 820 3,34 TGC,Dy_
Section: 5
Number measuring points: 300
Time limit: 300 measuring points
Duration 1 s
Points Time Viscosity Shear rate Shear stress Momentum Status
[s] [Pa·s] [1/s] [Pa] [mNm] []
1 301 4,142 300 1.240 5,06 TGC,Dy_
2 302 4,139 300 1.240 5,05 TGC,Dy_
3 303 4,138 300 1.240 5,05 TGC,Dy_
4 304 4,141 300 1.240 5,06 TGC,Dy_
5 305 4,156 300 1.250 5,07 TGC,Dy_
6 306 4,153 300 1.250 5,07 TGC,Dy_
CodePudding user response:
One option is to do it in 2 steps.
First, get all the Data-Information parts using a pattern that starts with Data-Information and matches all following lines that are not themselves a Data-Information line.
^Data-Information(?:\n(?!Data-Information$).*)*
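As a minimal sketch of this first step in Python, assuming the file content is already in a string: the `text` below is a hypothetical, shortened stand-in for the real export, and `record_re` and `records` are illustrative names. The pattern needs `re.MULTILINE` so that `^` and `$` anchor at line boundaries.

```python
import re

# Hypothetical miniature of the export; the real content would come from
# reading the file into a string first.
text = """Data-Information
Name: Polymer A
System: CP25
Section: 2
Points Time Viscosity
1 62 10,93
Data-Information
Name: Polymer B
System: CP25
Section: 2
Points Time Viscosity
1 62 10,93
"""

# One match per record. The negative lookahead stops the repeated group
# as soon as the next line is exactly "Data-Information".
record_re = re.compile(
    r"^Data-Information(?:\n(?!Data-Information$).*)*",
    re.MULTILINE,
)

# The pattern has no capturing groups, so findall returns the full matches.
records = record_re.findall(text)

for record in records:
    # Second line of each record is the "Name:" line in this sample.
    print(record.splitlines()[1])
```

Each element of `records` is then one complete Data-Information part that can be searched on its own.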
Then, for every part, you can match the line that starts with Points, and then match all following lines that contain at least one character (no empty lines).
^Points\b.*(?:\n.+)+
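A rough sketch of this second step follows. It rests on two assumptions that are not confirmed by the sample as posted: that every data block is followed by an empty line (which is what the pattern relies on to stop), and that the columns are tab-separated. The `record` string is a hypothetical, shortened excerpt, and `points_re` and `tables` are illustrative names.

```python
import re

# Hypothetical excerpt of one record. Assumptions: columns are separated
# by tabs, and each data block ends at an empty line.
record = (
    "Section: 2\n"
    "Duration 2 s\n"
    "\n"
    "Points\tTime\tViscosity\n"
    "\t[s]\t[Pa·s]\n"
    "1\t62\t10,93\n"
    "2\t64\t11,05\n"
    "\n"
    "Section: 3\n"
    "Duration 60 s\n"
    "\n"
    "Section: 4\n"
    "Points\tTime\tViscosity\n"
    "\t[s]\t[Pa·s]\n"
    "1\t242\t63,94\n"
)

# Match the "Points" header line plus every following non-empty line.
points_re = re.compile(r"^Points\b.*(?:\n.+)+", re.MULTILINE)

tables = []
for block in points_re.findall(record):
    # First line holds the column names, second line the units, rest the data.
    header, _units, *rows = block.split("\n")
    columns = header.split("\t")
    tables.append([dict(zip(columns, row.split("\t"))) for row in rows])

print(len(tables), "data blocks found")
```

Each entry in `tables` is a list of dicts, which `pandas.DataFrame(...)` accepts directly. The values are still strings with decimal commas, so they would need converting before numeric analysis.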