Parse a file using regex

Time:06-11

I have a large text file. It is basically a CSV file, but it contains many different sections, so it does not look like a proper CSV to me. Part of the file is given below:

7.27.27.2. Frame Counts: 2

Timestamp,Transmitted,Received Seconds,Frames,
1.818,"47,702","24,026"
2.847,"121,038","66,424"
3.818,"192,749","105,993"
4.851,"270,454","147,068"
5.817,"343,582","184,994"
6.818,"422,937","227,679"
7.847,"494,787","268,220"
8.847,"568,388","307,350"
9.818,"636,640","344,092"
10.824,"712,211","383,849"
11.846,"786,823","423,941"
12.818,"863,526","465,542"
13.847,"936,019","504,298"
14.847,"1,007,358","543,600"
15.847,"1,072,079","578,770"
16.847,"1,135,907","613,742"
17.847,"1,204,749","649,329"
18.817,"1,269,150","684,052"
19.817,"1,340,923","720,234"
20.860,"1,409,920","758,060"
21.847,"1,480,912","798,166"
22.101,"1,491,235","803,900"
23.108,"1,491,235","803,900"
7.27.28. Frame Rate

Rates can vary due to round-off errors in calculations. Timestamp,Transmit rate,Receive rate Seconds,Frames/s,
1.818,"39,450","39,390"
2.847,"112,400","112,500"
3.818,"114,600","114,600"
4.851,"115,000","115,000"
5.817,"115,000","114,900"
6.818,"121,900","121,600"
7.847,"109,200","109,500"
8.847,"112,700","112,600"
9.818,"108,100","108,200"
10.824,"114,700","114,600"
11.846,"112,200","112,200"
12.818,"121,700","121,700"
13.847,"108,100","108,100"
14.847,"110,600","110,600"
15.847,"99,900","99,770"
16.847,"98,790","98,910"
17.847,"104,400","104,400"
18.817,"102,200","102,300"
19.817,"108,000","108,000"
20.860,"102,400","102,400"
21.847,"112,500","112,600"
22.101,"63,410","63,470"
23.108,0.00,0.00
7.27.28.1. Frame Rate: 1








Test Model: IPSEC-JENKINS Version: 53 Result: canceled Date: June 10, 2022 5:10:46 AM PDT Test Duration: 00:00:25.436
7. Test Results for IPSEC
7.1. Component Description Component: Application Simulator


Component,Resource Used IPSEC,np3-0
7.2. Test Component Criteria Number,Description 1,The total number of sessions opened must reach the specified target within the allotted time.: (maxConcurrentAppFlows>=sessions.target) 2,The total number of failed application transactions must be no more than 5 percent of the attempted application transactions.: ((appUnsuccessful*100)<=(appAttempted*5)) 3,The session rate must reach the specified target within the allotted time.: (maxAppFlowRate>=sessions.targetPerSecond)
7.3. Settings Parameter,Value Resource Percentage,50 Application Profile,MixCISCO MIX 4451 Delay Start,00:00:00 Data Rate/Data Rate Unlimited,false Data Rate/Data Rate Scope,Limit Aggregate Throughput Data Rate/Data Rate Unit,Megabits / Second Data Rate/Data Rate Type,Constant Data Rate/Minimum Data Rate,10000 Data Rate/Maximum Data Rate,10000 Session/Super Flow Configuration/Maximum Simultaneous Super Flows,1030 Session/Super Flow Configuration/Maximum Simultaneous Active Flows,0 Session/Super Flow Configuration/Maximum Super Flows Per Second,1030 Session/Super Flow Configuration/Unlimited Super Flow Open Rate,false Session/Super Flow Configuration/Unlimited Super Flow Close Rate,false Session/Super Flow Configuration/Target Minimum Simultaneous Flows,1 Session/Super Flow Configuration/Target Minimum Super Flows Per Second,1 Session/Super Flow Configuration/Target Number of Successful Matches,0 Session/Super Flow Configuration/Engine Selection,Advanced (Max Features) Session/Super Flow Configuration/Performance Emphasis,Balanced Session/Super Flow Configuration/Resource Allocation Override,Automatic Session/Super Flow Configuration/Statistic Detail,Maximum App Configuration/Remove all DNS actions,false App Configuration/Streams Per Super Flow,1 App Configuration/Content Fidelity,Normal App Configuration/Replace Streams at Runtime,true Source Port/Port Distribution Type,Random Source Port/Minimum Port Number,1024 Source Port/Maximum Port Number,65535 TCP Configuration/Maximum Segment Size (MSS),1260 TCP Configuration/Aging Time Data Type,Seconds TCP Configuration/Aging Time,0 TCP Configuration/Reset at End,false TCP Configuration/Retry Quantum,500 TCP Configuration/Retry Count,3 TCP Configuration/Delay ACKs,true TCP Configuration/Disable Piggy-back data on ACK (experimental),false TCP Configuration/Delayed ACKs ms,0 TCP Configuration/ACK every N (experimental),0 TCP Configuration/Initial Receive Window,5792 TCP Configuration/TCP Window Scale,0 TCP 
Configuration/Dynamic Receive Window Size,true TCP Configuration/Add Segment Timestamps,true TCP Configuration/Piggy-back Data on 3-way Handshake ACK,false TCP Configuration/Piggy-back Data on Shutdown FIN,false TCP Configuration/Initial Congestion Window,4 TCP Configuration/Explicit Congestion Notification,Support ECN TCP Configuration/Raw Flags,-1 TCP Configuration/Connect Delay,0 TCP Configuration/TCP Keepalive Timer,0 TCP Configuration/4-way Close,false TCP Configuration/Send PSH with all data segments,false IPv4 Configuration/TTL,32 IPv4 Configuration/TOS/DSCP,0x0 IPv6 Configuration/Hop Limit,64 IPv6 Configuration/Traffic Class,0x0 IPv6 Configuration/Flow Label,0x0 SSL Configuration/Session Reuse Capacity,Low SSL Configuration/Server Record Length,0 SSL Configuration/Client Record Length,0 Ramp Up Profile/Ramp Up Profile Type,Calculated Ramp Up Profile/Min Connection Rate,1 Ramp Up Profile/Max Connection Rate,1 Ramp Up Profile/Increment n Connections per Interval,1 Ramp Up Profile/Fixed Time Interval,00:00:01 Session Ramp Distribution/Ramp Up Behavior,Full Open Session Ramp Distribution/SYN Only Retry Mode,Obey Retry Count Session Ramp Distribution/Ramp Up Duration,00:00:00 Session Ramp Distribution/Steady-State Behavior,Open and Close Sessions Session Ramp Distribution/Steady-State Time Interval,00:02:15 Session Ramp Distribution/Ramp Down Behavior,Full Close Session Ramp Distribution/Ramp Down Time Interval,00:00:05 Experimental Advanced Settings/TCP Segments Credit,32 Experimental Advanced Settings/Send maximum size segments when possible,false Load Profile/,None Preset the component was created from,Appsim Default
7.4. App Profile Summary Weighted by flows Name,Weight,% Bandwidth,% Flows,Bytes,Flows,Seed CISCO MARCH G729 - DIA,"15,392",,,,,1 CISCO MARCH HTTP APPLICATION - DIA,"6,453",,,,,1 CISCO MARCH HTTP 32K GET - DIA,"14,969",,,,,1 CISCO MARCH HTTPS 16K - DIA,"31,729",,,,,1 CISCO MARCH CITRIX - DIA,282,,,,,1 CISCO MARCH HTTPS 64K - DIA,"9,130",,,,,1 CISCO MARCH MS-EXCHANGE - DIA,"13,212",,,,,1 CISCO MARCH HTTPS Live Streaming - DIA,584,,,,,1 CISCO MARCH HTTPS 1024K - DIA,617,,,,,1 CISCO MARCH H264 Video New - DIA,"6,576",,,,,1 CISCO MARCH POP3BANDWIDTH,95,,,,,1 CISCO MARCH SMTP,956,,,,,1
7.5. Traffic Appearance Traffic was addressed as defined in the "IPSEC-CURIE" network neighborhood. Interface,Traffic Direction,Network Domain,VLAN,Address Range 1,Client,CLIENT,,2.0.0.10
- 2.0.0.109 2,Server,SERVER,,5.0.0.10 - 5.0.0.109
7.6. Component Results Component,Result IPSEC,canceled
7.7. Application Aggregate Flows

There may be slices in this graph that are too small to be displayed. Protocol,Aggregate Flows (Flows),Aggregate Flows (%) SMTP,242,1.101% RTP,295,1.342% DNS,185,0.842% POP3-Advanced,25,0.114% HTTP,"17,440",79.345% Citrix,69,0.314% Microsoft Exchange,"3,724",16.943%

I want to extract the content of section 7.27.28, which is this:

1.818,"39,450","39,390"
2.847,"112,400","112,500"
3.818,"114,600","114,600"
4.851,"115,000","115,000"
5.817,"115,000","114,900"
6.818,"121,900","121,600"
7.847,"109,200","109,500"
8.847,"112,700","112,600"
9.818,"108,100","108,200"
10.824,"114,700","114,600"
11.846,"112,200","112,200"
12.818,"121,700","121,700"
13.847,"108,100","108,100"
14.847,"110,600","110,600"
15.847,"99,900","99,770"
16.847,"98,790","98,910"
17.847,"104,400","104,400"
18.817,"102,200","102,300"
19.817,"108,000","108,000"
20.860,"102,400","102,400"
21.847,"112,500","112,600"
22.101,"63,410","63,470"
23.108,0.00,0.00

To read the above data, I am thinking of using a regex to isolate the section and then parsing it with csv, but the code below is not working:

pattern = r"""7.27.28. Frame Rate

Rates can vary due to round-off errors in calculations.
Timestamp,Transmit rate,Receive rate
Seconds,Frames/s,
(.*)
7.27.28.1. Frame Rate: 1"""
match = re.search(pattern, all_of_it)
print(match.group(1))

Please let me know the proper pattern for it, or whether there is another way to extract the data.

CodePudding user response:

This is not an answer with regex, but might still be useful.

The key trick here is to use text.split("\n\n") to partition on blank lines, and then pick the segment of interest using startswith.

text = """
7.27.27.2. Frame Counts: 2

Timestamp,Transmitted,Received Seconds,Frames,
1.818,"47,702","24,026"
2.847,"121,038","66,424"
3.818,"192,749","105,993"
4.851,"270,454","147,068"
5.817,"343,582","184,994"
6.818,"422,937","227,679"
7.847,"494,787","268,220"
8.847,"568,388","307,350"
9.818,"636,640","344,092"
10.824,"712,211","383,849"
11.846,"786,823","423,941"
12.818,"863,526","465,542"
13.847,"936,019","504,298"
14.847,"1,007,358","543,600"
15.847,"1,072,079","578,770"
16.847,"1,135,907","613,742"
17.847,"1,204,749","649,329"
18.817,"1,269,150","684,052"
19.817,"1,340,923","720,234"
20.860,"1,409,920","758,060"
21.847,"1,480,912","798,166"
22.101,"1,491,235","803,900"
23.108,"1,491,235","803,900"
7.27.28. Frame Rate

Rates can vary due to round-off errors in calculations. Timestamp,Transmit rate,Receive rate Seconds,Frames/s,
1.818,"39,450","39,390"
2.847,"112,400","112,500"
3.818,"114,600","114,600"
4.851,"115,000","115,000"
5.817,"115,000","114,900"
6.818,"121,900","121,600"
7.847,"109,200","109,500"
8.847,"112,700","112,600"
9.818,"108,100","108,200"
10.824,"114,700","114,600"
11.846,"112,200","112,200"
12.818,"121,700","121,700"
13.847,"108,100","108,100"
14.847,"110,600","110,600"
15.847,"99,900","99,770"
16.847,"98,790","98,910"
17.847,"104,400","104,400"
18.817,"102,200","102,300"
19.817,"108,000","108,000"
20.860,"102,400","102,400"
21.847,"112,500","112,600"
22.101,"63,410","63,470"
23.108,0.00,0.00
7.27.28.1. Frame Rate: 1








Test Model: IPSEC-JENKINS Version: 53 Result: canceled Date: June 10, 2022 5:10:46 AM PDT Test Duration: 00:00:25.436
7. Test Results for IPSEC
7.1. Component Description Component: Application Simulator


Component,Resource Used IPSEC,np3-0
7.2. Test Component Criteria Number,Description 1,The total number of sessions opened must reach the specified target within the allotted time.: (maxConcurrentAppFlows>=sessions.target) 2,The total number of failed application transactions must be no more than 5 percent of the attempted application transactions.: ((appUnsuccessful*100)<=(appAttempted*5)) 3,The session rate must reach the specified target within the allotted time.: (maxAppFlowRate>=sessions.targetPerSecond)
7.3. Settings Parameter,Value Resource Percentage,50 Application Profile,MixCISCO MIX 4451 Delay Start,00:00:00 Data Rate/Data Rate Unlimited,false Data Rate/Data Rate Scope,Limit Aggregate Throughput Data Rate/Data Rate Unit,Megabits / Second Data Rate/Data Rate Type,Constant Data Rate/Minimum Data Rate,10000 Data Rate/Maximum Data Rate,10000 Session/Super Flow Configuration/Maximum Simultaneous Super Flows,1030 Session/Super Flow Configuration/Maximum Simultaneous Active Flows,0 Session/Super Flow Configuration/Maximum Super Flows Per Second,1030 Session/Super Flow Configuration/Unlimited Super Flow Open Rate,false Session/Super Flow Configuration/Unlimited Super Flow Close Rate,false Session/Super Flow Configuration/Target Minimum Simultaneous Flows,1 Session/Super Flow Configuration/Target Minimum Super Flows Per Second,1 Session/Super Flow Configuration/Target Number of Successful Matches,0 Session/Super Flow Configuration/Engine Selection,Advanced (Max Features) Session/Super Flow Configuration/Performance Emphasis,Balanced Session/Super Flow Configuration/Resource Allocation Override,Automatic Session/Super Flow Configuration/Statistic Detail,Maximum App Configuration/Remove all DNS actions,false App Configuration/Streams Per Super Flow,1 App Configuration/Content Fidelity,Normal App Configuration/Replace Streams at Runtime,true Source Port/Port Distribution Type,Random Source Port/Minimum Port Number,1024 Source Port/Maximum Port Number,65535 TCP Configuration/Maximum Segment Size (MSS),1260 TCP Configuration/Aging Time Data Type,Seconds TCP Configuration/Aging Time,0 TCP Configuration/Reset at End,false TCP Configuration/Retry Quantum,500 TCP Configuration/Retry Count,3 TCP Configuration/Delay ACKs,true TCP Configuration/Disable Piggy-back data on ACK (experimental),false TCP Configuration/Delayed ACKs ms,0 TCP Configuration/ACK every N (experimental),0 TCP Configuration/Initial Receive Window,5792 TCP Configuration/TCP Window Scale,0 TCP 
Configuration/Dynamic Receive Window Size,true TCP Configuration/Add Segment Timestamps,true TCP Configuration/Piggy-back Data on 3-way Handshake ACK,false TCP Configuration/Piggy-back Data on Shutdown FIN,false TCP Configuration/Initial Congestion Window,4 TCP Configuration/Explicit Congestion Notification,Support ECN TCP Configuration/Raw Flags,-1 TCP Configuration/Connect Delay,0 TCP Configuration/TCP Keepalive Timer,0 TCP Configuration/4-way Close,false TCP Configuration/Send PSH with all data segments,false IPv4 Configuration/TTL,32 IPv4 Configuration/TOS/DSCP,0x0 IPv6 Configuration/Hop Limit,64 IPv6 Configuration/Traffic Class,0x0 IPv6 Configuration/Flow Label,0x0 SSL Configuration/Session Reuse Capacity,Low SSL Configuration/Server Record Length,0 SSL Configuration/Client Record Length,0 Ramp Up Profile/Ramp Up Profile Type,Calculated Ramp Up Profile/Min Connection Rate,1 Ramp Up Profile/Max Connection Rate,1 Ramp Up Profile/Increment n Connections per Interval,1 Ramp Up Profile/Fixed Time Interval,00:00:01 Session Ramp Distribution/Ramp Up Behavior,Full Open Session Ramp Distribution/SYN Only Retry Mode,Obey Retry Count Session Ramp Distribution/Ramp Up Duration,00:00:00 Session Ramp Distribution/Steady-State Behavior,Open and Close Sessions Session Ramp Distribution/Steady-State Time Interval,00:02:15 Session Ramp Distribution/Ramp Down Behavior,Full Close Session Ramp Distribution/Ramp Down Time Interval,00:00:05 Experimental Advanced Settings/TCP Segments Credit,32 Experimental Advanced Settings/Send maximum size segments when possible,false Load Profile/,None Preset the component was created from,Appsim Default
7.4. App Profile Summary Weighted by flows Name,Weight,% Bandwidth,% Flows,Bytes,Flows,Seed CISCO MARCH G729 - DIA,"15,392",,,,,1 CISCO MARCH HTTP APPLICATION - DIA,"6,453",,,,,1 CISCO MARCH HTTP 32K GET - DIA,"14,969",,,,,1 CISCO MARCH HTTPS 16K - DIA,"31,729",,,,,1 CISCO MARCH CITRIX - DIA,282,,,,,1 CISCO MARCH HTTPS 64K - DIA,"9,130",,,,,1 CISCO MARCH MS-EXCHANGE - DIA,"13,212",,,,,1 CISCO MARCH HTTPS Live Streaming - DIA,584,,,,,1 CISCO MARCH HTTPS 1024K - DIA,617,,,,,1 CISCO MARCH H264 Video New - DIA,"6,576",,,,,1 CISCO MARCH POP3BANDWIDTH,95,,,,,1 CISCO MARCH SMTP,956,,,,,1
7.5. Traffic Appearance Traffic was addressed as defined in the "IPSEC-CURIE" network neighborhood. Interface,Traffic Direction,Network Domain,VLAN,Address Range 1,Client,CLIENT,,2.0.0.10
- 2.0.0.109 2,Server,SERVER,,5.0.0.10 - 5.0.0.109
7.6. Component Results Component,Result IPSEC,canceled
7.7. Application Aggregate Flows

There may be slices in this graph that are too small to be displayed. Protocol,Aggregate Flows (Flows),Aggregate Flows (%) SMTP,242,1.101% RTP,295,1.342% DNS,185,0.842% POP3-Advanced,25,0.114% HTTP,"17,440",79.345% Citrix,69,0.314% Microsoft Exchange,"3,724",16.943%
"""

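from io import StringIO

from pandas import read_csv

# Pick the blank-line-separated segment that holds the Frame Rate table.
segment = next(s for s in text.split("\n\n") if s.startswith("Rates"))

# Keep only the data rows: this drops the flattened header line and the
# trailing "7.27.28.1. Frame Rate: 1" heading, which contains no comma.
rows = [r for r in segment.splitlines() if r[:1].isdigit() and "," in r]

# thousands="," turns quoted values such as "39,450" into numbers.
df = read_csv(
    StringIO("\n".join(rows)),
    names=["Timestamp", "Transmit rate", "Receive rate"],
    thousands=",",
)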
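
Since the question mentions the csv module, the same split-and-startswith idea can also be sketched with only the standard library. This is a minimal sketch, not a definitive implementation: the `sample` string is a shortened stand-in for the real file, `frame_rate_rows` is a hypothetical helper name, and it assumes the section is separated from its neighbours by blank lines as in the excerpt.

```python
import csv
from io import StringIO

# Shortened stand-in for the report; only a few data rows are kept.
sample = """7.27.28. Frame Rate

Rates can vary due to round-off errors in calculations. Timestamp,Transmit rate,Receive rate Seconds,Frames/s,
1.818,"39,450","39,390"
2.847,"112,400","112,500"
23.108,0.00,0.00
7.27.28.1. Frame Rate: 1

Test Model: IPSEC-JENKINS Version: 53
"""


def frame_rate_rows(text):
    """Return (timestamp, transmit, receive) float tuples from the 'Rates' segment."""
    segment = next((s for s in text.split("\n\n") if s.startswith("Rates")), None)
    if segment is None:
        return []
    rows = []
    for fields in csv.reader(StringIO(segment)):
        # Data rows have exactly three fields; the flattened header line
        # and the next section heading do not, so they are skipped.
        if len(fields) != 3:
            continue
        try:
            # Strip the thousands separators before converting to float.
            rows.append(tuple(float(f.replace(",", "")) for f in fields))
        except ValueError:
            continue
    return rows


rows = frame_rate_rows(sample)
print(rows)
```

The csv reader takes care of the quoted fields, so a value such as "39,450" arrives as one field rather than two.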

CodePudding user response:

Here is a regex-based solution. There may be a way to do it with less regex, but I am not an expert. Apologies for the variable naming.

import re

text = "all your text"  # placeholder: replace with the contents of the file

LONG_LINE = "Rates can vary due to round-off errors in calculations. Timestamp,Transmit rate,Receive rate Seconds,Frames/s,"
LAST_ROW = "7.27.28.1. Frame Rate: 1"
# re.escape keeps the dots in the markers literal; re.DOTALL lets .* span newlines.
regex = re.compile(f"({re.escape(LONG_LINE)})(.*?)({re.escape(LAST_ROW)})", re.DOTALL)
m = regex.search(text)
your_section = m.group(2)

# Start at the first line that begins with a digit, in case anything precedes the rows.
regex2 = re.compile(r"^\d.*", re.MULTILINE | re.DOTALL)
m2 = regex2.search(your_section)
print(m2.group(0).strip())

Output:

1.818,"39,450","39,390"
2.847,"112,400","112,500"
3.818,"114,600","114,600"
4.851,"115,000","115,000"
5.817,"115,000","114,900"
6.818,"121,900","121,600"
7.847,"109,200","109,500"
8.847,"112,700","112,600"
9.818,"108,100","108,200"
10.824,"114,700","114,600"
11.846,"112,200","112,200"
12.818,"121,700","121,700"
13.847,"108,100","108,100"
14.847,"110,600","110,600"
15.847,"99,900","99,770"
16.847,"98,790","98,910"
17.847,"104,400","104,400"
18.817,"102,200","102,300"
19.817,"108,000","108,000"
20.860,"102,400","102,400"
21.847,"112,500","112,600"
22.101,"63,410","63,470"
23.108,0.00,0.00
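
For reference, the pattern in the question cannot match because `.` does not match a newline unless re.DOTALL is set, so `(.*)` can capture at most one line. It also expects the header on separate lines, while the excerpt shows it on a single line. A possible alternative, sketched below, anchors on the two section headings only and then picks out the data rows, so it does not depend on how the header is laid out. The names `START`, `END`, `section_re` and `row_re` are illustrative, `sample` is a shortened stand-in for the file, and the sketch assumes each heading sits alone on its line with no trailing whitespace.

```python
import re

# Shortened stand-in for the report; only a few data rows are kept.
sample = """7.27.27.2. Frame Counts: 2

Timestamp,Transmitted,Received Seconds,Frames,
1.818,"47,702","24,026"
7.27.28. Frame Rate

Rates can vary due to round-off errors in calculations. Timestamp,Transmit rate,Receive rate Seconds,Frames/s,
1.818,"39,450","39,390"
2.847,"112,400","112,500"
23.108,0.00,0.00
7.27.28.1. Frame Rate: 1

Test Model: IPSEC-JENKINS Version: 53
"""

START = "7.27.28. Frame Rate"
END = "7.27.28.1. Frame Rate: 1"

# re.escape makes the dots in the headings literal. MULTILINE anchors ^ and $
# at line boundaries; DOTALL lets (.*?) cross newlines.
section_re = re.compile(
    rf"^{re.escape(START)}$(.*?)^{re.escape(END)}$",
    re.MULTILINE | re.DOTALL,
)
# A data row: a timestamp, then two values that are either quoted or plain numbers.
row_re = re.compile(
    r'^\d+\.\d+,(?:"[\d,]+"|[\d.]+),(?:"[\d,]+"|[\d.]+)$',
    re.MULTILINE,
)

match = section_re.search(sample)
rows = row_re.findall(match.group(1)) if match else []
print("\n".join(rows))
```

The resulting lines can then be passed to csv.reader or pandas.read_csv for the actual parsing.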