How to do multi string replace in a multiline text file that has almost a million lines using PowerS

Time:05-27

I have a few log files (mosquitto broker logs) in the following format:

1652855102: New connection from xx.xx.xx.xx on port xxxx.
1652855102: Socket error on client <unknown>, disconnecting.
1652855102: Received PUBLISH from 16838547124974742985 (d0, q1, r0, m235, 's/uc/mpd/16838547124974742985', ... (30 bytes))
1652855102: Sending PUBACK to 16838547124974742985 (m235, rc0)
1652855102: Sending PUBLISH to mqtt_7811e829.e2e5e8 (d0, q1, r0, m42277, 's/uc/mpd/16838547124974742985', ... (30 bytes))
1652855102: Sending PUBLISH to dad3d73-013c-4274-a782-cdd6f2ebbc77 (d0, q0, r0, m0, 's/uc/mpd/16838547124974742985', ... (30 bytes))
1652855102: Received PUBACK from mqtt_7811e829.e2e5e8 (Mid: 42277, RC:0)
1652855103: Received DISCONNECT from 16838547082470259932
1652855103: Client 16838547082470259932 disconnected.

The format is as follows: UTC time in seconds since the epoch, then a colon and a space, then the message body.

I want to transform the log file into:

1652855102|New connection from xx.xx.xx.xx on port xxxx.
1652855102|Socket error on client <unknown>, disconnecting.
1652855102|Received PUBLISH from 16838547124974742985 (d0, q1, r0, m235, 's/uc/mpd/16838547124974742985', ... (30 bytes))
1652855102|Sending PUBACK to 16838547124974742985 (m235, rc0)

I understand that there are 3 groups here to be captured
Group 1 - first 10 digits
Group 2 - a colon and a space
Group 3 - everything after the space (the message)

$RegEx = '(?ms)^([\d]{10})(:\s)(.+)'
(Get-Content -Raw ..\sample-mosquitto.log) -Replace $RegEx, '$1|$3'

My script only replaces the first line and doesn't work for the rest of the lines.

Is there any way to run the capture and replace for all lines without actually doing a for-each?

CodePudding user response:

I suggest using

$RegEx = '^(\d{10}):\s+(.+)'
(Get-Content ..\sample-mosquitto.log) -Replace $RegEx, '$1|$2'

That is, remove -Raw, since you want to operate on a line-by-line basis, and capture only the data that you need to keep. The colon-and-whitespace separator is being replaced anyway, so it needs no capture group. The replacement string is updated accordingly to $1|$2.

See the regex demo.
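
As a minimal sketch of the line-by-line approach: when the left-hand side of -replace is an array, the operator is applied to each element separately, so no explicit loop is needed. The two sample lines are taken from the question; the pattern `^(\d{10}):\s+(.+)` is an assumption about the intended form of the regex above.

```powershell
# Stand-in for what Get-Content (without -Raw) returns: one string per line.
$lines = @(
    '1652855102: New connection from xx.xx.xx.xx on port xxxx.',
    '1652855103: Client 16838547082470259932 disconnected.'
)

# -replace runs once per array element; ^ therefore anchors at the start of each line.
$result = $lines -replace '^(\d{10}):\s+(.+)', '$1|$2'

$result
```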

CodePudding user response:

Your only problem is the inline regex option s, i.e. the SingleLine option. It makes the metacharacter . match any character, including newlines, so your regex matches the entire string, across all lines, in a single match. Without this option, . matches any character other than a newline, which is what you're looking for here.
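
To make the effect of the s option visible, here is a small sketch that counts matches with and without it. The sample string is made up for illustration, and the three-group pattern is assumed to end in `(.+)`.

```powershell
# Two lines joined by a newline, as Get-Content -Raw would return them.
$text = "1652855102: New connection`n1652855103: Client disconnected."

# With (?s), . also matches the newline, so .+ runs to the end of the string: one match.
$withS = [regex]::Matches($text, '(?ms)^(\d{10})(:\s)(.+)').Count

# Without (?s), .+ stops at the end of each line, so each line matches on its own.
$withoutS = [regex]::Matches($text, '(?m)^(\d{10})(:\s)(.+)').Count

"with s: $withS, without s: $withoutS"
```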

Also note that the character class \s matches any whitespace character, and therefore also newlines. While that isn't a problem in your case, it is better to simply match a literal space character verbatim if you know that only a space can occur in that position.
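
A contrived sketch of why the literal space is safer: the input below is hypothetical (a timestamp line with nothing after the colon), but it shows \s consuming a newline and merging two lines.

```powershell
# Hypothetical malformed input: the colon is followed directly by a newline.
$text = "1652855102:`nnext line"

# \s matches the newline, so the line break is replaced and the two lines merge.
$loose = $text -replace '(?m)^(\d{10}):\s', '$1|'

# A literal space does not match the newline, so the input is left untouched.
$strict = $text -replace '(?m)^(\d{10}): ', '$1|'
```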

Finally, note that one capture group is sufficient: you needn't capture the separator string, given that you're replacing it with a fixed character, and you needn't capture the remainder of the string, given that it should remain unchanged.

Therefore (note that inline option m (Multiline) is still needed to make ^ match the start of each line):

(Get-Content -Raw ..\sample-mosquitto.log) -replace '(?m)^(\d{10}): ', '$1|'
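
The same command can be tried without a file by using an in-memory string as a stand-in for the output of Get-Content -Raw. The sample lines below come from the question; the variable names are only illustrative.

```powershell
# Stand-in for (Get-Content -Raw ..\sample-mosquitto.log): one multi-line string.
$raw = "1652855102: New connection from xx.xx.xx.xx on port xxxx.`n1652855102: Sending PUBACK to 16838547124974742985 (m235, rc0)`n"

# (?m) makes ^ match at the start of every line; only the timestamp is captured,
# and the ': ' separator after it is replaced with '|'.
$converted = $raw -replace '(?m)^(\d{10}): ', '$1|'

$converted
```

If the result should go back to disk, piping it to Set-Content with -NoNewline is one option, since the string read with -Raw already carries its own line endings.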

Re performance:

Assuming the file fits into memory as a whole (which is typically true even for large text files), using the -Raw switch to read the file into a single, multi-line string is indeed the fastest way to process it. By contrast, Get-Content without -Raw emits the lines one by one, which is not only inherently slower, but is slowed down further by each line being decorated with metadata - see GitHub issue #7537 for a discussion about a potential opt-out.

However - again assuming that the file fits into memory as a whole - there is a way to speed up Get-Content's line-by-line processing, namely via -ReadCount 0, which causes all lines to be emitted as a single array (which only as a whole is decorated with metadata):

# Slower than -Raw, but reasonably fast line-by-line processing due
# to -ReadCount 0
(Get-Content -ReadCount 0 ..\sample-mosquitto.log) -replace '^(\d{10}): ', '$1|'
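
To compare the three approaches on your own machine, a rough timing sketch along the following lines may help. The temp-file name and the line count of 20000 are arbitrary assumptions for illustration, and actual timings will vary with PowerShell version and hardware.

```powershell
# Build a throwaway sample file in the temp directory (name and size are arbitrary).
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'sample-mosquitto-timing.log'
1..20000 |
    ForEach-Object { "1652855102: Sending PUBACK to client$_ (m235, rc0)" } |
    Set-Content -Path $path

$pattern = '^(\d{10}): '

# Whole file as one string; (?m) is needed so that ^ matches at every line start.
$rawTime = Measure-Command { $null = (Get-Content -Raw $path) -replace ('(?m)' + $pattern), '$1|' }

# All lines emitted as a single array.
$arrayTime = Measure-Command { $null = (Get-Content -ReadCount 0 $path) -replace $pattern, '$1|' }

# Default behavior: lines are emitted one by one.
$lineTime = Measure-Command { $null = (Get-Content $path) -replace $pattern, '$1|' }

'{0,-14} {1,8:N0} ms' -f '-Raw', $rawTime.TotalMilliseconds
'{0,-14} {1,8:N0} ms' -f '-ReadCount 0', $arrayTime.TotalMilliseconds
'{0,-14} {1,8:N0} ms' -f 'line by line', $lineTime.TotalMilliseconds

# Clean up the sample file.
Remove-Item $path
```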