Cleaning multiple lines in a text file from not needed characters

Time:11-01

I am trying to strip repeated spaces and other unwanted characters from some text files. I want only the text within the double quotes to remain on each line.

Here is the example of the text file:

   "uid" : "Text To Remain", 
 "id" : "Text2 To Stay",

Note the spaces/tabs at the beginning of each line and the comma at the end of each line.

So I thought the easiest way to get rid of the whitespace on the left would be a regular expression. Every line contains a space-colon-space string, " : ", so I try to erase everything to the left of it, including the string itself.

I came up with two examples of solutions:

get-content 'K:\text.txt' -ReadCount 1000 |
 ForEach-Object {
 $_.replace(".* : ", "").replace(",", "")
 } |
  Out-File 'K:\text_cleaned.txt'

This solution works only for the comma, but does not work for the colon. There is no error.

Second solution:

get-content 'K:\text.txt' -ReadCount 1000 |
 foreach { $_ -replace ".* : " |  out-file 'K:\text_cleaned.txt'
}

This works and removes everything to the left of the first double-quote character, but I have no idea how to also remove the comma at the end of each line in the same statement.

Is there a simpler way to do this?

I am very curious why the regular expression /.* : / in the first solution does not work, while the one in the second does work. And there is no error in the first one.

Could you enlighten me?

CodePudding user response:

Try the following:

(Get-Content 'K:\text.txt' -ReadCount 0) -replace '.+ : "|",\s*$' |
   Out-File 'K:\text_cleaned.txt'

Output:

Text To Remain
Text2 To Stay
  • -ReadCount 0 reads the entire file into a single array, at once, which greatly speeds up processing.

  • The -replace operation removes everything from the start of each line through the " that follows the  :  substring, as well as the closing " if it is followed by a , and optionally by whitespace at the end of the line. Because no replacement operand is given, each match is replaced with the empty string.

Note: The assumption is that verbatim substring  : " only occurs between "..." strings, not also embedded in them, say as in "Foo "" : "" bar"
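To see the two regex alternatives at work without touching any file, here is a minimal sketch that applies the same -replace operation to an in-memory array. The sample strings are copied from the question; the variable names $lines and $cleaned are only illustrative.

```powershell
# Sample lines mirroring the question's input:
# leading whitespace, a trailing comma, and (on the first line) a trailing space.
$lines = @(
    '   "uid" : "Text To Remain", '
    ' "id" : "Text2 To Stay",'
)

# Alternative 1 ( .+ : " ) removes everything through the opening quote of the value.
# Alternative 2 ( ",\s*$ ) removes the closing quote, the comma, and trailing whitespace.
# With an array on the left-hand side, -replace processes each element separately.
$cleaned = $lines -replace '.+ : "|",\s*$'

$cleaned
# Text To Remain
# Text2 To Stay
```

Applied to (Get-Content ... -ReadCount 0), the operator works the same way, since that call also yields an array of lines.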


As for what you tried:

$_.replace(".* : ", "")

  • The .Replace() method of the .NET [string] type only performs literal (verbatim) substitutions, so an attempt to use a regex cannot work.

  • By contrast, PowerShell's -replace operator is regex-based. Also note that, unlike the .Replace() method, it is case-insensitive by default (though you may use its -creplace variant for case-sensitive replacements).

See this answer for more information and guidance on when to use -replace vs. .Replace().
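The difference can be checked directly on one of the sample lines. The following is a small sketch, with illustrative variable names, that also shows one way to remove the trailing comma in the same statement by chaining -replace operations, as asked in the question:

```powershell
$line = '   "uid" : "Text To Remain", '

# .Replace() searches for the literal characters '.* : ', which never occur
# in the line, so only the comma is removed.
$literal = $line.Replace('.* : ', '').Replace(',', '')
# $literal is '   "uid" : "Text To Remain" '

# -replace interprets the pattern as a regex, and several -replace
# operations can be chained in a single expression.
$regex = $line -replace '.* : ' -replace ',\s*$'
# $regex is '"Text To Remain"'

# Case sensitivity: -replace ignores case by default, -creplace does not,
# and .Replace() with two string arguments is case-sensitive.
$insensitive = 'Text2 To Stay' -replace 'text2', 'X'    # 'X To Stay'
$sensitive   = 'Text2 To Stay' -creplace 'text2', 'X'   # unchanged
$method      = 'Text2 To Stay'.Replace('text2', 'X')    # unchanged
```

Note that the chained version keeps the surrounding double quotes, just as the second attempt in the question does; removing them as well requires a pattern such as the one in the answer above.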
