I am trying to clean up some text files by removing runs of spaces and other unwanted characters. I want only the text within double quotes to remain in each line.
Here is an example of the text file:
    "uid" : "Text To Remain",
    "id" : "Text2 To Stay",
Note the spaces / tabs at the beginning of each line and the comma at the end of each line.
I thought the easiest way to get rid of the whitespace on the left would be a regular expression. Every line contains a space-colon-space string, " : ", so I try to erase everything to the left of it, including the string itself.
I came up with two attempts:
get-content 'K:\text.txt' -ReadCount 1000 |
ForEach-Object {
$_.replace(".* : ", "").replace(",", "")
} |
Out-File 'K:\text_cleaned.txt'
This removes the commas but does nothing about the text before the colon, and there is no error.
Second solution:
get-content 'K:\text.txt' -ReadCount 1000 |
foreach { $_ -replace ".* : " | out-file 'K:\text_cleaned.txt'
}
This works and removes everything to the left of the first remaining double-quote character, but I have no idea how to also remove the comma at the end of each line in the same statement.
Is there a simpler way to do it?
I am also very curious why the regular expression ".* : " does not work in the first solution while it does work in the second, and why the first one raises no error.
Could you enlighten me?
CodePudding user response:
Try the following:
(Get-Content 'K:\text.txt' -ReadCount 0) -replace '^.* : "|",\s*$' |
Out-File 'K:\text_cleaned.txt'
Output:
Text To Remain
Text2 To Stay
- -ReadCount 0 reads the entire file into a single array, at once, which greatly speeds up processing.
- The -replace operation effectively removes all characters from the start of each line through the double quote (") that follows the space-colon-space substring, as well as the last double quote if it is followed by a comma and, optionally, whitespace at the end of the line.
- For an explanation of the regex and the ability to experiment with it, see this regex101.com page.
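To see the two alternatives of such a pattern at work without touching a file, here is a minimal sketch that applies '^.* : "|",\s*$' to two in-memory sample lines. The sample strings and the variable names $lines and $cleaned are made up for illustration; they only mirror the shape of the input shown in the question.

```powershell
# Hypothetical sample lines: leading spaces or a tab, and a trailing comma.
$lines = @(
    '    "uid" : "Text To Remain",'
    "`t" + '"id" : "Text2 To Stay",'
)

# Alternative 1: from the start of the line through the quote after ' : '.
# Alternative 2: the closing quote, the comma, and any trailing whitespace.
# With no replacement operand, -replace substitutes the empty string.
$cleaned = $lines -replace '^.* : "|",\s*$'

$cleaned
```

Because -replace is applied to an array, it processes each element separately and returns an array of results, which is why no explicit loop is needed.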
Note: The assumption is that the verbatim substring  : " (space, colon, space, double quote) only occurs between "..." strings, not also embedded in them, as in "Foo "" : "" bar".
As for what you tried:
$_.replace(".* : ", "")
The .Replace() method of the .NET [string] type only performs literal (verbatim) substitutions, so an attempt to use a regex cannot work. By contrast, PowerShell's -replace operator is regex-based. Also note that, unlike the .Replace() method, it is case-insensitive by default (though you may use its -creplace variant for case-sensitive replacements).
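The difference can be checked on a single string. The sketch below uses a made-up sample line and illustrative variable names; it also shows that -replace operations can be chained, which is one way to remove the comma in the same statement as in the second attempt.

```powershell
# Hypothetical sample line in the shape of the question's input.
$line = '    "uid" : "Text To Remain",'

# .Replace() looks for the literal text '.* : ', which does not occur,
# so only the comma replacement has any effect and no error is raised.
$literal = $line.Replace('.* : ', '').Replace(',', '')

# -replace interprets the pattern as a regex; chaining a second -replace
# removes the comma in the same expression.
$regex = $line -replace '.* : ' -replace ','

$literal   # '    "uid" : "Text To Remain"'
$regex     # '"Text To Remain"'

# Case sensitivity: -replace ignores case, -creplace and .Replace() do not.
$insensitive = 'ABC' -replace 'b', 'x'    # 'AxC'
$sensitive   = 'ABC' -creplace 'b', 'x'   # 'ABC'
$method      = 'ABC'.Replace('b', 'x')    # 'ABC'
```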
See this answer for more information and guidance on when to use -replace vs. .Replace().