Introduction
Hi! I'm trying to extract JSON from a 300K-line text file that mixes plain-text output with JSON from HTTP results. At that size, pulling the JSON out by hand is not practical.
Problem
I don't have much choice; I probably need to clean the file up with a command-line tool. Here's what the file looks like:
[2K 100.00% - C: 164148 / 164149 - S: 263 - F: 3686 - dhcp-140-247-148-215.fas.harvard.edu:443 - id3.sshws.me
[2K 100.00% - C: 164149 / 164149 - S: 263 - F: 3686 - public-1300503051.cos.ap-shanghai.myqcloud.com:443 - id3.sshws.me
[2K
[
  {
    "Request": {
      "ProxyHost": "pro.ant.design",
      "ProxyPort": 443,
      "Bug": "pro.ant.design",
      "Method": "HEAD",
      "Target": "id3.sshws.me",
      "Payload": "GET wss://pro.ant.design/ HTTP/1.1[crlf]Host: [host][crlf]Upgrade: websocket[crlf][crlf]"
    },
    "ResponseLine": [
      "HTTP/1.1 101 Switching Protocol",
      "Server: cloudflare"
    ]
  },
  {
    "Request": {
      "ProxyHost": "industrialtech.ft.com",
      "ProxyPort": 443,
      "Bug": "industrialtech.ft.com",
      "Method": "HEAD",
      "Target": "id3.sshws.me",
      "Payload": "GET wss://industrialtech.ft.com/ HTTP/1.1[crlf]Host: [host][crlf]Upgrade: websocket[crlf][crlf]"
    },
    "ResponseLine": [
      "HTTP/1.1 101 Switching Protocol",
      "Server: cloudflare"
    ]
  }
]
Using a regex for this runs into several problems:
- The file contains multiple JSON objects.
- The text lines that are not part of the JSON also contain [ and :
I noticed the problem when trying to use a sed regex:
sed '/^[/,/^]/!d'
CodePudding user response:
You can remove all lines that start with [ followed by any non-whitespace char:
sed '/^\[[^[:space:]]/d' file > newfile
Details:
^ - start of a line
\[ - a [ char
[^[:space:]] - any non-whitespace char.
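As a quick check of the command above, here is a minimal sketch run against a few made-up lines shaped like the excerpt in the question. The helper name strip_progress and the sample lines are assumptions for illustration only, and the sketch assumes the progress lines really begin with a literal [ in the file.

```shell
# strip_progress is a hypothetical helper name; it only wraps the sed command above.
strip_progress() {
  # Delete every line that starts with "[" immediately followed by a non-whitespace char.
  sed '/^\[[^[:space:]]/d' "$@"
}

# Made-up sample: two progress lines followed by a tiny JSON array.
printf '%s\n' \
  '[2K 100.00% - C: 164149 / 164149 - S: 263 - F: 3686 - example.com:443 - id3.sshws.me' \
  '[2K' \
  '[' \
  '{' \
  '"ProxyPort": 443' \
  '}' \
  ']' | strip_progress
```

The lone [ that opens the JSON array survives because nothing follows it on that line. A JSON line that starts at column 0 with [ followed directly by another character, such as [], would also be deleted, so this relies on the opening bracket sitting on its own line.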
CodePudding user response:
An alternative is to take advantage of the special characters. If you want to remove the progress bar from the output and keep only the relevant output:
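- Open the file with
nano <output_file>
- The progress lines start with control characters that nano displays as
^M^[
These are a carriage return (CR, shown as ^M) followed by an escape character (ESC, shown as ^[), not the literal [crlf] text in the payload.
- Use
sed -e '/\r\x1b/d' <output_file>
to remove the lines that contain those characters. ^M^[ is only how nano displays them, so the pattern has to match the real bytes; GNU sed understands \r and \x1b for this.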
Make sure to look for the pattern with a terminal tool rather than a GUI text editor, since some editors do not display control characters.
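The control-character approach can be sketched as follows, assuming each progress line contains a carriage return immediately followed by ESC (what nano shows as ^M^[). The helper names strip_control_lines and sample, and the sample data, are hypothetical. The bytes are built with printf so the sketch does not depend on GNU sed escapes or bash-only quoting.

```shell
# Build the real CR and ESC bytes; printf understands \r and the octal escape \033.
cr=$(printf '\r')
esc=$(printf '\033')

# strip_control_lines is a hypothetical helper name for this sketch.
strip_control_lines() {
  # Delete every line that contains CR immediately followed by ESC.
  sed -e "/${cr}${esc}/d" "$@"
}

# Made-up sample: one progress line starting with CR + ESC[2K, then a tiny JSON array.
sample() {
  printf '\r\033[2K 100.00%% - C: 1 / 1 - example.com:443\n'
  printf '%s\n' '[' '{' '"ProxyPort": 443' '}' ']'
}

# cat -v makes the control characters visible in the terminal (as ^M^[).
sample | cat -v | head -n 1

# Only the JSON lines remain after filtering.
sample | strip_control_lines
```

If the opening [ of the JSON happens to share a line with the last progress update, that line is deleted too, so it is worth checking the boundary with cat -v first.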