Extract JSON from a Large Text File

Time:05-04

Introduction

Hi! I'm trying to extract JSON from a 300K-line text file that mixes plain-text output with JSON from HTTP results. The file is too large to pick out the JSON by hand.

Problem

I don't have much choice; I probably need to fix it from the command line. Here is what the file looks like:

[2K  100.00% - C: 164148 / 164149 - S: 263 - F: 3686 - dhcp-140-247-148-215.fas.harvard.edu:443 - id3.sshws.me
[2K  100.00% - C: 164149 / 164149 - S: 263 - F: 3686 - public-1300503051.cos.ap-shanghai.myqcloud.com:443 - id3.sshws.me
[2K
[
  {
    "Request": {
      "ProxyHost": "pro.ant.design",
      "ProxyPort": 443,
      "Bug": "pro.ant.design",
      "Method": "HEAD",
      "Target": "id3.sshws.me",
      "Payload": "GET wss://pro.ant.design/ HTTP/1.1[crlf]Host: [host][crlf]Upgrade: websocket[crlf][crlf]"
    },
    "ResponseLine": [
      "HTTP/1.1 101 Switching Protocol",
      "Server: cloudflare"
    ]
  },
  {
    "Request": {
      "ProxyHost": "industrialtech.ft.com",
      "ProxyPort": 443,
      "Bug": "industrialtech.ft.com",
      "Method": "HEAD",
      "Target": "id3.sshws.me",
      "Payload": "GET wss://industrialtech.ft.com/ HTTP/1.1[crlf]Host: [host][crlf]Upgrade: websocket[crlf][crlf]"
    },
    "ResponseLine": [
      "HTTP/1.1 101 Switching Protocol",
      "Server: cloudflare"
    ]
  }
]

Several problems arise when using a regex:

  • The file contains multiple JSON objects

  • The text that is not part of the JSON also contains [ and :

I ran into the problem when trying a sed regex:

sed '/^[/,/^]/!d'
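
In the command above, the unescaped [ opens a bracket expression, so sed does not read the addresses as a range from a [ line to a ] line. A minimal sketch of the range idea with the bracket escaped and both patterns anchored to whole lines follows. It assumes the JSON array opens on a line containing only [ and closes on a line containing only ], as in the excerpt; sample.txt is a hypothetical file name used only for this demonstration.

```shell
# Build a small hypothetical sample: progress lines followed by a JSON array.
printf '%s\n' \
  '[2K  100.00% - C: 1 / 1 - S: 0 - F: 0 - example.com:443 - id3.sshws.me' \
  '[2K' \
  '[' \
  '  {' \
  '    "ResponseLine": [' \
  '      "Server: cloudflare"' \
  '    ]' \
  '  }' \
  ']' > sample.txt

# -n suppresses default output; p prints only the lines inside the range.
# ^\[$ matches a line that is exactly "[", and ^]$ a line that is exactly "]",
# so "[2K ..." lines and the indented inner brackets never open or close the range.
extracted=$(sed -n '/^\[$/,/^]$/p' sample.txt)
printf '%s\n' "$extracted"
```

Because the inner arrays are indented, only the outermost brackets match the anchored patterns. If the real file has carriage returns or other bytes on the bracket lines, the patterns would need adjusting.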

CodePudding user response:

You can remove all lines that start with [ and any non-whitespace char:

sed '/^\[[^[:space:]]/d' file > newfile

Details:

  • ^ - start of a line
  • \[ - [ char
  • [^[:space:]] - any single non-whitespace character.
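
As a quick check of which lines this pattern keeps, here is a small sketch on a hypothetical four-line input. It assumes the progress lines begin with a literal [ as displayed in the question; if they actually begin with a control character, the ^ anchor would not match them.

```shell
# Hypothetical input: two progress lines, then the first two lines of the JSON array.
input=$(printf '%s\n' '[2K  100.00% - C: 1 / 1' '[2K' '[' '  {')

# Lines starting with "[" plus a non-whitespace character are deleted.
# The bare "[" survives because nothing follows the bracket on that line.
kept=$(printf '%s\n' "$input" | sed '/^\[[^[:space:]]/d')
printf '%s\n' "$kept"
```

Only the bare [ and the indented { remain, which is why the opening bracket of the JSON array is preserved.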

CodePudding user response:

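An alternative is to take advantage of the control characters in the file. To remove the progress bar from the output and keep only the relevant lines:

  1. Open the file with nano <output_file>.
  2. The progress text starts with ^M^[. This is caret notation for two control characters: ^M is a carriage return and ^[ is the escape character that begins the ANSI sequence [2K (erase line). It is not the literal [crlf] placeholder seen in the payload.
  3. Use sed -e '/\r\x1b/d' <output_file> to remove the lines that contain those control characters.

The \r and \x1b escapes are GNU sed extensions that match the actual bytes. Escaping ^ with \ would only match a literal caret, which is not what the file contains.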

Always look for the pattern from the terminal rather than in a graphical text editor, as some editors do not display control characters.
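
One way to inspect the bytes from the terminal is od -c, which prints every byte of the input. The sketch below assumes the progress lines contain the escape character followed by [2K; raw.txt is a hypothetical file name, and the sample line is made up for illustration.

```shell
# Hypothetical sample: one progress line prefixed with CR + ESC[2K, then a JSON line.
printf '\r\033[2K  100.00%% - C: 1 / 1\n[\n' > raw.txt

# od -c shows the carriage return as \r and the escape byte as 033.
od -c raw.txt

# Store the real ESC byte in a shell variable so the sed pattern does not
# rely on GNU-only escapes such as \x1b.
esc=$(printf '\033')
cleaned=$(sed "/${esc}\\[2K/d" raw.txt)
printf '%s\n' "$cleaned"
```

Building the pattern from a real escape byte keeps the command usable with sed implementations that do not understand \x1b.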
