How to remove "\n" & "\" from a string & convert that string to JSON

Time:11-03

I want to convert this data into JSON or a Python dictionary.

What would be the best way to do this, given that I have at least 3GB of data like this?

"{\"domain\":\"defb2b00f609c6bb8fcfd43af70c146bc8a26036f800e8f7563bc366fb88aa1f\",\"path\":\"/\",\"scope\":\"domain\",\"categories\":[9,66],\"confidence\":0.90371,\"tier\":2}\n{\"domain\":\"e7378a78724fcd254b59764f451be766d0e1c6683eac9aa3d5f29798600d91af\",\"path\":\"/\",\"scope\":\"domain\",\"categories\":[9,59],\"confidence\":0.90371,\"tier\":2}\n{\"domain\":\"5f616b8a7b283395961018da6ac75a563efdfcf743ce7e1cd1bcbec0a23a5349\",\"path\":\"/\",\"scope\":\"domain\",\"categories\":[18],\"confidence\":0.70767,\"tier\":2}\n{\"domain\":\"4e219482bd58e2c7d91c55e52aa5db37785f29314e4db9c319ab9edb9ee5de1e\",\"path\":\"/\",\"scope\":\"domain\",\"categories\":[9],\"confidence\":0.8198000000000001,\"tier\":2}\n{\"domain\":\"e3ad60e6f8786da0253d8ce00fcb90ee5bf497a75c5b42753acba203800ad6fe\",\"path\":\"/\",\"scope\":\"domain\",\"categories\":[9],\"confidence\":0.8198000000000001,\"tier\":2}\n{\"domain\":\"49ae5ad8cc8de0136f3f99a1330710a14912fb743578d4ce39318281979162b4\",\"path\":\"/\",\"scope\":\"domain\",\"categories\":[26],\"confidence\":0.9594,\"tier\":2}\n{\"domain\":\"f93af67b58299d7317841de70464624ffc0190b93a4af860dcc58038162a30cf\",\"path\":\"/\",\"scope\":\"domain\",\"categories\":[9,62],\"confidence\":0.70767,\"tier\":2}\n{\"domain\":\"c9356044593f00f2b779cbd59246695e654bf1f105a265867d4f06c1bb6a2ea2\",\"path\":\"/\",\"scope\":\"domain\",\"categories\":[66,68],\"confidence\":0.70767,\"tier\":2}\n{\"domain\":\"7920da2f3dc7de646d3434d467ffccdd8ca31115c54529bc2a4d758896ae1a19\",\"path\":\"/\",\"scope\":\"domain\",\"categories\":[4,74],\"confidence\":0.8198000000000001,\"tier\":2}\n{\"domain\":\"08a4027677824509beee405482a5f1a5f4feddbf0aafbf4b649c2909732f6909\",\"path\":\"/\",\"scope\":\"domain\",\"categories\":[75],\"confidence\":0.8198000000000001,\"tier\":2}\n"

CodePudding user response:

I suggest fixing whatever is generating the data. Otherwise, split the string on newlines and parse each line as JSON:

>>> import json
>>> s = "... your string ..."
>>> data = [json.loads(x) for x in s.splitlines()]
>>> for x in data:
...   print(x['tier'])
2
2
2
2
2
2
2
2
2
2

https://jsonlines.org/
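
If the text on disk really does contain the outer quotes and the backslashes, it looks like JSON Lines that was JSON-encoded a second time. Under that assumption (the thread does not confirm how the data was produced), a minimal sketch is to let json.loads undo the outer encoding first and then parse each line. The two-record sample below is a shortened, hypothetical stand-in for the question's data:

```python
import json

# Hypothetical sample in the same shape as the question's data:
# a JSON string literal whose content is JSON Lines.
raw = r'"{\"domain\":\"abc\",\"tier\":2}\n{\"domain\":\"def\",\"tier\":2}\n"'

# First pass: undo the outer string encoding (\" -> ", \n -> newline).
inner = json.loads(raw)

# Second pass: one JSON document per non-empty line.
records = [json.loads(line) for line in inner.splitlines() if line.strip()]
print(records)
```

This avoids hand-written escape handling, since the JSON parser already knows every escape that a JSON string may contain.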

CodePudding user response:

You can convert the string to bytes, then decode it with the unicode_escape codec to remove the escapes. The codec interprets backslash sequences the way a Python string literal would:

unescaped = bytes(s, 'utf-8').decode('unicode_escape')

Then parse it like any other JSON Lines (JSONL) text, skipping the empty string left by the trailing newline:

[json.loads(line) for line in unescaped.splitlines() if line]

Output:

[
    {'domain': 'defb2b00f609c6bb8fcfd43af70c146bc8a26036f800e8f7563bc366fb88aa1f', 'path': '/', 'scope': 'domain', 'categories': [9, 66], 'confidence': 0.90371, 'tier': 2},
    {'domain': 'e7378a78724fcd254b59764f451be766d0e1c6683eac9aa3d5f29798600d91af', 'path': '/', 'scope': 'domain', 'categories': [9, 59], 'confidence': 0.90371, 'tier': 2},
    {'domain': '5f616b8a7b283395961018da6ac75a563efdfcf743ce7e1cd1bcbec0a23a5349', 'path': '/', 'scope': 'domain', 'categories': [18], 'confidence': 0.70767, 'tier': 2},
    {'domain': '4e219482bd58e2c7d91c55e52aa5db37785f29314e4db9c319ab9edb9ee5de1e', 'path': '/', 'scope': 'domain', 'categories': [9], 'confidence': 0.8198000000000001, 'tier': 2},
    {'domain': 'e3ad60e6f8786da0253d8ce00fcb90ee5bf497a75c5b42753acba203800ad6fe', 'path': '/', 'scope': 'domain', 'categories': [9], 'confidence': 0.8198000000000001, 'tier': 2},
    {'domain': '49ae5ad8cc8de0136f3f99a1330710a14912fb743578d4ce39318281979162b4', 'path': '/', 'scope': 'domain', 'categories': [26], 'confidence': 0.9594, 'tier': 2},
    {'domain': 'f93af67b58299d7317841de70464624ffc0190b93a4af860dcc58038162a30cf', 'path': '/', 'scope': 'domain', 'categories': [9, 62], 'confidence': 0.70767, 'tier': 2},
    {'domain': 'c9356044593f00f2b779cbd59246695e654bf1f105a265867d4f06c1bb6a2ea2', 'path': '/', 'scope': 'domain', 'categories': [66, 68], 'confidence': 0.70767, 'tier': 2},
    {'domain': '7920da2f3dc7de646d3434d467ffccdd8ca31115c54529bc2a4d758896ae1a19', 'path': '/', 'scope': 'domain', 'categories': [4, 74], 'confidence': 0.8198000000000001, 'tier': 2},
    {'domain': '08a4027677824509beee405482a5f1a5f4feddbf0aafbf4b649c2909732f6909', 'path': '/', 'scope': 'domain', 'categories': [75], 'confidence': 0.8198000000000001, 'tier': 2}
]
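
One caveat with this approach: unicode_escape treats the bytes it is given as Latin-1, so encoding to UTF-8 first garbles any non-ASCII character. The sample data is pure ASCII, so it is unaffected. A small sketch with a hypothetical record containing "é", plus a commonly used workaround that encodes with latin-1 and backslashreplace instead:

```python
import json

# Hypothetical record with a non-ASCII character and literal backslashes.
s = r'{\"name\":\"café\"}'

# UTF-8 bytes read back as Latin-1: "é" turns into two wrong characters.
garbled = bytes(s, 'utf-8').decode('unicode_escape')

# Latin-1 with backslashreplace keeps every character recoverable.
safer = s.encode('latin-1', 'backslashreplace').decode('unicode_escape')

print(garbled)
print(json.loads(safer))
```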

CodePudding user response:

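We can use ast.literal_eval() to safely evaluate the dictionaries contained in the string. Replacing each newline with a comma turns the text into a comma-separated sequence of dict literals, which literal_eval() returns as a tuple.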

>>> raw_string = "{\"domain\":\"defb2b00f609c6bb8fcfd43af70c146bc8a26036f800e8f7563bc366fb88aa1f\",\"path\":\"/\",\"scope\":\"domain\",\"categories\":[9,66],\"confidence\":0.90371,\"tier\":2}\n{\"domain\":\"e7378a78724fcd254b59764f451be766d0e1c6683eac9aa3d5f29798600d91af\",\"path\":\"/\",\"scope\":\"domain\",\"categories\":[9,59],\"confidence\":0.90371,\"tier\":2}\n{\"domain\":\"5f616b8a7b283395961018da6ac75a563efdfcf743ce7e1cd1bcbec0a23a5349\",\"path\":\"/\",\"scope\":\"domain\",\"categories\":[18],\"confidence\":0.70767,\"tier\":2}\n{\"domain\":\"4e219482bd58e2c7d91c55e52aa5db37785f29314e4db9c319ab9edb9ee5de1e\",\"path\":\"/\",\"scope\":\"domain\",\"categories\":[9],\"confidence\":0.8198000000000001,\"tier\":2}\n{\"domain\":\"e3ad60e6f8786da0253d8ce00fcb90ee5bf497a75c5b42753acba203800ad6fe\",\"path\":\"/\",\"scope\":\"domain\",\"categories\":[9],\"confidence\":0.8198000000000001,\"tier\":2}\n{\"domain\":\"49ae5ad8cc8de0136f3f99a1330710a14912fb743578d4ce39318281979162b4\",\"path\":\"/\",\"scope\":\"domain\",\"categories\":[26],\"confidence\":0.9594,\"tier\":2}\n{\"domain\":\"f93af67b58299d7317841de70464624ffc0190b93a4af860dcc58038162a30cf\",\"path\":\"/\",\"scope\":\"domain\",\"categories\":[9,62],\"confidence\":0.70767,\"tier\":2}\n{\"domain\":\"c9356044593f00f2b779cbd59246695e654bf1f105a265867d4f06c1bb6a2ea2\",\"path\":\"/\",\"scope\":\"domain\",\"categories\":[66,68],\"confidence\":0.70767,\"tier\":2}\n{\"domain\":\"7920da2f3dc7de646d3434d467ffccdd8ca31115c54529bc2a4d758896ae1a19\",\"path\":\"/\",\"scope\":\"domain\",\"categories\":[4,74],\"confidence\":0.8198000000000001,\"tier\":2}\n{\"domain\":\"08a4027677824509beee405482a5f1a5f4feddbf0aafbf4b649c2909732f6909\",\"path\":\"/\",\"scope\":\"domain\",\"categories\":[75],\"confidence\":0.8198000000000001,\"tier\":2}\n"

>>> modified_string = raw_string.replace('\\','').replace('\n',',')

>>> import ast
>>> dicts = ast.literal_eval(modified_string)      # a tuple of dictionaries

>>> for d in dicts:
...     print(d)
... 
{'domain': 'defb2b00f609c6bb8fcfd43af70c146bc8a26036f800e8f7563bc366fb88aa1f', 'path': '/', 'scope': 'domain', 'categories': [9, 66], 'confidence': 0.90371, 'tier': 2}
{'domain': 'e7378a78724fcd254b59764f451be766d0e1c6683eac9aa3d5f29798600d91af', 'path': '/', 'scope': 'domain', 'categories': [9, 59], 'confidence': 0.90371, 'tier': 2}
{'domain': '5f616b8a7b283395961018da6ac75a563efdfcf743ce7e1cd1bcbec0a23a5349', 'path': '/', 'scope': 'domain', 'categories': [18], 'confidence': 0.70767, 'tier': 2}
{'domain': '4e219482bd58e2c7d91c55e52aa5db37785f29314e4db9c319ab9edb9ee5de1e', 'path': '/', 'scope': 'domain', 'categories': [9], 'confidence': 0.8198000000000001, 'tier': 2}
{'domain': 'e3ad60e6f8786da0253d8ce00fcb90ee5bf497a75c5b42753acba203800ad6fe', 'path': '/', 'scope': 'domain', 'categories': [9], 'confidence': 0.8198000000000001, 'tier': 2}
{'domain': '49ae5ad8cc8de0136f3f99a1330710a14912fb743578d4ce39318281979162b4', 'path': '/', 'scope': 'domain', 'categories': [26], 'confidence': 0.9594, 'tier': 2}
{'domain': 'f93af67b58299d7317841de70464624ffc0190b93a4af860dcc58038162a30cf', 'path': '/', 'scope': 'domain', 'categories': [9, 62], 'confidence': 0.70767, 'tier': 2}
{'domain': 'c9356044593f00f2b779cbd59246695e654bf1f105a265867d4f06c1bb6a2ea2', 'path': '/', 'scope': 'domain', 'categories': [66, 68], 'confidence': 0.70767, 'tier': 2}
{'domain': '7920da2f3dc7de646d3434d467ffccdd8ca31115c54529bc2a4d758896ae1a19', 'path': '/', 'scope': 'domain', 'categories': [4, 74], 'confidence': 0.8198000000000001, 'tier': 2}
{'domain': '08a4027677824509beee405482a5f1a5f4feddbf0aafbf4b649c2909732f6909', 'path': '/', 'scope': 'domain', 'categories': [75], 'confidence': 0.8198000000000001, 'tier': 2}
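
This works here because every value in the sample is also a valid Python literal. As a hedged note, JSON's true, false and null are not Python literals, so ast.literal_eval() would reject records that contain them, while json.loads() accepts them. A short sketch with a hypothetical record:

```python
import ast
import json

# Hypothetical record using JSON-only literals (not from the question's data).
line = '{"domain": "abc", "active": true, "note": null}'

try:
    ast.literal_eval(line)
    literal_eval_ok = True
except ValueError:
    # "true" and "null" are plain names to the Python parser, not literals.
    literal_eval_ok = False

parsed = json.loads(line)
print(literal_eval_ok, parsed)
```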

CodePudding user response:

Use regex for this:

import re
import json

regex = r"\\(\")|\\n"

test_str = ("\"{\\\"domain\\\":\\\"defb2b00f609c6bb8fcfd43af70c146bc8a26036f800e8f7563bc366fb88aa1f\\\",\\\"path\\\":\\\"/\\\",\\\"scope\\\":\\\"domain\\\",\\\"categories\\\":[9,66],\\\"confidence\\\":0.90371,\\\"tier\\\":2}\\n{\\\"domain\\\":\\\"e7378a78724fcd254b59764f451be766d0e1c6683eac9aa3d5f29798600d91af\\\",\\\"path\\\":\\\"/\\\",\\\"scope\\\":\\\"domain\\\",\\\"categories\\\":[9,59],\\\"confidence\\\":0.90371,\\\"tier\\\":2}\\n{\\\"domain\\\":\\\"5f616b8a7b283395961018da6ac75a563efdfcf743ce7e1cd1bcbec0a23a5349\\\",\\\"path\\\":\\\"/\\\",\\\"scope\\\":\\\"domain\\\",\\\"categories\\\":[18],\\\"confidence\\\":0.70767,\\\"tier\\\":2}\\n{\\\"domain\\\":\\\"4e219482bd58e2c7d91c55e52aa5db37785f29314e4db9c319ab9edb9ee5de1e\\\",\\\"path\\\":\\\"/\\\",\\\"scope\\\":\\\"domain\\\",\\\"categories\\\":[9],\\\"confidence\\\":0.8198000000000001,\\\"tier\\\":2}\\n{\\\"domain\\\":\\\"e3ad60e6f8786da0253d8ce00fcb90ee5bf497a75c5b42753acba203800ad6fe\\\",\\\"path\\\":\\\"/\\\",\\\"scope\\\":\\\"domain\\\",\\\"categories\\\":[9],\\\"confidence\\\":0.8198000000000001,\\\"tier\\\":2}\\n{\\\"domain\\\":\\\"49ae5ad8cc8de0136f3f99a1330710a14912fb743578d4ce39318281979162b4\\\",\\\"path\\\":\\\"/\\\",\\\"scope\\\":\\\"domain\\\",\\\"categories\\\":[26],\\\"confidence\\\":0.9594,\\\"tier\\\":2}\\n{\\\"domain\\\":\\\"f93af67b58299d7317841de70464624ffc0190b93a4af860dcc58038162a30cf\\\",\\\"path\\\":\\\"/\\\",\\\"scope\\\":\\\"domain\\\",\\\"categories\\\":[9,62],\\\"confidence\\\":0.70767,\\\"tier\\\":2}\\n{\\\"domain\\\":\\\"c9356044593f00f2b779cbd59246695e654bf1f105a265867d4f06c1bb6a2ea2\\\",\\\"path\\\":\\\"/\\\",\\\"scope\\\":\\\"domain\\\",\\\"categories\\\":[66,68],\\\"confidence\\\":0.70767,\\\"tier\\\":2}\\n{\\\"domain\\\":\\\"7920da2f3dc7de646d3434d467ffccdd8ca31115c54529bc2a4d758896ae1a19\\\",\\\"path\\\":\\\"/\\\",\\\"scope\\\":\\\"domain\\\",\\\"categories\\\":[4,74],\\\"confidence\\\":0.8198000000000001,\\\"tier\\\":2}\\n{\\\"domain\\\":\\\"08a4027677824509beee405482a5f1a5f4feddbf0aafbf4b649c2909732f6909\\\",\\\"path\\\":\\\"/\\\",\\\"scope\\\":\\\"domain\\\",\\\"categories\\\":[75],\\\"confidence\\\":0.8198000000000001,\\\"tier\\\":2}\\n\"\n")

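def unescape(match):
    # \" becomes ", and the two-character sequence \n becomes a real newline
    return match.group(1) or "\n"

# Strip the trailing newline so the closing quote is the last character
result = re.sub(regex, unescape, test_str.strip())
if result.startswith('"') and result.endswith('"'):
    result = result[1:-1]
if result:
    print(result)
    # The text holds one JSON document per line, so parse it line by line
    object_result = [json.loads(line) for line in result.splitlines() if line]
    print(object_result)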
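
The answers above all hold the whole string in memory, which may be uncomfortable for the roughly 3GB mentioned in the question. A rough sketch of a chunked reader follows, assuming the file is a single JSON-encoded string whose records are separated by the two characters backslash and n. The function name iter_records and the chunk size are hypothetical, and a value that itself contains an escaped backslash followed by "n" would be split incorrectly:

```python
import io
import json

def iter_records(stream, chunk_size=1 << 20):
    """Yield one dict per record from a text stream holding a single
    JSON-encoded string whose content is JSON Lines (assumed layout)."""
    sep = "\\n"  # two characters: a backslash followed by "n"
    buffer = ""
    first = True
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        if first:
            # Assumption: the stream starts with the opening double quote.
            if chunk.startswith('"'):
                chunk = chunk[1:]
            first = False
        buffer += chunk
        # Everything before the last separator is complete; keep the rest.
        *complete, buffer = buffer.split(sep)
        for piece in complete:
            if piece:
                # Inner loads undoes the escaping, outer loads parses the record.
                yield json.loads(json.loads('"' + piece + '"'))
    # What remains is normally just the closing quote and a newline.
    tail = buffer.strip()
    if tail.endswith('"'):
        tail = tail[:-1]
    if tail:
        yield json.loads(json.loads('"' + tail + '"'))

# Hypothetical two-record sample; a real file would be passed as
# open(path, encoding="utf-8") instead of io.StringIO.
sample = io.StringIO(
    r'"{\"domain\":\"abc\",\"tier\":2}\n{\"domain\":\"def\",\"tier\":2}\n"' + "\n"
)
records = list(iter_records(sample, chunk_size=7))
print(records)
```

Because the function is a generator, records can be processed or written out one at a time instead of being collected into a list as in the usage line.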