Home > OS >  Escaping newline character inside JSON string with sed, but preserving newline between the actual li
Escaping newline character inside JSON string with sed, but preserving newline between the actual li

Time:01-18

I'm trying to replace newline characters in a JSON string, because Postgres' COPY function won't work otherwise.

I've been trying different combinations of sed, such as:

gsed -z 's/\\n/\\\\n/g'
gsed -E ':a;N;$!ba;s/\r{0,1}\n/\\n/g'
gsed '/}$/{N; s/}\n{/},\n{/}

But it either replaces all of them or none.

Here's my sample data:

{"corpusid":75399585,"externalids":{"ACL":null,"DBLP":null,"ArXiv":null,"MAG":"2350229292","CorpusId":"75399585","PubMed":null,"DOI":null,"PubMedCentral":null},"url":"https://www.semanticscholar.org/paper/aec9799d6e12f2000deb8f7dadcc7d7e653f7393","title":"Prevention and Therapy of Osteoporosis","authors":[{"authorId":"40937034","name":"Hu Jun-bo"}],"venue":"","publicationvenueid":null,"year":2003,"referencecount":0,"citationcount":0,"influentialcitationcount":0,"isopenaccess":false,"s2fieldsofstudy":[{"category":"Medicine","source":"s2-fos-model"},{"category":"Medicine","source":"external"}],"publicationtypes":null,"publicationdate":null,"journal":null,"updated":"2022-02-11T07:19:53.877Z"}
{"corpusid":83138554,"openaccessinfo":{"externalids":{"MAG":"883546316","ACL":null,"DOI":null,"PubMedCentral":null,"ArXiv":null},"license":null,"url":null,"status":null},"abstract":"Phytosulfokin-alpha (PSK-alpha) ist ein bisulfatiertes Pentapetid, welches als Wachstumsfaktor die Proliferation von Zellen in Zellkultur niedriger Dichte stimuliert. In der vorliegenden Arbeit wurden physiologische und genetische Ansatze verfolgt, um Funktionen von PSK-alpha in Arabidopsis thaliana aufzuklaren. Promotor-GUS Expressionsanalysen der PSK-Praproproteingene und des PSK-Rezeptorgens AtPSKR1 in Wurzeln und Bluten deuteten auf mogliche Funktionen von PSK-alpha in diesen Geweben hin. \nExogen appliziertes PSK-alpha forderte das Wurzelwachstum von Arabidopsiskeimlingen. Das Wurzelwachstum der PSK-Rezeptor T-DNA Insertionsmutante Atpskr1-T war inhibiert. Diese Ergebnisse zeigten, dass PSK-alpha in die Regulation von Wurzelwachstum involviert ist. \nCharakteristisch fur die Atpskr1-T Mutante war eine verringerte Samenbildung im Vergleich zu Wildtyppflanzen. Genetische Analysen zeigten, dass sowohl ein paternaler als auch maternale Effekte und asynchrones Wachstum der inneren Blutenorgane zur Ausbildung des Mutantenphanotyps fuhren.","updated":"2022-02-11T01:27:48.987Z"}

As you can see, second JSON string has \n within the "abstract" field, it needs to be replaced to \\n I assume for the COPY to work.

But I have to retain newline in between JSON strings, as there's thousands of them, so that COPY reads that file properly.

CodePudding user response:

A workaround without sed:

head -n 1 file > new_file
tail -n 1 file | jq -c '.abstract |= (gsub("\n"; "\\n"))' >> new_file

CodePudding user response:

I have reused the portion with jq by Cyrus and wrapped it into a sh script, as I had many files I needed to convert. Please note it also replaces \" with " as it was preventing COPY function from properly importing JSON strings.

Here's the result:

#!/bin/bash

input_file=$1
output_file=$2

while read -r line; do
  if echo "$line" | jq -e 'has("abstract")' > /dev/null; then
    echo "$line" | jq -c 'try .abstract |= (gsub("\\n"; "\\n") | (gsub("\""; """)))' | tee -a $output_file > /dev/null
  else
    echo "$line" | tee -a $output_file > /dev/null
  fi
done < $input_file

tail -n 1 $input_file | jq -c 'try .abstract |= (gsub("\\n"; "\\n") | (gsub("\""; "&quot;")))' >> $output_file > /dev/null

Usage:

./script_name.sh original_file.json converted_file.json

  • Related