Home > Software engineering >  How to remove spaces between tags in a delimited file?
How to remove spaces between tags in a delimited file?

Time:02-17

I have this table dump from a MySQL system, and although it follows RFC standards, it appears to have added unwanted space in columns where HTML text are stored. For example:

   "2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC 03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">"
       <div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">"

This is one out of about 30K rows, so I am trying to figure out a smart way to remove the space between the " and <div (and possibly others) here. I tried out:

awk '{$1=$1;printf $0}' 

And this kind of works, but it mashes everything into one line which is not what I want. I would like to preserve the line breaks in the CSV dump. I am very curious to hear your ideas on how to tackle this.

CodePudding user response:

The following using GNU awk for multi-char RS, RT, and gensub() will work even if your input file is huge as it doesn't read the whole file into memory, it just reads the strings separated by "<spaces>< or newline one at a time:

$ awk -v RS='"\\s <|\n' '{printf "%s%s", $0, gensub(/"\s </,"\"<",1,RT)}' file
   "2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC 03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">"<div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">"

I'm assuming that when you say and possibly others in your question you mean other case like "<spaces><div> where there's a " then spaces then a tag starting with < but that's obviously just a guess.

CodePudding user response:

With your shown samples only, please try following awk code. Written and tested in GNU awk. Simple explanation would be, setting RS(record separator) as null and in main program, globally substituting new lines followed by spaces followed by <div with <div in lines and printing the lines by awkish way by using 1.

awk -v RS="" '{gsub(/\n [[:space:]] <div/,"<div")} 1' Input_file

CodePudding user response:

Assuming your request is to remove the space before the start of the <div tag, you can try this GNU sed

$ sed -z 's/\(\"\)[[:space:]]\ \(<div .*\)/\1\n\2/' input_file
   "2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC 03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">"
<div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">"

CodePudding user response:

You can do this with perl:

perl -0777 -i -pe 's/"\K\s (?=<div)//g' file

Details

  • 0777 slurps the file into a single string so that the pattern could match line break sequences
  • -i - file inline replacement is on
  • "\K\s (?=<div) - matches a " char that is dropped from the match value with \K, then one or more whitespaces are consumed (with \s ) and then <div must follow immediately and the match is replaced with an empty string
  • g replaces all occurrences.

You can achieve the same with a GNU sed:

sed -i -Ez 's/"\s <div/"<div/g' file

where -i enables inplace file replacement and -E enables the POSIX ERE regex syntax, and z pulls the file text into pattern space where line breaks are "visible" for the regex pattern.

  • Related