I have this table dump from a MySQL system, and although it follows RFC standards, it appears to have added unwanted space in columns where HTML text are stored. For example:
"2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC 03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">"
<div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">"
This is one out of about 30K rows, so I am trying to figure out a smart way to remove the space between the " and <div (and possibly others) here. I tried out:
awk '{$1=$1;printf $0}'
And this kind of works, but it mashes everything into one line which is not what I want. I would like to preserve the line breaks in the CSV dump. I am very curious to hear your ideas on how to tackle this.
CodePudding user response:
The following using GNU awk for multi-char RS, RT, and gensub() will work even if your input file is huge as it doesn't read the whole file into memory, it just reads the strings separated by "<spaces><
or newline one at a time:
$ awk -v RS='"\\s <|\n' '{printf "%s%s", $0, gensub(/"\s </,"\"<",1,RT)}' file
"2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC 03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">"<div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">"
I'm assuming that when you say and possibly others
in your question you mean other case like "<spaces><div>
where there's a "
then spaces then a tag starting with <
but that's obviously just a guess.
CodePudding user response:
With your shown samples only, please try following awk
code. Written and tested in GNU awk
. Simple explanation would be, setting RS
(record separator) as null and in main program, globally substituting new lines followed by spaces followed by <div
with <div
in lines and printing the lines by awk
ish way by using 1
.
awk -v RS="" '{gsub(/\n [[:space:]] <div/,"<div")} 1' Input_file
CodePudding user response:
Assuming your request is to remove the space before the start of the <div
tag, you can try this GNU sed
$ sed -z 's/\(\"\)[[:space:]]\ \(<div .*\)/\1\n\2/' input_file
"2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC 03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">"
<div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">"
CodePudding user response:
You can do this with perl
:
perl -0777 -i -pe 's/"\K\s (?=<div)//g' file
Details
0777
slurps the file into a single string so that the pattern could match line break sequences-i
- file inline replacement is on"\K\s (?=<div)
- matches a"
char that is dropped from the match value with\K
, then one or more whitespaces are consumed (with\s
) and then<div
must follow immediately and the match is replaced with an empty stringg
replaces all occurrences.
You can achieve the same with a GNU sed
:
sed -i -Ez 's/"\s <div/"<div/g' file
where -i
enables inplace file replacement and -E
enables the POSIX ERE regex syntax, and z
pulls the file text into pattern space where line breaks are "visible" for the regex pattern.