I have a very long text file (300-500 MB, thousands of lines), like:
blavbl
[code]
sdasdasd
asdasd
...
[/code]
line X
line Y
etc
...
[code]
...
[/code]
blabla
[code]
[/code]
I want to extract the pieces of text between [code] and [/code]. I have the following code that does the job (partially), but it is very slow:
#!/bin/bash
function split {
    file="$1"
    start_tag="$2"
    end_tag="$3"
    arr=()
    inside=0
    nfodata=$(cat "$file")
    IFS=$'\n' read -d '' -a nfoarray <<< "$nfodata"
    for line in "${nfoarray[@]}"
    do
        if [[ "$line" == "$start_tag" ]]; then
            inside=1
            continue
        fi
        if [[ "$line" == "$end_tag" ]]; then
            inside=0
            continue
        fi
        if [[ $inside == 1 ]]; then
            arr+=("$line")
        fi
    done
    printf "%s\n" "${arr[@]}"
}
split "$myfile" "[code]" "[/code]"
As I wrote, it is very slow, and I don't know if there is a better or faster approach.
The final result should be an array that contains the portions of text between [code] and [/code].
CodePudding user response:
Using sed:
sed '/^\[code\]$/,/^\[\/code\]$/!d;//d'
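To get the result into a bash array, as the question asks, the sed output can be captured with mapfile. A small self-contained sketch (the sample file built here is hypothetical):

```shell
#!/bin/bash
# Build a small sample file matching the question's layout.
myfile=$(mktemp)
printf '%s\n' 'blavbl' '[code]' 'sdasdasd' 'asdasd' '[/code]' 'line X' > "$myfile"

# !d deletes every line outside a [code]...[/code] range;
# //d then deletes the tag lines themselves.
mapfile -t arr < <(sed '/^\[code\]$/,/^\[\/code\]$/!d;//d' "$myfile")

printf '%s\n' "${arr[@]}"
rm -f "$myfile"
```

mapfile (bash 4+) reads each line of the sed output into one array element, so here ${arr[0]} is sdasdasd and ${arr[1]} is asdasd.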
Using awk:
awk '
/^\[\/code\]$/ {--c} c
/^\[code\]$/ {++c}'
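A quick way to check the awk approach, and to keep the extracted blocks distinguishable, is to print a separator whenever the counter drops back to zero. This variant is a sketch, not part of the original answer:

```shell
#!/bin/bash
# Two [code] blocks in the sample input.
input='blavbl
[code]
aaa
bbb
[/code]
middle
[code]
ccc
[/code]'

# c counts currently open [code] tags; the bare pattern "c" prints a
# line while c > 0. On each closing tag we also emit a "--" separator.
result=$(printf '%s\n' "$input" | awk '
/^\[\/code\]$/ { --c; print "--" } c
/^\[code\]$/   { ++c }')

printf '%s\n' "$result"
```

The rule order matters: the closing-tag rule runs before the bare `c` pattern, and the opening-tag rule after it, so the tag lines themselves never print.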
Either of these methods requires the tag patterns to alternate cleanly - no nested, repeated or unclosed tags.
Both print all lines inside the tags, excluding the tags themselves, e.g.:
sdasdasd
asdasd
...
...
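If a pure-bash version is still wanted, streaming the file with while read avoids slurping the whole 300-500 MB file into one variable first (though sed or awk will still be much faster). A sketch, with a hypothetical extract_blocks helper:

```shell
#!/bin/bash
# Read the file line by line; print only lines between the tags.
extract_blocks() {
    local file="$1" line inside=0
    while IFS= read -r line; do
        if [[ $line == '[code]' ]]; then
            inside=1
        elif [[ $line == '[/code]' ]]; then
            inside=0
        elif [[ $inside -eq 1 ]]; then
            printf '%s\n' "$line"
        fi
    done < "$file"
}

sample=$(mktemp)
printf '%s\n' 'x' '[code]' 'one' '[/code]' 'y' '[code]' 'two' '[/code]' > "$sample"
out=$(extract_blocks "$sample")
printf '%s\n' "$out"
rm -f "$sample"
```

Quoting the right-hand side of `[[ == ]]` makes the brackets literal, so no regex escaping is needed here.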