I have a file that consists of tags and content descriptions, e.g.:
@ABC-1111 @ANYTAG
Content: description
content1
content2
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2
@ABC-0000
Content: anything here
content1
content2
I would like to split this file based on the tags with a certain prefix (e.g. "ABC"), along with the content below each tag line. So the example file above would be split into 3 files (since there are 3 distinct tags with the "ABC" prefix).
File "ABC-0000" (found 3 instances in the file):
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2
@ABC-0000
Content: anything here
content1
content2
File "ABC-1111" (found 2 instances in the file):
@ABC-1111 @ANYTAG
Content: description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2
File "ABC-2222" (found 1 instance in the file):
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2
I was trying to use bash script with sed:
for i in $(grep -Eo '@ABC-[0-9]+' "$file" | sort -u); do
    sed -n -r "/${i}/,/^\s*$/p" "$file" >> "$i.out"
done
It seems to work only if there is a blank line between a tag's content and the next tag line.
Is there a way to do this with grep, sed, or awk? Or maybe in Python?
Thanks!!
CodePudding user response:
You can use csplit to split the file into sections at each tag line, then regroup the pieces by tag:
csplit --quiet -f xx ./input.txt '/^@/' '{*}'
TAGS=$(grep -o '@ABC-[^ ]*' ./input.txt | sort -u)
for TAG in $TAGS
do
    grep -l "$TAG" xx* | xargs cat > "$(echo "$TAG" | tr -d '@')"
done
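To sanity-check this against the sample input from the question, the whole run can be reproduced end to end (the filename input.txt and GNU csplit are assumptions; the temporary xx* pieces are removed afterwards):

```shell
# Recreate the sample input from the question (filename is an assumption)
cat > input.txt <<'EOF'
@ABC-1111 @ANYTAG
Content: description
content1
content2
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2
@ABC-0000
Content: anything here
content1
content2
EOF

# Split into one piece per tag-line section (requires GNU csplit),
# regroup the pieces by ABC tag, then remove the temporary pieces
csplit --quiet -f xx ./input.txt '/^@/' '{*}'
for TAG in $(grep -o '@ABC-[^ ]*' ./input.txt | sort -u); do
    grep -l "$TAG" xx* | xargs cat > "${TAG#@}"
done
rm -f xx*
```

Each output file keeps its sections in original file order, because the xx* pieces are numbered in the order csplit created them.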
CodePudding user response:
Assumptions:
- all tag lines start with a @ in column #1
- all lines that start with @ in column #1 are tag lines

One awk idea:
awk '
$1 ~ /^@/ { delete flist                # delete array of output files
            for (i=1; i<=NF; i++) {     # loop through list of tags
                if ($i ~ "^@ABC-") {    # if tag starts with "@ABC-" then ...
                    flist[substr($i,2)] # strip off the "@" and save result as name of an output file
                }
            }
          }
          { for (file in flist)         # for each file in our array ...
                print $0 >> file        # append the current line
          }
' tag.dat
NOTES:
- as currently coded awk will maintain an open file descriptor for each tag/file processed
- for a smallish number of tags/files this likely won't be a problem for most awk implementations
- if running GNU awk you should be able to maintain a sizeable number of open file descriptors
- if receiving a message that awk has exceeded the max number of open file descriptors, a couple of ideas come to mind:
  - before "delete flist" run "for (file in flist) close(file)"; this will likely slow down the overall speed of the script due to an excessive number of open/close file operations
  - store each tag's data in memory (there are a few ways to do this) and in "END {...}" processing loop through a master list of tags, performing a single open/write-all-data-from-memory/close operation for each tag; assumes the entire file can fit in memory
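The in-memory idea could be sketched like this (a small sample input is recreated here so the example is self-contained; untested on very large files, where memory would be the limit):

```shell
# Small sample input (recreated here; filename tag.dat as in the answer)
cat > tag.dat <<'EOF'
@ABC-1111 @ANYTAG
Content: description
content1
content2
@ABC-0000
Content: anything here
content1
content2
EOF

awk '
$1 ~ /^@/ { delete flist
            for (i=1; i<=NF; i++)
                if ($i ~ "^@ABC-") flist[substr($i,2)]
          }
          { for (f in flist)
                data[f] = data[f] $0 ORS    # buffer lines per tag in memory
          }
END       { for (f in data) {               # one open/write/close per tag
                printf "%s", data[f] > f
                close(f)
            }
          }
' tag.dat
```

Only one file descriptor is open at a time in the END block, so the open-descriptor limit no longer applies.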
Results:
for f in ABC-*
do
    printf '\n############# %s\n' "$f"
    cat "$f"
done
############# ABC-0000
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2
@ABC-0000
Content: anything here
content1
content2
############# ABC-1111
@ABC-1111 @ANYTAG
Content: description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2
############# ABC-2222
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2
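Since the question also asks about Python, the same grouping logic could be sketched there as well; the snippet below shells out to nothing and uses only the standard library (the script name split_tags.py and inline heredoc are assumptions for illustration):

```shell
# Recreate the sample input (filename input.txt is an assumption)
cat > input.txt <<'EOF'
@ABC-1111 @ANYTAG
Content: description
content1
content2
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2
@ABC-0000
Content: anything here
content1
content2
EOF

# Group every tag-line section under each of its @ABC- tags, then
# write one file per tag -- a hedged sketch, not the only way to do it
python3 - <<'EOF'
from collections import defaultdict

groups = defaultdict(list)   # tag name -> collected lines
current = []                 # ABC tags of the section being read
with open("input.txt") as fh:
    for line in fh:
        if line.startswith("@"):
            # New section: note which ABC-prefixed tags it carries
            current = [t.lstrip("@") for t in line.split()
                       if t.startswith("@ABC-")]
        for tag in current:
            groups[tag].append(line)

for tag, lines in groups.items():
    with open(tag, "w") as out:
        out.writelines(lines)
EOF
```

Lines before the first tag line (there are none in the sample) would simply be skipped, since no tag has been seen yet.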