Splitting a file based on tags - grep, sed?

I have a file that consists of tags and content descriptions, e.g.:

@ABC-1111 @ANYTAG
Content: description
content1
content2
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2
@ABC-0000
Content: anything here
content1
content2

I would like to split this file based on the tags with a certain prefix (e.g. "ABC"), keeping each tag line together with the content below it. So the example file above would be split into 3 files, since there are 3 distinct tags with the "ABC" prefix.

File "ABC-0000" (found 3 instances in the file):

@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2
@ABC-0000
Content: anything here
content1
content2

File "ABC-1111" (found two instances in the file):

@ABC-1111 @ANYTAG
Content: description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2

File "ABC-2222" (found 1 instance in the file):

@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2

I tried a bash script with sed:

  for i in $(grep -Eo '@ABC-[0-9]+' "$file" | sort -u); do
    sed -n -r "/${i}/,/^\s*$/p" "$file" >> "$i.out"
  done

It seems to work only if there is a blank line between one tag's content and the next tag line, since the sed range only ends at the first empty line (/^\s*$/).

Is there a way to do this with grep, sed, or awk? Or maybe in Python?

Thanks!!

CodePudding user response:

You can use csplit to split the file into sections, one per tag line, and then reassemble them per tag:

csplit --quiet -f xx ./input.txt '/^@/' '{*}'
TAGS=$(grep -o '@ABC-[^ ]*' ./input.txt | sort -u)
for TAG in $TAGS
do
  grep -l "$TAG" xx* | xargs cat > "$(echo "$TAG" | tr -d '@')"
done
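
Here csplit writes each tag line and its content to a numbered piece (xx00, xx01, ...); grep -l then lists the pieces containing a given tag, and xargs cat concatenates them, in order, into a file named after the tag (with the @ stripped).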

CodePudding user response:

Assumptions:

  • all tag lines start with a @ in column #1
  • all lines that start with @ in column #1 are tag lines

One awk idea:

awk '
$1 ~ /^@/ { delete flist                     # delete array of output files
            for (i=1;i<=NF;i++) {            # loop through list of tags
                if ($i ~ "^@ABC-") {         # if tag starts with "@ABC-" then ..
                   flist[substr($i,2)]       # strip off the "@" and save result as name of an output file
                }
            }
          }

          { for (file in flist)              # for each file in our array ...
                print $0 >> file             # append the current line
          }
' tag.dat

NOTES:

  • as currently coded, awk will maintain an open file descriptor for each tag/file processed
  • for a smallish number of tags/files this likely won't be a problem for most awk implementations
  • if running GNU awk you should be able to maintain a sizeable number of open file descriptors
  • if awk complains about exceeding the maximum number of open file descriptors, a couple of ideas come to mind:
    • before the delete flist, run for (file in flist) close(file); this will likely slow down the overall script due to the extra open/close file operations
    • store each tag's data in memory (there are a few ways to do this) and, in END {...} processing, loop through a master list of tags, performing a single open/write-all-data-from-memory/close operation per tag; this assumes the entire file fits in memory (see the sketch after this list)
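
A minimal, untested sketch of that second in-memory idea (my own variant of the script above, assuming the whole file fits in memory):

awk '
$1 ~ /^@/ { delete flist                     # rebuild list of output files on each tag line
            for (i=1;i<=NF;i++)
                if ($i ~ "^@ABC-")
                    flist[substr($i,2)]
          }

          { for (file in flist)              # buffer the current line per tag, in memory
                data[file] = data[file] $0 ORS
          }

END       { for (file in data) {             # one open/write/close per output file
                printf "%s", data[file] > file
                close(file)
            }
          }
' tag.dat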

Results:

for f in ABC-*
do
    printf '\n############# %s\n' "$f"
    cat "$f"
done

############# ABC-0000
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2
@ABC-0000
Content: anything here
content1
content2

############# ABC-1111
@ABC-1111 @ANYTAG
Content: description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2

############# ABC-2222
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2