Splitting a file based on tags - grep, sed?

I have a file that consists of tags and content descriptions, e.g.:

@ABC-1111 @ANYTAG
Content: description
content1
content2
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2
@ABC-0000
Content: anything here
content1
content2

I would like to split this file based on the tags with a certain prefix (e.g. "ABC"), keeping each tag line together with the content below it. So the example file above would be split into 3 files, since there are 3 distinct tags with the "ABC" prefix.

File "ABC-0000" (found 3 instances in the file):

@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2
@ABC-0000
Content: anything here
content1
content2

File "ABC-1111" (found two instances in the file):

@ABC-1111 @ANYTAG
Content: description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2

File "ABC-2222" (found 1 instance in the file):

@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2

I tried a bash script with sed:

  for i in $(grep -Eo '@ABC-[0-9]+' "$file" | sort -u); do
    sed -n -r "/${i}/,/^\s*$/p" "$file" >> "$i.out"
  done

It seems to work only if there is a blank line between one tag's content and the next tag line, since the sed range only ends at the first empty line (/^\s*$/).

Is there a way to do this with grep, sed, or awk? Or maybe in Python?

Thanks!!

CodePudding user response:

You can use csplit to split the file into sections, one per tag line, and then reassemble them per tag:

csplit --quiet -f xx ./input.txt '/^@/' '{*}'
TAGS=$(grep -o '@ABC-[^ ]*' ./input.txt | sort -u)
for TAG in $TAGS
do
  grep -l "$TAG" xx* | xargs cat > "$(echo "$TAG" | tr -d '@')"
done
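
Here csplit writes each tag line and its content to a numbered piece (xx00, xx01, ...); grep -l then lists the pieces containing a given tag, and xargs cat concatenates them, in order, into a file named after the tag (with the @ stripped).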

CodePudding user response:

Assumptions:

  • all tag lines start with a @ in column #1
  • all lines that start with @ in column #1 are tag lines

One awk idea:

awk '
$1 ~ /^@/ { delete flist                     # delete array of output files
            for (i=1;i<=NF;i++) {            # loop through list of tags
                if ($i ~ "^@ABC-") {         # if tag starts with "@ABC-" then ..
                   flist[substr($i,2)]       # strip off the "@" and save result as name of an output file
                }
            }
          }

          { for (file in flist)              # for each file in our array ...
                print $0 >> file             # append the current line
          }
' tag.dat

NOTES:

  • as currently coded, awk will maintain an open file descriptor for each tag/file processed
  • for a smallish number of tags/files this likely won't be a problem for most awk implementations
  • if running GNU awk you should be able to maintain a sizeable number of open file descriptors
  • if awk complains about exceeding the maximum number of open file descriptors, a couple of ideas come to mind:
    • before the delete flist, run for (file in flist) close(file); this will likely slow down the overall script due to the extra open/close file operations
    • store each tag's data in memory (there are a few ways to do this) and, in END {...} processing, loop through a master list of tags, performing a single open/write-all-data-from-memory/close operation per tag; this assumes the entire file fits in memory (see the sketch after this list)
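
A minimal, untested sketch of that second in-memory idea (my own variant of the script above, assuming the whole file fits in memory):

awk '
$1 ~ /^@/ { delete flist                     # rebuild list of output files on each tag line
            for (i=1;i<=NF;i++)
                if ($i ~ "^@ABC-")
                    flist[substr($i,2)]
          }

          { for (file in flist)              # buffer the current line per tag, in memory
                data[file] = data[file] $0 ORS
          }

END       { for (file in data) {             # one open/write/close per output file
                printf "%s", data[file] > file
                close(file)
            }
          }
' tag.dat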

Results:

for f in ABC-*
do
    printf '\n############# %s\n' "$f"
    cat "$f"
done

############# ABC-0000
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2
@ABC-0000
Content: anything here
content1
content2

############# ABC-1111
@ABC-1111 @ANYTAG
Content: description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2

############# ABC-2222
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2