How can I split files by grouping the same lines?


How can I split files by grouping the same lines using shell script or awk?

For example, I have a file with the following content:

1,1,1,1
2,2,2,2
3,3,3,3
x,x,x,x
x,x,x,x
x,x,x,x
x,x,x,x
y,y,y,y
y,y,y,y
y,y,y,y
4,4,4,4
5,5,5,5

What I want is: all the equal lines form a group and must go into their own separate file, while the remaining distinct lines need to be written to split files up to a specific limit. For example, with a limit of 10, the lines containing numbers must be split into files of at most 10 lines each (<= 10); if there are more distinct lines than the limit, another split file is created, and so on.

For the equal lines containing letters I need each group to have its own separate file: one file only for the x,x,x,x lines, another for the y,y,y,y lines, and so on.

The content of the lines is just an example; the real case is a CSV with different values in every column, where I need to group by a specific column value (I'm using sort and uniq for this). Either way, I need to split this CSV into one file per group of equal lines, and the remaining distinct lines into files of at most the given limit, using a shell script or awk (I see awk offers better performance).
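
To illustrate the grouping step I mention (the column number and the file name input.csv are only examples):

# count how many times each identical line occurs, after sorting on column 1
sort -t, -k1,1 input.csv | uniq -c
# output such as "4 x,x,x,x" means the x,x,x,x line occurs 4 times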

Do you have any idea?

My current code is below (it keeps the first line because I'm assuming the CSV has a header):

#!/bin/bash
COLUMN=$1
FILE=$2
LIMIT=$3
FILELENGTH=`wc -l < $FILE`
COUNTER=$LIMIT
NUMS=""
SORTED="sorted_"`basename $FILE`

sort -t, -k $COLUMN -n $FILE > $SORTED
while [ $COUNTER -le $FILELENGTH ]; do
        NUMS+=`uniq -c $SORTED | awk -v val=$COUNTER '($1+prev)<=val {prev+=$1} END{print prev}'`
        NUMS+=" "
        ((COUNTER+=LIMIT))
        echo $NUMS "|" $COUNTER "|" $FILELENGTH "|" $SORTED
done

awk -v nums="$NUMS" -v fname=`basename $2` -v dname=`dirname $2` '
   NR==1 { header=$0; next}
   (NR-1)==1 {
        c=split(nums,b)
        for(i=1; i<=c; i++) a[b[i]]
        j=1; out = dname"/" "splited" j "_"fname
        print header > out
        system("touch "out".fin")
    }
    { print > out }
    NR in a {
        close(out)
        out = dname "/" "splited" ++j "_"fname
        print header > out
        system("touch "out".fin")
    }' $SORTED

CodePudding user response:

With GNU awk you could try the following code, written as per your shown samples. It makes two passes over the Input_file. Lines that occur more than once in the Input_file are written to an output file named after their first field value, e.g. firstFieldValue.outFile, while lines that are unique (only one occurrence in your Input_file) are written to files named like 1.singleOccurrence.outFile, 2.singleOccurrence.outFile and so on.

awk '
BEGIN{
  count1="1"
  FS=OFS=","
}
FNR==NR{
  arr[$0]++
  next
}
arr[$0]>1{
  print > ($1".outFile")
  next
}
{
  count1+=(++count2==0?1:0)
  print > (count1".singleOccurrence.outFile")
}
'  Input_file  Input_file
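
Note that this gives every group of duplicates its own file, but it does not enforce the <= limit from the question on the unique lines. If you need that as well, a minimal sketch of the idea, assuming a hard-coded limit of 10 and the made-up variable names cnt and fileNo, could be:

awk '
BEGIN{
  FS=OFS=","
  limit=10                                   # example limit, adjust as needed
}
FNR==NR{ arr[$0]++; next }                   # 1st pass: count occurrences of each line
arr[$0]>1{ print > ($1".outFile"); next }    # duplicated lines: one file per group
{
  if(++cnt>limit){ cnt=1; fileNo++ }         # after limit unique lines, start a new chunk
  print > ((fileNo+1)".singleOccurrence.outFile")
}
' Input_file Input_file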


OR, to keep the header (the very first line of your Input_file) in each output file, please try the following awk code, a little tweak of the code above:

awk '
BEGIN{
  count1="1"
  FS=OFS=","
}
FNR==1{ headers = $0; next }
FNR==NR && FNR>1{
  arr[$0]++
  next
}
arr[$0]>1{
  if(!arr1[$0]++){ print headers > ($1".outFile") }
  print > ($1".outFile")
  next
}
{
  count1+=(++count2==0?1:0)
  if(prev!=count1){ print headers > (count1".singleOccurrence.outFile") }
  print > (count1".singleOccurrence.outFile")
  prev=count1
}
'  Input_file  Input_file
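
After running it you can spot-check that every output file starts with the header, for example with the sample data (the file names follow the naming scheme above):

head -n 2 x.outFile y.outFile 1.singleOccurrence.outFile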

CodePudding user response:

 awk -F, -v limit=3 '
    BEGIN{i=1}
    NR==1{
        header=$0                                       # save the header
        next                                            # go to next line
    }
    FNR==NR{                                            # process letters-lines
        if(f!=$0) print header > ("tmp/file_" $1)       # print initial header
        f=$0                                            # save line
        print $0 > ("tmp/file_" $1)                     # print line to file
        next                                            # go to next line
    }
    {                                                   # process numbers-lines    
        if (x!=i) print header > ("tmp/file_" i)        # print initial header
        x=i                                             # save number
        print $0 > ("tmp/file_" i)                      # print line to file
    }
    FNR % limit == 0{                                   # check limit
        i++
    }
' <(head -n 1 split.csv;                                # getting the header
    grep "^[a-Z]" <(sed '1d' split.csv)|sort            # getting sorted letters-lines
   ) \
  <(grep "^[^a-Z]" <(sed '1d' split.csv))               # getting numbers-lines


$ head tmp/*
==> tmp/file_1 <==
header
1,1,1,1
2,2,2,2
3,3,3,3

==> tmp/file_2 <==
header
4,4,4,4
5,5,5,5

==> tmp/file_x <==
header
x,x,x,x
x,x,x,x
x,x,x,x
x,x,x,x

==> tmp/file_y <==
header
y,y,y,y
y,y,y,y
y,y,y,y