How can I split files by grouping the same lines using shell script or awk?
For example, I have one file with the following content:
1,1,1,1
2,2,2,2
3,3,3,3
x,x,x,x
x,x,x,x
x,x,x,x
x,x,x,x
y,y,y,y
y,y,y,y
y,y,y,y
4,4,4,4
5,5,5,5
What I want is: all the equal lines form a group and must go into a separate file; the remaining distinct lines need to be split into files up to a specific limit. For example, with a limit of 10, the original file must be split so that each output file holds at most 10 of the distinct lines (<= 10); if there are more distinct lines than the limit, another split file is created, and so on.
The groups of equal lines containing letters each need their own separate file: one file only for the x,x,x,x lines, another for the y,y,y,y lines, and so on.
The line content above is just an example; the real case is a CSV with different values in all columns, where I need to group by a specific column's value (I'm using sort and uniq for this). Either way, I need to split this CSV into one file per group of equal lines, plus files of at most the limit of distinct lines each, using shell script or awk (I see awk provides better performance).
Do you have any idea?
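To illustrate the sort/uniq grouping I mean, here is a minimal sketch (sample.csv and its contents are made up for the example; `uniq -c` only counts adjacent duplicates, hence the sort first):

```shell
# Hypothetical sample resembling the data above; uniq -c counts
# adjacent duplicates, so the input has to be sorted first.
printf '%s\n' '1,1,1,1' 'x,x,x,x' 'x,x,x,x' 'y,y,y,y' '2,2,2,2' > sample.csv
sort sample.csv | uniq -c | awk '$1 > 1 {print $2}'   # prints: x,x,x,x
```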
My current code is below (it preserves the first line, since I'm assuming the CSV has a header):
#!/bin/bash
COLUMN=$1
FILE=$2
LIMIT=$3
FILELENGTH=$(wc -l < "$FILE")
COUNTER=$LIMIT
NUMS=""
SORTED="sorted_$(basename "$FILE")"
sort -t, -k "$COLUMN" -n "$FILE" > "$SORTED"
while [ $COUNTER -le $FILELENGTH ]; do
NUMS+=$(uniq -c "$SORTED" | awk -v val="$COUNTER" '($1+prev)<=val {prev+=$1} END{print prev}')
NUMS+=" "
((COUNTER+=LIMIT))
echo $NUMS "|" $COUNTER "|" $FILELENGTH "|" $SORTED
done
awk -v nums="$NUMS" -v fname="$(basename "$2")" -v dname="$(dirname "$2")" '
NR==1 { header=$0; next}
(NR-1)==1 {
c=split(nums,b)
for(i=1; i<=c; i++) a[b[i]]
j=1; out = dname"/" "splited" j "_"fname
print header > out
system("touch "out".fin")
}
{ print > out }
NR in a {
close(out)
out = dname "/" "splited" ++j "_" fname
print header > out
system("touch "out".fin")
}' $SORTED
CodePudding user response:
With GNU awk
you could try the following code, written as per your shown samples. It makes two passes over Input_file. Lines that occur more than once in Input_file get an output file named after their first field, e.g. firstFieldValue.outFile
, and lines that are unique (having only one occurrence in your Input_file) go to files named like 1.singleOccurrence.outFile
, 2.singleOccurrence.outFile
and so on.
awk '
BEGIN{
count1="1"
FS=OFS=","
}
FNR==NR{
arr[$0]++
next
}
arr[$0]>1{
print > ($1".outFile")
next
}
{
count1+=(count2++==0?0:1)
print > (count1".singleOccurrence.outFile")
}
' Input_file Input_file
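A minimal sketch of the two-pass `FNR==NR` idiom the code above relies on (demo.txt and its contents are made up): during the first read of the file FNR equals NR, so only the counting block runs; during the second read the counts are complete and can be consulted.

```shell
# Pass 1 (FNR==NR): count every line. Pass 2: counts are complete,
# so duplicates and singletons can be routed differently.
printf '%s\n' a b a c > demo.txt
awk 'FNR==NR { seen[$0]++; next }          # first read: just count
     seen[$0] > 1 { print $0, "dup" }' demo.txt demo.txt
```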
OR, to keep the header (the very first line of your Input_file) in each output file, please try the following awk
code, a little tweak of the one above:
awk '
BEGIN{
count1="1"
FS=OFS=","
}
FNR==1{ headers = $0; next }
FNR==NR && FNR>1{
arr[$0]++
next
}
arr[$0]>1{
if(!arr1[$0]++){ print headers > ($1".outFile") }
print > ($1".outFile")
next
}
{
count1+=(count2++==0?0:1)
if(prev!=count1){print headers > (count1".singleOccurrence.outFile")}
print > (count1".singleOccurrence.outFile")
prev=count1
}
' Input_file Input_file
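The header-handling pattern both variants use, printing the header exactly once per output file by remembering which files were already opened, can be sketched on its own (in.csv and the *.out names are hypothetical):

```shell
# in.csv and the *.out names are made up for the sketch. The opened[]
# array remembers which output files already got the header line.
printf 'id,val\n1,a\n2,b\n1,c\n' > in.csv
awk -F, '
NR==1 { h=$0; next }                              # save the header line
{
  f = $1 ".out"
  if (!(f in opened)) { opened[f]; print h > f }  # header once per file
  print > f                                       # then append the row
}' in.csv
```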
CodePudding user response:
awk -F, -v limit=3 '
BEGIN{i=1}
NR==1{
header=$0 # save the header
next # go to next line
}
FNR==NR{ # process letters-lines
if(f!=$0) print header > ("tmp/file_" $1) # print initial header
f=$0 # save line
print $0 > ("tmp/file_" $1) # print line to file
next # go to next line
}
{ # process numbers-lines
if (x!=i) print header > ("tmp/file_" i) # print initial header
x=i # save number
print $0 > ("tmp/file_" i) # print line to file
}
FNR % limit == 0{ # check limit
i++ # move on to the next numbered file
}
' <(head -n 1 split.csv; # getting the header
grep "^[[:alpha:]]" <(sed '1d' split.csv)|sort # getting sorted letters-lines
) \
<(grep "^[^[:alpha:]]" <(sed '1d' split.csv)) # getting numbers-lines
$ head tmp/*
==> tmp/file_1 <==
header
1,1,1,1
2,2,2,2
3,3,3,3
==> tmp/file_2 <==
header
4,4,4,4
5,5,5,5
==> tmp/file_x <==
header
x,x,x,x
x,x,x,x
x,x,x,x
x,x,x,x
==> tmp/file_y <==
header
y,y,y,y
y,y,y,y
y,y,y,y
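The process-substitution trick used above, feeding awk two `<(...)` streams as if they were two separate files, can be seen in isolation with a minimal, hypothetical example (bash-only syntax):

```shell
# bash-only: each <(...) appears to awk as a separate input file,
# so FNR==NR is true exactly while the first stream is being read.
awk 'FNR==NR { a[$0]; next }   # pass 1: remember every line of stream 1
     $0 in a                   # pass 2: print lines also seen in stream 1
' <(printf 'x\ny\n') <(printf 'y\nz\n')   # prints: y
```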