I have a huge comma-separated file formatted like this:
name,account,1/2022,2/2022,3/2022
row1,1234,0,1,2
row2,5678,3,4,5
row3,4321,6,7,8
row4,8765,9,10,11
I would like to transpose (unpivot) it efficiently using only bash commands. I have already tried pandas melt in Python, and I have also loaded the data into a database and used its unpivot function, but I think both are slower than a native bash or awk solution.
The output should look like this:
name,account,date,value
row1,1234,1/2022,0
row1,1234,2/2022,1
row1,1234,3/2022,2
row2,5678,1/2022,3
row2,5678,2/2022,4
row2,5678,3/2022,5
row3,4321,1/2022,6
row3,4321,2/2022,7
row3,4321,3/2022,8
row4,8765,1/2022,9
row4,8765,2/2022,10
row4,8765,3/2022,11
The expected result runs to millions of rows, and in Python I would have to chunk the dataframe and loop over it. Most SQL databases can do it, but the unpivot function seems to be single-threaded and thus slow.
I am looking for creative solutions in awk or anything else native to Ubuntu.
CodePudding user response:
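You can try this:
awk 'BEGIN{OFS=FS=","}
# Header: print the new header and remember the date column names.
NR==1{print $1,$2,"date","value";
for(i=0;i<NF-2;i++){date[i]=$(i+3);}
}
# Data rows: emit one output line per date column.
NR>1{for(i=0;i<NF-2;i++){print $1,$2,date[i],$(i+3)}}
' inputfile
which gives: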
name,account,date,value
row1,1234,1/2022,0
row1,1234,2/2022,1
row1,1234,3/2022,2
row2,5678,1/2022,3
row2,5678,2/2022,4
row2,5678,3/2022,5
row3,4321,1/2022,6
row3,4321,2/2022,7
row3,4321,3/2022,8
row4,8765,1/2022,9
row4,8765,2/2022,10
row4,8765,3/2022,11
CodePudding user response:
Another variation on Jose's approach, which simply adjusts the indexes a bit to be consistent with iterating from 1 to NF, could be:
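awk -F, -v OFS="," '
FNR == 1 {
    # Header: remember each date label (fields 3..NF), then print the new header.
    for (i = 3; i <= NF; i++)
        dates[i-2] = $i
    print $1, $2, "date,value"
    next
}
{
    # Data rows: emit one output line per date column.
    for (i = 3; i <= NF; i++)
        print $1, $2, dates[i-2], $i
}
' file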
This assumes every record has the same number of fields as the header, but it handles any number of date columns from field 3 onward.
Example Use/Output
Copying the above and middle-mouse pasting it into an xterm in the directory where your input file is located gives:
$ awk -F, -v OFS="," '
> FNR == 1 {
>     for (i = 3; i <= NF; i++)
>         dates[i-2] = $i
>     print $1, $2, "date,value"
>     next
> }
> {
>     for (i = 3; i <= NF; i++)
>         print $1, $2, dates[i-2], $i
> }
> ' file
name,account,date,value
row1,1234,1/2022,0
row1,1234,2/2022,1
row1,1234,3/2022,2
row2,5678,1/2022,3
row2,5678,2/2022,4
row2,5678,3/2022,5
row3,4321,1/2022,6
row3,4321,2/2022,7
row3,4321,3/2022,8
row4,8765,1/2022,9
row4,8765,2/2022,10
row4,8765,3/2022,11
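Since this approach assumes every record has as many fields as the header, here is a minimal sketch of one way to guard against ragged rows: it skips any record whose field count differs from the header and reports it on stderr. The function name unpivot, the temporary sample file, and the skip-and-warn policy are my own assumptions for illustration, not something from the answers above. It also assumes that awk can write to /dev/stderr, as it can on a typical Ubuntu install.

```shell
# Small sample based on the question's data; row2 is deliberately one field short.
input=$(mktemp)
cat > "$input" <<'EOF'
name,account,1/2022,2/2022,3/2022
row1,1234,0,1,2
row2,5678,3,4
row3,4321,6,7,8
EOF

# Hypothetical helper: unpivot a wide CSV, skipping rows whose field count
# does not match the header.
unpivot() {
  awk -F, -v OFS=, '
    FNR == 1 {
      nf = NF                                  # expected field count
      for (i = 3; i <= NF; i++) dates[i] = $i  # date labels keyed by field number
      print $1, $2, "date", "value"
      next
    }
    NF != nf {
      # Report the ragged row on stderr and skip it.
      printf("line %d: expected %d fields, got %d\n", FNR, nf, NF) > "/dev/stderr"
      next
    }
    {
      for (i = 3; i <= NF; i++) print $1, $2, dates[i], $i
    }
  ' "$1"
}

result=$(unpivot "$input")
printf '%s\n' "$result"
rm -f "$input"
```

Skipping is only one possible policy. If a short row should instead be padded with empty values, or should abort the whole run, the NF != nf rule is the place to change it.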