Home > Enterprise >  Bash scripting- Detecting data type from each column on a csv file
Bash scripting- Detecting data type from each column on a csv file

Time:05-23

I have a csv file with columns. I would like, if possible, to somehow extract/detect the data type of each column.

The header of the file is:

EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc

And the first 5 lines of data (to help detecting the data type)

1985,1,1,10/26/1984,21,252,21.6899384,CHI,1,WSB,1,16,1,40,5,16,0.313,0,0,,6,7,0.857,1,5,6,7,2,4,5,2,16,12.5
1985,2,2,10/27/1984,21,253,21.69267625,CHI,0,MIL,0,-2,1,34,8,13,0.615,0,0,,5,5,1,3,2,5,5,2,1,3,4,21,19.4
1985,3,3,10/29/1984,21,255,21.69815195,CHI,1,MIL,1,6,1,34,13,24,0.542,0,0,,11,13,0.846,2,2,4,5,6,2,3,4,37,32.9
1985,4,4,10/30/1984,21,256,21.7008898,CHI,0,KCK,1,5,1,36,8,21,0.381,0,0,,9,9,1,2,2,4,5,3,1,6,5,25,14.7
1985,5,5,11/1/1984,21,258,21.7063655,CHI,0,DEN,0,-16,1,33,7,15,0.467,0,0,,3,4,0.75,3,2,5,5,1,1,2,4,17,13.2

As you can see most of them are integer, there are 2 string, 1 date and a few float.

I've seen somewhere it was possible with regex, but I have no knowledge as to how.(It does not need to be with regex)

Could some kind heart help me out? I have search but no documentation or command seems to be existing (or to be common/known)

The expected output (or the data types) are:

EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc
Int,Int,Int,Date,Int,Int,Float,String,Int,String,Int,Int,Int,Int,Int,Int,Float,Int,Int,Float,Int,Int,Float,Int,Int,Int,Int,Int,Int,Int,Int,Int,Float

CodePudding user response:

Here's a simple Awk script to infer types from the second line of the input file.

It will not cope correctly with quoted fields in the CSV file.

awk -F , 'NR==2 {
    sep = ""
    for(i=1; i<=NF; i  ) {
        if ($i ~ /^[0-9] $/) type="int"
        else if ($i ~ /^[0-9] \.[0-9]*$|^[0-9]*\.[0-9] $/) type="float"
        else if ($i ~ /^[0-9]{2}\/[0-9]{2}\/[0-9]{4}$/) type="date"
        else type="str"
        printf "%s%s", sep, type
        sep = ","
    }
    printf "\n"
    exit
}' file.csv

Demo: https://ideone.com/Hn7DcV

  • Related