Data types from csv columns using sed and regex-CodePudding

I have to resolve a problem for the university. I have to use a bash script to indicate the type of data of each column in a csv.

This is the header and the 1st line of the csv

num_expediente;fecha;hora;localizacion;numero;cod_distrito;distrito;tipo_accidente;estado_meteorológico;tipo_vehiculo;tipo_persona;rango_edad;sexo;cod_lesividad;lesividad;coordenada_x_utm;coordenada_y_utm;positiva_alcohol;positiva_droga
2022S000001;01/01/2022;1:30:00;AVDA. ALBUFERA, 19;19;13;PUENTE DE VALLECAS;Alcance;Despejado;Turismo;Conductor;De 18 a 30 años;Mujer;NULL;NULL;443359,226;4472082,272;N;NULL

I am using a sed command to determine the type of data:

sed -rn '1p
    2s/[[:blank:]]//g
    2s/([[:digit:]] :[[:digit:]] :[[:digit:]] )/(time)/g
    2s/([[:digit:]] \/[[:digit:]] \/[[:digit:]] )/(date)/g
    2s/([[:digit:]] ,[[:digit:]] )/(decimal)/g
    2s/([[:digit:]] [[:alpha:]] [[:digit:]] )/(string)/g
    2s/([a-zA-Z] \.[a-zA-Z] ,[0-9] )/(string)/g
    2s/([[:digit:]] )/(int)/gp' < $file

But not all the fields are populated with the right type of data. This is the output I get

num_expediente;fecha;hora;localizacion;numero;cod_distrito;distrito;tipo_accidente;estado_meteorológico;tipo_vehiculo;tipo_persona;rango_edad;sexo;cod_lesividad;lesividad;coordenada_x_utm;coordenada_y_utm;positiva_alcohol;positiva_droga
(string);(date);(time);(string);(int);(int);PUENTEDEVALLECAS;Alcance;Despejado;Turismo;Conductor;De(string)años;Mujer;NULL;NULL;(decimal);(decimal);N;NULL

How can I make the bold fields are replace with (string)

I have used:

2s/([^(string)][a-zA-Z] )/(string)/g

But the output is totally wrong

num_expediente;fecha;hora;localizacion;numero;cod_distrito;distrito;tipo_accidente;estado_meteorológico;tipo_vehiculo;tipo_persona;rango_edad;sexo;cod_lesividad;lesividad;coordenada_x_utm;coordenada_y_utm;positiva_alcohol;positiva_droga
(string);((string));(ti(string));(string);(int);(int)(string)(string)(string)(string)(string)(string)(string)(string)(string)(string)(string);((string));((string))(string)(string)

The expected output is:

(string);(date);(time);(string);(int);(int);(string);(string);(string);(string);(string);(string);(string);(string);(string);(decimal);(decimal);(string);(string)

Thanks in advance

CodePudding user response：

Awk will be a better choice as it can easily split the line into fields to examine the fields one by one:

awk -F';' -v OFS=';' '
    NR>1 {
        for (i = 1; i <= NF; i  ) {
            if (sub(/^[0-9] :[0-9] :[0-9] $/, "(time)", $i));
            else if (sub(/^[0-9] \/[0-9] \/[0-9] $/, "(date)", $i));
            else if (sub(/^[0-9] ,[0-9] $/, "(decimal)", $i));
            else if (sub(/^[0-9] $/, "(int)", $i));
            else if (sub(/^.*[a-zA-Z].*$/, "(string)", $i));
        }
    }
1' input_file

Output:

num_expediente;fecha;hora;localizacion;numero;cod_distrito;distrito;tipo_accidente;estado_meteorologico;tipo_vehiculo;tipo_persona;rango_edad;sexo;cod_lesividad;lesividad;coordenada_x_utm;coordenada_y_utm;positiva_alcohol;positiva_droga
(string);(date);(time);(string);(int);(int);(string);(string);(string);(string);(string);(string);(string);(string);(string);(decimal);(decimal);(string);(string)

The if .. else .. syntax will be necessary; otherwise once replaced string, such as (time), may be replaced by (string) again.