I have to resolve a problem for the university. I have to use a bash script to indicate the type of data of each column in a csv.
This is the header and the 1st line of the csv
num_expediente;fecha;hora;localizacion;numero;cod_distrito;distrito;tipo_accidente;estado_meteorológico;tipo_vehiculo;tipo_persona;rango_edad;sexo;cod_lesividad;lesividad;coordenada_x_utm;coordenada_y_utm;positiva_alcohol;positiva_droga
2022S000001;01/01/2022;1:30:00;AVDA. ALBUFERA, 19;19;13;PUENTE DE VALLECAS;Alcance;Despejado;Turismo;Conductor;De 18 a 30 años;Mujer;NULL;NULL;443359,226;4472082,272;N;NULL
I am using a sed command to determine the type of data:
sed -rn '1p
2s/[[:blank:]]//g
2s/([[:digit:]] :[[:digit:]] :[[:digit:]] )/(time)/g
2s/([[:digit:]] \/[[:digit:]] \/[[:digit:]] )/(date)/g
2s/([[:digit:]] ,[[:digit:]] )/(decimal)/g
2s/([[:digit:]] [[:alpha:]] [[:digit:]] )/(string)/g
2s/([a-zA-Z] \.[a-zA-Z] ,[0-9] )/(string)/g
2s/([[:digit:]] )/(int)/gp' < $file
But not all the fields are populated with the right type of data. This is the output I get
num_expediente;fecha;hora;localizacion;numero;cod_distrito;distrito;tipo_accidente;estado_meteorológico;tipo_vehiculo;tipo_persona;rango_edad;sexo;cod_lesividad;lesividad;coordenada_x_utm;coordenada_y_utm;positiva_alcohol;positiva_droga
(string);(date);(time);(string);(int);(int);PUENTEDEVALLECAS;Alcance;Despejado;Turismo;Conductor;De(string)años;Mujer;NULL;NULL;(decimal);(decimal);N;NULL
How can I make the bold fields are replace with (string)
I have used:
2s/([^(string)][a-zA-Z] )/(string)/g
But the output is totally wrong
num_expediente;fecha;hora;localizacion;numero;cod_distrito;distrito;tipo_accidente;estado_meteorológico;tipo_vehiculo;tipo_persona;rango_edad;sexo;cod_lesividad;lesividad;coordenada_x_utm;coordenada_y_utm;positiva_alcohol;positiva_droga
(string);((string));(ti(string));(string);(int);(int)(string)(string)(string)(string)(string)(string)(string)(string)(string)(string)(string);((string));((string))(string)(string)
The expected output is:
(string);(date);(time);(string);(int);(int);(string);(string);(string);(string);(string);(string);(string);(string);(string);(decimal);(decimal);(string);(string)
Thanks in advance
CodePudding user response:
Awk
will be a better choice as it can easily split the line into fields
to examine the fields one by one:
awk -F';' -v OFS=';' '
NR>1 {
for (i = 1; i <= NF; i ) {
if (sub(/^[0-9] :[0-9] :[0-9] $/, "(time)", $i));
else if (sub(/^[0-9] \/[0-9] \/[0-9] $/, "(date)", $i));
else if (sub(/^[0-9] ,[0-9] $/, "(decimal)", $i));
else if (sub(/^[0-9] $/, "(int)", $i));
else if (sub(/^.*[a-zA-Z].*$/, "(string)", $i));
}
}
1' input_file
Output:
num_expediente;fecha;hora;localizacion;numero;cod_distrito;distrito;tipo_accidente;estado_meteorologico;tipo_vehiculo;tipo_persona;rango_edad;sexo;cod_lesividad;lesividad;coordenada_x_utm;coordenada_y_utm;positiva_alcohol;positiva_droga
(string);(date);(time);(string);(int);(int);(string);(string);(string);(string);(string);(string);(string);(string);(string);(decimal);(decimal);(string);(string)
The if .. else ..
syntax will be necessary; otherwise once replaced string,
such as (time)
, may be replaced by (string)
again.