I try to use awk
to filter some rows based on whether a given column contains an integer, floating point or empty string.
I have problems with the integers or the way to add them to the rule but avoiding adding weird expressions like 2.e
.
INPUT:
220802,2249,3
220802,2250,37.5
220802,2251,2.e
220802,2252,28
220802,2253,
220802,2254,19.5
220802,2255,3.a
220802,2256,123.25
220802,2257,1.32e-8
220802,2258,N/A
EXPECTED OUTPUT:
220802,2249,3
220802,2250,37.5
220802,2252,28
220802,2253,
220802,2254,19.5
220802,2256,123.25
220802,2257,1.32e-8
I try this, but fails:
awk -F , '$3 ~ /(^$|[0-9] \.?[0-9] e?[- ]?[0-9]? )/' INPUT
220802,2250,37.5
220802,2252,28
220802,2253,
220802,2254,19.5
220802,2256,123.25
220802,2257,1.32e-8
Also try this:
awk -F , '$3 ~ /(^$|[0-9] |[0-9] \.?[0-9] e?[- ]?[0-9]? )/'
220802,2249,3
220802,2250,37.5
220802,2251,2.e
220802,2252,28
220802,2253,
220802,2254,19.5
220802,2255,3.a
220802,2256,123.25
220802,2257,1.32e-8
Any aproach to consider? Thanks in advance!
CodePudding user response:
for the shown samples :
mawk '$NF <= $NF' FS=','
220802,2249,3
220802,2250,37.5
220802,2252,28
220802,2253,
220802,2254,19.5
220802,2256,123.25
220802,2257,1.32e-8
CodePudding user response:
You can easily check for empty.
You can check if something is a number by comparing it to itself plus zero.
So:
awk -F, '!$3 || (0 $3 == $3)' INPUT
The reason your regex didn't work is that you didn't anchor it. For example, [0-9]
matches any substring. An alternative that may match what you intended is:
/^([0-9] (\.[0-9] )?(e?[- ]?[0-9] )?)?$/
CodePudding user response:
To use AWK,
awk -F , '$3 ~ /(^$|[0-9] \.?[0-9] e?[- ]?[0-9]? )|([0-9] $)/' numbers.csv
To use rquery (it's easy to understand the filters. https://github.com/fuyuncat/rquery/releases)
./rq -q "p d/,/ | s @raw | f @%=2 or isnull(@3)=1 or datatype(@3)='LONG' or datatype(@3)='DOUBLE'" numbers.csv
220802,2249,3
220802,2250,37.5
220802,2252,28
220802,2253,
220802,2254,19.5
220802,2256,123.25
220802,2257,1.32e-8