using sed to omit specific lines of a dataset

Time:04-07

I have a comma-separated dataset. Here is an example:

id, date of birth, grade, explusion, serious misdemeanor, info
123,2005-01-01,5.36,1,1, 
582,1999-05-12,8.51,0,1
9274,2001-25-12,9.65,0,0,pass
21,2006-14-05,0.53,4,1,repeat

I need to write a regular expression for sed that removes from the student dataset every record that has neither an expulsion nor a serious misdemeanor. In the sample above, the command should therefore remove only the third record. This is what I tried:

sed -i "/^*,*,*,0,0$/d" file.csv

Any idea of what's missing?

CodePudding user response:

You might want to use awk to check fields 4 and 5 and print only the lines where at least one of them is not 0:

awk -F, '$4 != 0 || $5 != 0' file.csv > output.csv

Or, to get only the lines where both fields are 0:

awk -F, '$4 == 0 && $5 == 0' file.csv > output.csv


You can also use

sed -i '/,0,0$/d' file.csv

With this, you will remove all lines ending with ,0,0.

See the online demo:

#!/bin/bash
s='id, date of birth, grade, explusion, serious misdemeanor
123,2005-01-01,5.36,1,1
582,1999-05-12,8.51,0,1
9274,2001-25-12,9.65,0,0
21,2006-14-05,0.53,4,1'
sed '/,0,0$/d' <<< "$s"

Output:

id, date of birth, grade, explusion, serious misdemeanor
123,2005-01-01,5.36,1,1
582,1999-05-12,8.51,0,1
21,2006-14-05,0.53,4,1

To see the other lines instead, invert the command. Drop -i here, otherwise file.csv is overwritten with only the matching lines:

sed -n '/,0,0$/p' file.csv

It will print the lines that end with ,0,0.
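
One caveat: ,0,0$ only matches when the two flags are the last fields on the line. The sample in the question has a sixth info column (for example 9274,2001-25-12,9.65,0,0,pass), so a pattern anchored on field position may be safer. The following is a minimal sketch, not a definitive solution: the function name drop_clean_records is made up for illustration, and the pattern assumes that no field contains an embedded comma.

```shell
#!/bin/bash
# Hypothetical helper: delete records whose 4th and 5th fields are both
# exactly 0, whether or not more fields follow them.
drop_clean_records() {
  # ([^,]*,){3} skips the first three fields;
  # (,|$) requires the 5th field to end right after the second 0.
  sed -E '/^([^,]*,){3}0,0(,|$)/d'
}

s='id, date of birth, grade, explusion, serious misdemeanor, info
123,2005-01-01,5.36,1,1,
582,1999-05-12,8.51,0,1
9274,2001-25-12,9.65,0,0,pass
21,2006-14-05,0.53,4,1,repeat'

result=$(drop_clean_records <<< "$s")
printf '%s\n' "$result"
```

With the six-column sample, only the 9274 record should be dropped; the header and the other three records pass through unchanged.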

CodePudding user response:

You seem to think * means "anything" but it means "repeat the previous regular expression zero or more times, as many as possible". Regular expressions are different from wildcards as used in many shells and search engines, where * often does mean "any string".

The regular expression .* means "any character at all, repeated as many times as possible" but in this case you clearly mean [^,]* which means "any character which isn't a comma, repeated as many times as possible."
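
To make the difference concrete, here is a small sketch on a five-column version of the sample (the variable names are only illustrative). In POSIX basic regular expressions, which sed uses by default, a * that directly follows ^ has nothing to repeat and is taken as a literal asterisk, so the pattern from the question should match no record at all. Spelling each field out as [^,]* gives the intended field-by-field match:

```shell
#!/bin/bash
s='id, date of birth, grade, explusion, serious misdemeanor
123,2005-01-01,5.36,1,1
582,1999-05-12,8.51,0,1
9274,2001-25-12,9.65,0,0
21,2006-14-05,0.53,4,1'

# Pattern from the question: the * after ^ is a literal asterisk,
# so no line matches and the input passes through unchanged.
broken=$(sed '/^*,*,*,0,0$/d' <<< "$s")

# Each of the first three fields written as [^,]*
# (zero or more characters that are not a comma).
fixed=$(sed '/^[^,]*,[^,]*,[^,]*,0,0$/d' <<< "$s")

# grep -c '' counts the lines that each variant kept.
printf 'broken pattern kept %s lines\n' "$(grep -c '' <<< "$broken")"
printf 'fixed pattern kept %s lines\n' "$(grep -c '' <<< "$fixed")"
```

The fully anchored form only works when the line has exactly five fields, which is one reason the shorter substring match is often preferred.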

However, sed will happily match on a substring, so just

sed -i '/,0,0$/d' file.csv

should work, or equivalently

grep -v ',0,0$' file.csv >temp && mv temp file.csv

CodePudding user response:

Using sed

$ sed 's/,/&#/3;/#0/d;s/,/&#/4;/#0/d;s/#//g' input_file
id, date of birth, grade, explusion, serious misdemeanor, info
123,2005-01-01,5.36,1,1,
21,2006-14-05,0.53,4,1,repeat

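This places a marker (#) after the third comma, and later after the fourth, on every line. If a marker is directly followed by 0, the line is deleted; the final substitution strips the markers from the lines that remain. Note that this deletes a record when either field starts with 0, not only when both are 0, which is why the 582 record (0,1) is also missing from the output.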
