Count certain character in strings in bigfile

Time:08-11

Is there a quick Linux way to check whether a line of a text file contains more than 3 "|" characters, with something like Filtered.txt as input and output.txt as output?

I tried some simple bash, but it was way too slow; I calculated the runtime at around 3 weeks.

CodePudding user response:

The solution to your problem might be:

awk -F '|' 'BEGIN{print "Line\tCount"}{print NR "\t" NF-1}' input.txt

You can of course save the output to a file like this:

awk -F '|' 'BEGIN{print "Line\tCount"}{print NR "\t" NF-1}' input.txt > output.txt

I found the solution in this post:

https://www.baeldung.com/linux/count-occurrences-of-character-per-line-field

EDIT:

If you want to display only the lines that contain more than 3 '|':

awk -F '|' 'BEGIN{print "Line\tCount"}{if(NF-1 > 3){print NR "\t" NF-1}}' text.txt
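If the goal is to keep the matching lines themselves rather than a line/count table, the same field-count test can act as a filter. This is only a minimal sketch: the file names sample.txt and sample_out.txt and the sample data are made up for illustration. With '|' as the field separator, a line containing n pipes splits into n+1 fields, so NF > 4 means "more than 3 pipes".

```shell
# Hypothetical sample input, only for demonstration.
printf '%s\n' 'a|b|c' 'a|b|c|d|e' '|||' '1|2|3|4|5|6' > sample.txt

# A line with n pipes has n+1 fields, so NF > 4 selects lines with more
# than 3 pipes. A bare condition with no action prints the whole line.
awk -F '|' 'NF > 4' sample.txt > sample_out.txt

cat sample_out.txt
```

Only the second and fourth sample lines have more than three pipes, so those are the ones written to sample_out.txt.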

CodePudding user response:

Please see @RavinderSingh13's comment, and this guide, and this reference.

But don't go away with your feelings hurt. Come sooner and more often; just follow the guidelines.

Some ways to start -

Make a test file.

$: cat x
0    
|||
this|line|has|four|pipes
four|is more|than three|pipes
0|123|45|6
this|line|also|has|at|LEAST||four|pipes|...
0|1|2

If you want the data:

$: grep -E '([|].*){4,}' x # with extended pattern matching
this|line|has|four|pipes
this|line|also|has|at|LEAST||four|pipes|...

$: grep '[|].*[|].*[|].*[|]' x # with basic; both versions work in all these
this|line|has|four|pipes
this|line|also|has|at|LEAST||four|pipes|...

$: sed -En '/([|].*){4,}/p' x
this|line|has|four|pipes
this|line|also|has|at|LEAST||four|pipes|...

$: awk '/([|].*){4,}/' x
this|line|has|four|pipes
this|line|also|has|at|LEAST||four|pipes|...

If you just want to know how many:

$: grep -Ec '([|].*){4,}' x
2

If you want the line numbers:

$: grep -En '([|].*){4,}' x
3:this|line|has|four|pipes
6:this|line|also|has|at|LEAST||four|pipes|...

If you just want to know if they exist:

$: grep -Eq '([|].*){4,}' x && echo found || echo none
found

Generally, let a precompiled program handle the searching for you; it's faster.

On the other hand, if you have several steps to accomplish, it's usually faster NOT to spawn multiple programs in a loop. Pull it all into one program (be it all in bash, in awk, or in some other language like perl, python, etc.) that can do all the things you need, preferably in whatever you are most comfortable with.
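To illustrate the point about not spawning programs in a loop, here is a minimal sketch that does the test in the shell alone, using a case pattern instead of calling grep or wc once per line. The file names pipes_demo.txt and pipes_demo_out.txt are hypothetical. For a really big file, a single grep or awk pass like the ones above is still likely to be quicker; this only shows what a loop without any per-line process looks like.

```shell
# Hypothetical sample input, only for demonstration.
printf '%s\n' '0' '|||' 'this|line|has|four|pipes' '0|1|2' > pipes_demo.txt

# Read line by line; no external program is started inside the loop.
while IFS= read -r line; do
    case $line in
        # Matches any line that contains at least four "|" characters.
        *'|'*'|'*'|'*'|'*) printf '%s\n' "$line" ;;
    esac
done < pipes_demo.txt > pipes_demo_out.txt

cat pipes_demo_out.txt
```

The case pattern is plain shell globbing, so it works in any POSIX shell, not just bash.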

CodePudding user response:

If I have a file alpha.txt and want to create a file beta.txt that includes only those lines from alpha.txt with more than three '|' characters, I would do:

grep -E '(.*\|){4}' <alpha.txt >beta.txt

. . . but if that's not what you're looking for then as noted above we'll need more details.
