word count using awk for the sample file

Time:10-13

The awk script should produce a table that lists words with more than 4 letters and more than 2 occurrences. The last line of the output should display the number of words in the file.

Input file

In navigation, the heading of a vessel or aircraft is the compass direction
in which the craft's bow or nose is pointed. Note that the heading may not
#necessarily be the direction that the vehicle actually travels, which is
known as its course or track. Any difference between the heading and
course is due to the motion of the underlying medium, the air or water,
or other effects like skidding or slipping. The difference is known as
the drift, and can be determined by the wind triangle. At least seven ways
to measure the heading of a vehicle have been described. A compass installed
in a vehicle or vessel has a certain amount of error caused by the magnetic 
properties of the vessel. This error is known as compass deviation. The 
magnitude of the compass deviation varies greatly depending upon the local 
anomalies created by the vessel. A fiberglass recreational vessel will 
#generally have much less compass deviation than a steel-hulled vessel. 
Electrical wires carrying current have a small magnetic field around them 
and can cause deviation. Any type of magnet, such as found in a speaker 
can also cause large magnitudes of compass deviation. The error can be 
corrected using a deviation table. Deviation tables are very difficult to 
create. Once a deviation table is established, it is only good for that 
particular vessel, with that particular configuration. If electrical wires 
are moved or anything else magnetic (speakers, electric motors, etc.) are 
moved, the deviation table will change. All deviations in the deviation 
table are indicated west or east. If the compass is pointing west of the 
Magnetic North Pole, then the deviation is westward. If the compass is 
pointing east of the Magnetic North Pole, then the deviation is eastward.
  1. Write an awk script that produces a report from an input file. The report counts the number of times a word occurs in the input file.

  2. Separately implement the additional functionality that ignores punctuation characters such as ".,;:()".

cat input.txt | awk -F" " '{for(i=1;i<=NF;i++) a[$i]++} END {for(k in a) print k,a[k]}'

Outcome I expect: (image not available)

CodePudding user response:

NOTE: this looks (to me) like a homework assignment so I'm just going to address current code.

awk '{ for (i=1;i<=NF;i++) a[$i]++ }             # remains the same
END  { for (k in a) { 
           total+=a[k]                           # keep track of total word count
           if ( a[k] > 2 && length(k) > 4 )      # apply filters to limit output
              print k,a[k]
       }
       printf "\nTotal: %s\n", total
     }
' input.txt

This generates:

compass 8
deviation 8       #  1 for 'Deviation'                 ?
deviation. 3      # need to strip "."
error 3
heading 4
known 3
magnetic 3        #  2 for 'Magnetic'                  ?
table 3           #  1 for 'table.'                    ?
vehicle 3
vessel 3          #  1 for 'vessel,'                   ?
vessel. 3         # need to strip "."
                  #  2 for 'moved' & 'moved.'          ?
                  #  2 for 'electrical' & 'Electrical' ?
Total: 292

NOTES:

  • no sorting requirements have been provided (I piped through sort -f to generate this output so it's easier to see where potential issues need to be addressed)
  • assuming OP's desired output (in the image) is a subset of what's expected (eg, vehicle does not show up in OP's output; my script found compass 8 while OP's (image) shows compass 7)
  • assuming number of words in file applies to ALL words and not just those of length > 4 and count > 2 (I found 292 while OP's output shows 271); otherwise OP will need to add logic to determine if a[$i]++ should be performed
  • OP needs to implement additional logic to strip out punctuation before performing the a[$i]++ (hint: gsub() or gensub() functions)
  • does OP need to implement case-insensitive storage of words for counting purposes? if so this will increase the number of hits for some words (eg, magnetic and deviation) (hint: tolower() function)
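
One way the gsub() and tolower() hints above might fit together is sketched below. It is only a sketch under assumptions: the wrapper name count_words is hypothetical, the set of punctuation characters is taken from the ".,;:()" in the question, and case-folding is applied even though the question does not say whether it is wanted.

```shell
# Hypothetical helper: count words after normalising each field.
# Reads the files given as arguments, or stdin if none are given.
count_words() {
    awk '
    {
        for (i = 1; i <= NF; i++) {
            w = tolower($i)               # fold case: "Deviation" -> "deviation"
            gsub(/[.,;:()]/, "", w)       # strip the punctuation listed in the question
            if (w != "") {                # skip fields that were punctuation only
                a[w]++
                total++                   # total counts ALL words, not just the filtered ones
            }
        }
    }
    END {
        for (k in a)
            if (a[k] > 2 && length(k) > 4)   # same filters as in the script above
                print k, a[k]
        printf "\nTotal: %d\n", total
    }' "$@"
}
```

It would be called as count_words input.txt. The order in which for (k in a) visits the words is unspecified, so the report lines may come out in any order.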

CodePudding user response:

$ cat awk.script

#!/usr/bin/env awk -f 

BEGIN {
    print "\tWord Count\n--------------------"
} /vessel/ {
    vessel++;next
} /compass/ {
    compass++
} /known/ {
    known++
} /table/ {
    table++
} /heading/ {
    heading++
} /magnetic/ {
    mag++
} /deviation/ {
    dev++
} /error/ {
    error++
} END {
    print "\tvessel "vessel "\n\tcompass "compass "\n\tknown "known "\n\ttable "table "\n\theading "heading "\n\tmagnetic "mag "\n\tdeviation "dev "\n\terror "error "\n--------------------\nNumber of words: " total
} ; {
    total=vessel + compass + known + table + heading + mag + dev + error
}
$ awk -f awk.script  RS=" " input_file
   Word Count
--------------------
        vessel 7
        compass 8
        known 3
        table 5
        heading 4
        magnetic 3
        deviation 12
        error 3
--------------------
Number of words: 45


      