I have a text file that I would like to go through and count each time a succession of two words appears. For example, my desired output would look like this:
Sample input:
I am a man
desired output:
1 I am
1 am a
1 a man
How I thought about doing this is so:
cat $1 | sed "s/ /\n/g" | read word1 &&
while read word2;
do
echo "$word1 $word2";
word1=word2;
done
This gets an infinite loop though. Any help appreciated!
CodePudding user response:
Call read twice in the while condition.
while read line1; read line2; do
echo "$line1 $line2"
done <<EOF
1
a
2
b
EOF
will output
1 a
2 b
The loop exits when the second read fails, even if the first succeeds. If you want to execute the body anyway (even with an empty line2), move read line2 into the body of the loop.
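For example, a sketch of that variant: with read line2 inside the body, an odd number of input lines still produces a final output line, with line2 left empty.

```shell
# read line2 inside the body: a failed read leaves line2 empty
# instead of ending the loop before the echo runs
while read line1; do
    read line2
    echo "$line1 $line2"
done <<EOF
1
a
2
EOF
```

Here the last iteration prints "2 " (with a trailing space), which the original form would have skipped.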
CodePudding user response:
With bash:
set -f # for slurping in the words of the file, we want word splitting
# but not glob expansion
words=( $(< "$1") )
for ((i = 1; i < ${#words[@]}; i++)); do
printf "%s %s\n" "${words[i-1]}" "${words[i]}"
done
Given @chepner's input file, this outputs
1 a
a 2
2 b
A rewrite of your code: you need a grouping construct so that all the reads are reading from the same pipeline of data.
tr -s '[:space:]' '\n' < "$1" | {
IFS= read -r word1
while IFS= read -r word2; do
echo "$word1 $word2"
word1=$word2
done
}
For counting, the simplest method is to pipe the output into sort | uniq -c.
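Putting the pieces together, that looks like the sketch below; the sample lines are inlined with printf here instead of being read from "$1", so the snippet is self-contained.

```shell
# emit word pairs (the tr/read pipeline above), then count duplicates;
# sample input is inlined rather than read from "$1"
printf '%s\n' 'I am a man' 'I am not a man I' 'am a man' \
| tr -s '[:space:]' '\n' \
| {
    IFS= read -r word1
    while IFS= read -r word2; do
        echo "$word1 $word2"
        word1=$word2
    done
} | sort | uniq -c
```

Note that uniq -c left-pads the counts with spaces, so the output columns look slightly different from the listing below.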
With the words.dat file from @markp-fuso, the output from both of these solutions is
3 I am
3 a man
2 am a
1 am not
2 man I
1 not a
The counting can be done in bash using an associative array:
declare -A pairs
for ((i = 1; i < ${#words[@]}; i++)); do
    key="${words[i-1]} ${words[i]}"
    pairs[$key]=$(( pairs[$key] + 1 ))
done
for key in "${!pairs[@]}"; do
    printf "%s %s\n" "${pairs[$key]}" "$key"
done
1 not a
3 a man
1 am not
2 am a
3 I am
2 man I
CodePudding user response:
Assumptions:
- counts are accumulated across the entire file (as opposed to restarting the counts for each new line)
- word pairs can span lines, eg, one\nword is the same as one word
- we're only interested in 2-word pairings, ie, no need to code for a dynamic number of words (eg, 3-words, 4-words)
Sample input data:
$ cat words.dat
I am a man
I am not a man I
am a man
One awk idea:
$ awk -v RS='' '                     # treat file as one loooong single record
{ for (i=1;i<NF;i++)                 # loop through list of fields 1 - (NF-1)
      count[$(i)" "$(i+1)]++         # use fields i and i+1 as array index
}
END { for (i in count)               # loop through array indices
          print count[i],i
}
' words.dat
This generates:
2 am a
3 a man
1 am not
3 I am
1 not a
2 man I
NOTE: No sorting requirement was stated; otherwise we could pipe the result to sort or, if using GNU awk, add an appropriate PROCINFO["sorted_in"] assignment.
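With GNU awk, that might look like the sketch below ("@val_num_desc" orders the END loop's traversal by value, i.e. by count, highest first; other awks simply ignore the assignment and print in arbitrary order). The sample input is inlined here.

```shell
# sorted traversal needs GNU awk (gawk); PROCINFO["sorted_in"] is a no-op elsewhere
printf '%s\n' 'I am a man' 'I am not a man I' 'am a man' \
| awk -v RS='' '
    { for (i = 1; i < NF; i++) count[$(i)" "$(i+1)]++ }
    END {
        PROCINFO["sorted_in"] = "@val_num_desc"   # traverse by value, descending
        for (i in count) print count[i], i
    }'
```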
OP's original input:
$ awk -v RS='' '{for (i=1;i<NF;i++) count[$(i)" "$(i+1)]++} END {for (i in count) print count[i],i}' <<< "I am a man"
1 am a
1 a man
1 I am
Removing the assumption about a fixed number of words ...
$ awk -v wcnt=2 -v RS='' '                  # <word_count> = 2; treat file as one loooong single record
NF>=wcnt { for (i=1;i<=(NF-wcnt+1);i++) {   # loop through list of fields 1 - (NF-<word_count>+1)
               pfx=key=""
               for (j=0;j<wcnt;j++) {       # build count[] index from <word_count> fields
                   key=key pfx $(j+i)
                   pfx=" "
               }
               count[key]++
           }
}
END { for (i in count)                      # loop through array indices
          print count[i],i
}
' words.dat
With -v wcnt=2:
2 am a
3 a man
1 am not
3 I am
1 not a
2 man I
With -v wcnt=3:
1 not a man
2 I am a
1 I am not
2 man I am
2 am a man
2 a man I
1 am not a
With -v wcnt=5:
1 I am a man I
1 I am not a man
1 am not a man I
1 am a man I am
1 man I am a man
1 man I am not a
1 a man I am not
1 not a man I am
1 a man I am a
With -v wcnt=3 and awk '...' <<< "I am a man":
1 I am a
1 am a man
With -v wcnt=5 and awk '...' <<< "I am a man":
# no output since less than wcnt=5 words to work with