I have a text file that I would like to go through and count each time a succession of two words appears. For example, my desired output would look like this:
Sample input:
I am a man
desired output:
1 I am
1 am a
1 a man
How I thought about doing this is so:
cat $1 | sed "s/ /\n/g" | read word1 &&
while read word2;
do
echo "$word1 $word2";
word1=word2;
done
This gets an infinite loop though. Any help appreciated!
CodePudding user response:
Call read twice in the while condition.
while read line1; read line2; do
echo "$line1 $line2"
done <<EOF
1
a
2
b
EOF
will output
1 a
2 b
The loop exits when the second read fails, even if the first succeeds. If you want to execute the body anyway (even with an empty line2), move read line2 into the body of the loop.
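For example, a sketch of that variant: with read line2 inside the body, an odd number of input lines still produces a final output line, with line2 left empty.

```shell
# read line2 inside the body: a failed read leaves line2 empty
# instead of ending the loop before the echo runs
while read line1; do
    read line2
    echo "$line1 $line2"
done <<EOF
1
a
2
EOF
```

Here the last iteration prints "2 " (with a trailing space), which the original form would have skipped.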
CodePudding user response:
With bash:
set -f # for slurping in the words of the file, we want word splitting
# but not glob expansion
words=( $(< "$1") )
for ((i = 1; i < ${#words[@]}; i++)); do
printf "%s %s\n" "${words[i-1]}" "${words[i]}"
done
Given @chepner's input file, this outputs
1 a
a 2
2 b
A rewrite of your code: you need a grouping construct so that all the reads are reading from the same pipeline of data.
tr -s '[:space:]' '\n' < "$1" | {
IFS= read -r word1
while IFS= read -r word2; do
echo "$word1 $word2"
word1=$word2
done
}
For counting, the simplest method is to pipe the output into sort | uniq -c.
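Putting the pieces together, that looks like the sketch below; the sample lines are inlined with printf here instead of being read from "$1", so the snippet is self-contained.

```shell
# emit word pairs (the tr/read pipeline above), then count duplicates;
# sample input is inlined rather than read from "$1"
printf '%s\n' 'I am a man' 'I am not a man I' 'am a man' \
| tr -s '[:space:]' '\n' \
| {
    IFS= read -r word1
    while IFS= read -r word2; do
        echo "$word1 $word2"
        word1=$word2
    done
} | sort | uniq -c
```

Note that uniq -c left-pads the counts with spaces, so the output columns look slightly different from the listing below.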
With the words.dat file from @markp-fuso, the output from both of these solutions is
3 I am
3 a man
2 am a
1 am not
2 man I
1 not a
The counting can be done in bash using an associative array:
declare -A pairs
for ((i = 1; i < ${#words[@]}; i++)); do
    key="${words[i-1]} ${words[i]}"
    pairs[$key]=$(( pairs[$key] + 1 ))
done
for key in "${!pairs[@]}"; do
    printf "%s %s\n" "${pairs[$key]}" "$key"
done
1 not a
3 a man
1 am not
2 am a
3 I am
2 man I
CodePudding user response:
Assumptions:
- counts are accumulated across the entire file (as opposed to restarting the counts for each new line)
- word pairs can span lines, eg, one\nword is the same as one word
- we're only interested in 2-word pairings, ie, no need to code for a dynamic number of words (eg, 3-words, 4-words)
Sample input data:
$ cat words.dat
I am a man
I am not a man I
am a man
One awk idea:
$ awk -v RS='' '                     # treat file as one loooong single record
{ for (i=1;i<NF;i++)                 # loop through list of fields 1 - (NF-1)
      count[$(i)" "$(i+1)]++         # use fields i and i+1 as array index
}
END { for (i in count)               # loop through array indices
          print count[i],i
}
' words.dat
This generates:
2 am a
3 a man
1 am not
3 I am
1 not a
2 man I
NOTE: No sorting requirement was stated; otherwise we could pipe the result to sort or, if using GNU awk, add an appropriate PROCINFO["sorted_in"] assignment.
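With GNU awk, that might look like the sketch below ("@val_num_desc" orders the END loop's traversal by value, i.e. by count, highest first; other awks simply ignore the assignment and print in arbitrary order). The sample input is inlined here.

```shell
# sorted traversal needs GNU awk (gawk); PROCINFO["sorted_in"] is a no-op elsewhere
printf '%s\n' 'I am a man' 'I am not a man I' 'am a man' \
| awk -v RS='' '
    { for (i = 1; i < NF; i++) count[$(i)" "$(i+1)]++ }
    END {
        PROCINFO["sorted_in"] = "@val_num_desc"   # traverse by value, descending
        for (i in count) print count[i], i
    }'
```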
OP's original input:
$ awk -v RS='' '{for (i=1;i<NF;i++) count[$(i)" "$(i+1)]++} END {for (i in count) print count[i],i}' <<< "I am a man"
1 am a
1 a man
1 I am
Removing the assumption about a fixed number of words ...
$ awk -v wcnt=2 -v RS='' '                  # <word_count> = 2; treat file as one loooong single record
NF>=wcnt { for (i=1;i<=(NF-wcnt+1);i++) {   # loop through list of fields 1 - (NF-<word_count>+1)
               pfx=key=""
               for (j=0;j<wcnt;j++) {       # build count[] index from <word_count> fields
                   key=key pfx $(j+i)
                   pfx=" "
               }
               count[key]++
           }
}
END { for (i in count)                      # loop through array indices
          print count[i],i
}
' words.dat
With -v wcnt=2:
2 am a
3 a man
1 am not
3 I am
1 not a
2 man I
With -v wcnt=3:
1 not a man
2 I am a
1 I am not
2 man I am
2 am a man
2 a man I
1 am not a
With -v wcnt=5:
1 I am a man I
1 I am not a man
1 am not a man I
1 am a man I am
1 man I am a man
1 man I am not a
1 a man I am not
1 not a man I am
1 a man I am a
With -v wcnt=3 and awk '...' <<< "I am a man":
1 I am a
1 am a man
With -v wcnt=5 and awk '...' <<< "I am a man":
# no output since less than wcnt=5 words to work with