Home > Enterprise >  Tagging words in a corpus
Tagging words in a corpus

Time:11-11

I am a linguist by trade and use Perl to help me organize my data so I am not at all proficient and my solutions are not very streamlined :) My problem today, is that I want to tag all occurrences of the items in a wordlist in a corpus of example sentences.

I wrote a (sloppy) script for another experiment but it only identifies the first word from the list that it finds and then moves to the next. I need it to find all words in all sentences and on top of that, assign a tag in the second column of the wordlist. My head just cannot get around how to do this. Help!

So say my wordlist looks like this (a tab delimited .txt file with 1 column for the word and 1 for the tag to be assigned, each word and tag on one line):

boy (\t) ID01
sleep (\t) ID02
dog (\t) ID03
hungry (\t) ID04
home (\t) ID05

The original corpus would just be a tokenised .txt with one sentence per line, eg:

The boy and his dog met a hungry lion .
They needed to catch up on sleep after the dog and the boy came home .

I ideally want output like the following:

(1) A tagged corpus in the format:

The boy [ID01] and his dog [ID03] met a hungry [ID04] lion .

They needed to catch up on sleep [ID02] after the dog [ID03] and the boy [ID01] came home [ID05].

(2) A list of the words with their tags that were not found in the corpus at all. I previously just printed the words to a .txt and this worked fine.

I hope this makes sense! Here is what I have used previously to find sentences containing the words. This code is based on a much simpler word list without the ID tags and I was just looking for any matches to see if my corpus at least contained some examples. How can I best adapt it? This took me ages to write but I am learning!

Thank you!

use strict;
use warnings;

my %words;
my %print;

open (IN2, "<LemmasFromTagset.MG.2022-10-17.txt"); #the wordlist

while (my $s = <IN2>)
{
    chomp $s;
    my $lcs = lc $s;
    $words{$lcs} = 1;
}
close(IN2);
open (OUT, ">TaggedSentences.txt"); #the final output with tagged sentences
open (OUT2, ">NotFound.txt"); #words for which there are no sentences

foreach my $word (sort keys (%words))
{
    open (IN,"<all-sentences_cleaned.tagged.desentensised.txt"); #the corpus
    
    print $word."\n";
    
    my $count = 0;
    
    while(my $s = <IN>)
    {
        chomp $s;
        my $lcs = lc $s;
        if ($lcs =~ /^(.*)(\W )($word)(\W )(.*)$/)
        {
        print OUT $word."\t".$s."\n";
        $count   ;
        }
    elsif ($lcs =~ /^($word)(\W )(.*)$/)
    {
       print OUT $word."\t".$s."\n";
       $count   ;
    }
    }
    
    if ($count == 0)
    {
    print OUT2 $word."\n";
    }
    close(IN);
}
close(OUT);
close (OUT2);

CodePudding user response:

I'm not sure I quite follow the logic of your code, but the replacement is easy enough.

  1. Process the key file, make the key lower case and the value the tag in your %words hash. E.g. $words{$key} = $value. We can do that quick and easy with a do statement, where we slurp the file and process it with a map statement.
  2. Make a regex to search for the key words, using the alternator |.
  3. Read the input file, find and capture the key word with parens (), keep the matched word with \K. Replace (add) a space, tag open, your hash value made lower case with the \L escape sequence, and tag close.
  4. Print.
use strict;
use warnings;

my %words = do { 
    open my $fh, "<", "words.tsv" or die $!;
    map { chomp; split /\t/ } <$fh>;
};
my $find = join '|', map lc, keys %words;
while (<DATA>) {
    s/($find)\K/ <$words{ "\L$1" }>/ig;
    print;
}


__DATA__
The boy and his dog met a hungry lion .
They needed to catch up on sleep after the dog and the boy came home .

If you want to make this program flexible, you can just replace <DATA> with <> and use the file name to process as argument, and then redirect the output to a file e.g.:

$ perl words.pl corpus.txt > output.txt

CodePudding user response:

If I understand correctly, you might benefit from two hashes for your wordlist, one for the IDs and one for the frequencies. For the IDs:

my %words2ids;  # will be { "sleep" => "ID02", "boy" => "ID01", ...}

open(my $lemmas, "...") or die;
while (my $line = <$lemmas>) {
  chomp($line);
  my ($word, $id) = split "\t", $line;
  $words2ids{ lc($word) } = $id;   # note: lc($word)
}

Next, go through the raw corpus' tokens, counting and tagging the ones you're interested in:

my %freq;
open (my $output, "...") or die;
....
while (my $line = <$corpus>) {
  chomp($line);
  my @tokens = split ' ', $line;
  foreach my $token (@tokens) {
    my $lct = lc $token;
    if (my $id = $words2ids{ $lct }) { # false if no entry
      $freq{$lct}  ;     # add one to the frequency count
      $token .= " $id";  # cheat a bit and append $id to the aliased token
    }
   }
  # now reconstruct the input line for our output
  say { $output } "@tokens"; # use feature qw(say)
}

In the above we run through the corpus once, instead of once per word in your taglist.

Now your "NotFound" entries are those keys in %words2ids that do not appear in %freq:

open (my $notfound, "...") or die;
foreach my $word (sort keys(%words2ids)) {
  next if exists $freq{$word};  # skip if we've seen it
  say { $notfound } "$word $words2ids{$word}";
}

That's not terribly idiomatic nor particularly slick Perl — and it certainly isn't tested! — but I think it fairly captures your problem and the approach you want to take.

  •  Tags:  
  • perl
  • Related