Iterate over vector of vectors of Strings without using for loops in Julia

Time:10-18

Given a vector of vectors of strings, like:

sentences = [ ["Julia", "is", "1000x", "faster", "than", "Python!"], 
              ["Julia", "reads", "beautiful!"], 
              ["Python", "has", "600", "times", "more", "libraries"] 
]

I'm trying to filter out some tokens in each of them, without losing the outer vector structure (i.e., without flattening the vector down to a single list of tokens).

So far I've achieved this using a classic for loop:

number_of_alphabetical_tokens = []
number_of_long_words = []
total_tokens = []

for sent in sentences
    append!(number_of_alphabetical_tokens, length([token for token in sent if all(isletter, token)]))
    append!(number_of_long_words, length([token for token in sent if length(token) > 2]))
    append!(total_tokens, length(sent))
end

collect(zip(number_of_alphabetical_tokens, number_of_long_words, total_tokens))

output: (edited as per @shayan observation)

3-element Vector{Tuple{Any, Any, Any}}:
 (4, 5, 6)
 (2, 3, 3)
 (5, 6, 6)

This gets the job done, but it takes more time than I'd like (I have 6000 documents, with thousands of sentences each...), and it looks a bit like an antipattern.

Is there a way of doing this with comprehensions or broadcasting (or any more performant method)?

Answer:

First, I think the expected output in your original post had mistakes; for example, you wrote 7 for the total number of tokens in the first sentence, while it should be 6.
You can do this in a fully vectorized way, with no explicit for loop:

julia> sentences = [ ["Julia", "is", "1000x", "faster", "than", "Python!"],
                     ["Julia", "reads", "beautiful!"],
                     ["Python", "has", "600", "times", "more", "libraries"]
                   ];

julia> function check_all_letter(str::String)
           all(isletter, str)
       end
check_all_letter (generic function with 1 method)

julia> all_letters = map(x->filter(y->check_all_letter.(y), x), sentences)
3-element Vector{Vector{String}}:
 ["Julia", "is", "faster", "than"]
 ["Julia", "reads"]
 ["Python", "has", "times", "more", "libraries"]

julia> length.(all_letters)
3-element Vector{Int64}:
 4
 2
 5

I can make a similar procedure for number_of_long_words and total_tokens. Wrapping all of it in a function, I'll have:

julia> function arbitrary_name(vec::Vector{Vector{String}})
           all_letters = map(x->filter(y->check_all_letter.(y), x), vec)
           long_words = map(x->filter(y->length.(y).>2, x), vec)
           total_tokens = length.(vec)

           return collect(zip( length.(all_letters),
                               length.(long_words),
                               total_tokens
                             )
                   )
       end
arbitrary_name (generic function with 1 method)

julia> arbitrary_name(sentences)
3-element Vector{Tuple{Int64, Int64, Int64}}:
 (4, 5, 6)
 (2, 3, 3)
 (5, 6, 6)
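
Since the question mentions performance, here is a minimal sketch of a variant that avoids building the intermediate filtered vectors by counting matches directly with count. The name token_stats is hypothetical (it is not part of the code above), and I have not benchmarked it against the filter-based version on real data, so treat the speed benefit as an assumption to verify.

```julia
sentences = [["Julia", "is", "1000x", "faster", "than", "Python!"],
             ["Julia", "reads", "beautiful!"],
             ["Python", "has", "600", "times", "more", "libraries"]]

# Hypothetical helper: one tuple of counts per sentence.
# count(pred, itr) counts the elements for which pred returns true,
# so no filtered copy of the sentence is allocated.
token_stats(sent) = (count(t -> all(isletter, t), sent),  # alphabetical tokens
                     count(t -> length(t) > 2, sent),     # tokens longer than 2 characters
                     length(sent))                        # all tokens

# map keeps the outer structure: one result per sentence.
stats = map(token_stats, sentences)
```

Here stats holds one tuple per sentence, in the same order as the zip-based result above, and the element type is concrete instead of Any.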

Additional explanation

When I write something like length.(y).>2, I'm chaining Julia functions through broadcasting (the dot syntax). This example shows step by step what happens in length.(y).>2:

julia> vec = ["foo", "bar", "baz"];

julia> lengths = length.(vec)
3-element Vector{Int64}:
 3
 3
 3

julia> more_than_two = lengths .> 2
3-element BitVector:
 1
 1
 1

# This is equivalent to:
julia> length.(vec).>2
3-element BitVector:
 1
 1
 1

# Or
julia> vec .|> length .|> x -> x > 2
3-element BitVector:
 1
 1
 1
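
One detail worth spelling out: inside filter, the argument y is a single String rather than a vector. Broadcasting treats a String as a scalar, so in that position the dotted call returns a plain Bool, which is what filter expects. A small sketch of the difference (the variable names are only for illustration):

```julia
word = "foo"
# A String is a scalar under broadcasting, so this is a single Bool.
single = length.(word) .> 2

tokens = ["Julia", "is", "1000x"]
# On a vector, the same expression yields one Bool per element.
mask = length.(tokens) .> 2

# Inside filter the predicate sees one token at a time, so the
# undotted form gives the same result as the dotted one.
kept = filter(t -> length(t) > 2, tokens)
kept_dotted = filter(t -> length.(t) .> 2, tokens)
```

So the dots in the filter predicate are harmless but optional; they only matter when the expression is applied to a whole vector at once.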

I hope this helps, @fandak.
