Given a vector of vectors of strings, like:
sentences = [ ["Julia", "is", "1000x", "faster", "than", "Python!"],
["Julia", "reads", "beautiful!"],
["Python", "has", "600", "times", "more", "libraries"]
]
I'm trying to filter out some tokens in each of them, without losing the outer vector structure (i.e., without flattening the vector down to a single list of tokens).
So far I've achieved this using a classic for loop:
number_of_alphabetical_tokens = []
number_of_long_words = []
total_tokens = []
for sent in sentences
append!(number_of_alphabetical_tokens, length([token for token in sent if all(isletter, token)]))
append!(number_of_long_words, length([token for token in sent if length(token) > 2]))
append!(total_tokens, length(sent))
end
collect(zip(number_of_alphabetical_tokens, number_of_long_words, total_tokens))
output: (edited as per @shayan observation)
3-element Vector{Tuple{Any, Any, Any}}:
(4, 5, 6)
(2, 3, 3)
(5, 6, 6)
This gets the job done, but it takes more time than I'd like (I have 6000 documents, with thousands of sentences each...), and it looks a bit like an antipattern.
Is there a way of doing this with comprehensions or broadcasting (or any more performant method)?
CodePudding user response:
First, I think there is a mistake in your expected results: for example, you wrote 7 for the total number of tokens in the first element of sentences, while it should actually be 6.
You can use the following fully vectorized procedure:
julia> sentences = [ ["Julia", "is", "1000x", "faster", "than", "Python!"],
["Julia", "reads", "beautiful!"],
["Python", "has", "600", "times", "more", "libraries"]
];
julia> function check_all_letter(str::String)
all(isletter, str)
end
check_all_letter (generic function with 1 method)
julia> all_letters = map(x->filter(y->check_all_letter.(y), x), sentences)
3-element Vector{Vector{String}}:
["Julia", "is", "faster", "than"]
["Julia", "reads"]
["Python", "has", "times", "more", "libraries"]
julia> length.(all_letters)
3-element Vector{Int64}:
4
2
5
I can follow a similar procedure for number_of_long_words and total_tokens. Wrapping all of it in a function, I get:
julia> function arbitrary_name(vec::Vector{Vector{String}})
all_letters = map(x->filter(y->check_all_letter.(y), x), vec)
long_words = map(x->filter(y->length.(y).>2, x), vec)
total_tokens = length.(vec)
return collect(zip( length.(all_letters),
length.(long_words),
total_tokens
)
)
end
arbitrary_name (generic function with 1 method)
julia> arbitrary_name(sentences)
3-element Vector{Tuple{Int64, Int64, Int64}}:
(4, 5, 6)
(2, 3, 3)
(5, 6, 6)
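If the filtered tokens themselves are not needed, the counts can probably be computed without building the intermediate vectors at all. This is a minimal sketch under that assumption; token_stats is a hypothetical name, not part of the procedure above, and I have not benchmarked it on a large corpus:

```julia
# Hypothetical helper: count matching tokens per sentence with `count`,
# so no filtered vectors are allocated.
function token_stats(sentences::Vector{Vector{String}})
    return [(count(t -> all(isletter, t), sent),  # tokens made only of letters
             count(t -> length(t) > 2, sent),     # tokens longer than 2 characters
             length(sent))                        # all tokens in the sentence
            for sent in sentences]
end

sentences = [["Julia", "is", "1000x", "faster", "than", "Python!"],
             ["Julia", "reads", "beautiful!"],
             ["Python", "has", "600", "times", "more", "libraries"]]

stats = token_stats(sentences)
```

Because the comprehension yields tuples of Int, the result is a concretely typed vector rather than one with Any elements, which tends to help performance in Julia.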
Additional explanation
When I write something like length.(y).>2, I am chaining Julia functions through vectorization (broadcasting). Consider this example, applied to a whole vector, to see what happens in length.(y).>2:
julia> vec = ["foo", "bar", "baz"];
julia> lengths = length.(vec)
3-element Vector{Int64}:
3
3
3
julia> more_than_two = lengths .> 2
3-element BitVector:
1
1
1
# This is exactly equal to this:
julia> length.(vec).>2
3-element BitVector:
1
1
1
# Or
julia> vec .|> length .|> x -> x > 2
3-element BitVector:
1
1
1
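One detail worth checking: inside filter, the predicate receives one token (a single String) at a time, and broadcasting treats a String as a scalar. So in that position the dotted and undotted forms should behave the same. A small sketch to verify this; the variable names are only illustrative:

```julia
token = "foo"

# Broadcasting treats a String as a scalar, so both forms give a plain Bool.
dotted = length.(token) .> 2
plain  = length(token) > 2

# The same holds when either form is used as a filter predicate.
sentence = ["Julia", "is", "1000x"]
kept_dotted = filter(y -> length.(y) .> 2, sentence)
kept_plain  = filter(y -> length(y) > 2, sentence)
```

The dots only change the result when the argument is a collection, as in the vec example.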
I hope this helps, @fandak.