Take the following code, which is sorting a List and then filtering on it :
public static void main(String[] args) {
List<Integer> list = List.of(3,2,1);
List<Integer> filtered = list.stream()
.sorted() // Does sorted() sort the entire array first ? Then pass the entire sorted output to filter ?
.filter(x -> x < 3)
.collect(Collectors.toList());
System.out.println(filtered);
}
Does the entire sort()
happen first, then gets passed to filter()
?
Then isn't that a violation of what streams are supposed to do?
I mean, they are supposed to process one element at a time.
CodePudding user response:
Does the entire sort() happen first then gets passed to filter() ?
Then isn't that a violation of what streams are suppose to do ?
No, it isn't. Take a look at the documentation of the Stream IPA:
Intermediate operations are further divided into stateless and stateful operations. Stateless operations, such as filter and map, retain no state from previously seen element when processing a new element -- each element can be processed independently of operations on other elements. Stateful operations, such as distinct and sorted, may incorporate state from previously seen elements when processing new elements.
Stateful operations may need to process the entire input before producing a result. For example, one cannot produce any results from sorting a stream until one has seen all elements of the stream. As a result, under parallel computation, some pipelines containing stateful intermediate operations may require multiple passes on the data or may need to buffer significant data. Pipelines containing exclusively stateless intermediate operations can be processed in a single pass, whether sequential or parallel, with minimal data buffering.
That means sorted
is aware of all previously encountered elements, i.e. it's stateful. But map
and filter
don't need this information, they are stateless and lazy, these operations always process elements from the stream source one at a time.
And it's technically impossible to sort the contents of a pipeline by looking at a single element in isolation. sorted
operates on all elements "at once" and hands out a sorted stream to the next operation. You might think of sorted
as if it becomes a new source of the stream.
Let's take a look at the following stream and analyze how it will be processed:
Stream.of("foo", "bar", "Alice", "Bob", "Carol")
.filter(str -> !str.contains("r")) // lazy processing
.peek(System.out::println)
.map(String::toUpperCase) // lazy processing
.peek(System.out::println)
.sorted() // <--- all data is being dumped into memory
.peek(System.out::println)
.filter(str -> str.length() > 3) // lazy processing
.peek(System.out::println)
.findFirst(); // <--- the terminal operation
Operations filter
and map
that precede sorted
will get applied lazily on each element from the source of the stream and only when it's needed. I.e. filter will be applied on the "foo"
, it successfully passes the filter and gets transformed by the map operation. Then filter gets applied on the "bar"
and it will not reach the map. Then it will be "Alice"
's turn to pass the filter, and then the map will get executed on that string. And so on.
Keep in mind that sorted()
requires all the data to do its job, so the first filter will get executed for all elements from the source and the map will be applied on every element that passed the filter.
Then sorted()
operation will dump all the contents of the stream into memory and will sort the elements that have passed the first filter.
And after the sorting, again all elements will be processed lazily one at a time. Hence, the second filter will be applied only once (although 3
element have passed the first filter and were sorted). "Alice"
will pass the second filter and reach the terminal operation findFirst()
that will return this string.
Take a look at the debugging output from the peek()
make that the process of execution is happening as described above.