Suppose I have the following domain class:
public class OrderRow {
private Long orderId;
private String action;
private LocalDateTime timestamp;
// getters, constructor, etc.
}
I have a following data set of OrderRow
s :
OrderId Action Timestamp
3 Pay money 2015-05-27 12:48:47.000
3 Select Item 2015-05-27 12:44:47.000
1 Generate Payment 2015-05-27 12:55:47.000
2 Pay money 2015-05-27 12:48:47.000
2 Select Item 2015-05-27 12:44:47.000
2 Deliver 2015-05-27 12:55:47.000
1 Generate Invoice 2015-05-27 12:48:47.000
1 Create PO 2015-05-27 12:44:47.000
3 Deliver 2015-05-27 12:55:47.000
What I want to obtain the following Map
from the sample data shown above:
[3] -> ["Select Item", "Pay money", "Deliver"]
[1] -> ["Create PO", "Generate Invoice", "Generate Payment"]
[2] -> ["Select Item", "Pay money", "Deliver"]
By performing below operations :
- I want to groupBy
orderId
. - Sort actions by
timestamp
. - Create a
Set
(as there can be duplicates) ofaction
s.
I am trying to do this in a single groupingBy
operation as performing separate sorting, mapping
operations take a lot of time if data set is huge.
I've tried to do the following:
orderRows.stream()
.collect(Collectors.groupingBy(OrderRow::getOrderId,
Collectors.mapping(Function.identity(),
Collectors.toCollection(
() -> new TreeSet<>(Comparator.comparing(e -> e.timestamp))
))));
But then I get output as Map<String, Set<OrderRow>>
where as I need the result of type Map<String, Set<String>>
.
Would be really grateful if someone can show me at least a direction to go.
Note that is a critical operation and should be done in few milliseconds, hence performance is important.
CodePudding user response:
TL;DR
LinkedHashSet
- is the only implementation of theSet
interface, which would to retain the order of the plainString
s (action
property), that differs from the alphabetical one.Timsort has a slightly better performance than sorting via Red-black tree (i.e. by storing elments into a
TreeSet
). Since OP has said that "is a critical operation", that worth to take into consideration.
You can sort the stream elements by timestamp before applying collect
. That would guarantee that action
s in the list would be consumed in the required order.
And to retain the order of these Strings, you can use a LinkedHashSet
(TreeSet
would not be helpful in this case since you need to keep the order by property which would not be present in the collection).
var actionsById = orderRows.stream()
.sorted(Comparator.comparing(OrderRow::getTimestamp))
.collect(Collectors.groupingBy(
OrderRow::getOrderId,
Collectors.mapping(
OrderRow::getAction,
Collectors.toCollection(LinkedHashSet::new))
));
Alternatively, you can sort the sets while grouping (as you were trying to do).
var actionsById = orderRows.stream()
.collect(Collectors.groupingBy(
OrderRow::getOrderId,
Collectors.collectingAndThen(
Collectors.mapping(
Function.identity(),
Collectors.toCollection(() -> new TreeSet<>(Comparator.comparing(OrderRow::getTimestamp)))
),
set -> set.stream().map(OrderRow::getAction).collect(Collectors.toCollection(LinkedHashSet::new))
)
));
Or instead of sorting via Red-Black Tree (TreeSet
is backed by the an implementation of this data structure) each set can be sorted via Timsort (which is an algorithm used while sorting an array of elements, and sorted()
operation is invoked in the stream, its elements are being dumped into an array).
Maintaining a Red-Black Tree is constful, and theoretically Timsort should perform a bit better.
var actionsById = orderRows.stream()
.collect(Collectors.groupingBy(
OrderRow::getOrderId,
Collectors.collectingAndThen(
Collectors.mapping(
Function.identity(), Collectors.toList()
),
list -> list.stream().sorted(Comparator.comparing(OrderRow::getTimestamp))
.map(OrderRow::getAction)
.collect(Collectors.toCollection(LinkedHashSet::new))
)
));
To eliminate allocating a new array in memory (which happens when sorted()
operation is being called) we can sort the lists produced by mapping()
directly, and then create a stream out of sorted lists in the finisher function of collectingAndThen()
.
That would require writing a multiline lambda, but in this case is justifiable by the performance gain.
var actionsById = orderRows.stream()
.collect(Collectors.groupingBy(
OrderRow::getOrderId,
Collectors.collectingAndThen(
Collectors.mapping(
Function.identity(), Collectors.toList()
),
list -> {
list.sort(Comparator.comparing(OrderRow::getTimestamp));
return list.stream().map(OrderRow::getAction)
.collect(Collectors.toCollection(LinkedHashSet::new));
}
)
));