Home > OS >  Java-Stream - Sort and Transform elements using groupingBy()
Java-Stream - Sort and Transform elements using groupingBy()

Time:12-09

Suppose I have the following domain class:

public class OrderRow {
    private Long orderId;
    private String action;
    private LocalDateTime timestamp;
    
    // getters, constructor, etc.
}

I have a following data set of OrderRows :

OrderId     Action                  Timestamp
3           Pay money               2015-05-27 12:48:47.000
3           Select Item             2015-05-27 12:44:47.000
1           Generate Payment        2015-05-27 12:55:47.000
2           Pay money               2015-05-27 12:48:47.000
2           Select Item             2015-05-27 12:44:47.000
2           Deliver                 2015-05-27 12:55:47.000
1           Generate Invoice        2015-05-27 12:48:47.000
1           Create PO               2015-05-27 12:44:47.000
3           Deliver                 2015-05-27 12:55:47.000

What I want to obtain the following Map from the sample data shown above:

[3] -> ["Select Item", "Pay money", "Deliver"]
[1] -> ["Create PO", "Generate Invoice", "Generate Payment"]
[2] -> ["Select Item", "Pay money", "Deliver"]  

By performing below operations :

  1. I want to groupBy orderId.
  2. Sort actions by timestamp.
  3. Create a Set (as there can be duplicates) of actions.

I am trying to do this in a single groupingBy operation as performing separate sorting, mapping operations take a lot of time if data set is huge.

I've tried to do the following:

orderRows.stream()
    .collect(Collectors.groupingBy(OrderRow::getOrderId, 
        Collectors.mapping(Function.identity(),
            Collectors.toCollection(
                () -> new TreeSet<>(Comparator.comparing(e -> e.timestamp))
    ))));

But then I get output as Map<String, Set<OrderRow>> where as I need the result of type Map<String, Set<String>>.

Would be really grateful if someone can show me at least a direction to go.

Note that is a critical operation and should be done in few milliseconds, hence performance is important.

CodePudding user response:

TL;DR

  • LinkedHashSet - is the only implementation of the Set interface, which would to retain the order of the plain Strings (action property), that differs from the alphabetical one.

  • Timsort has a slightly better performance than sorting via Red-black tree (i.e. by storing elments into a TreeSet). Since OP has said that "is a critical operation", that worth to take into consideration.


You can sort the stream elements by timestamp before applying collect. That would guarantee that actions in the list would be consumed in the required order.

And to retain the order of these Strings, you can use a LinkedHashSet (TreeSet would not be helpful in this case since you need to keep the order by property which would not be present in the collection).

var actionsById = orderRows.stream()
    .sorted(Comparator.comparing(OrderRow::getTimestamp))
    .collect(Collectors.groupingBy(
        OrderRow::getOrderId,
        Collectors.mapping(
            OrderRow::getAction,
            Collectors.toCollection(LinkedHashSet::new))
    ));

Alternatively, you can sort the sets while grouping (as you were trying to do).

var actionsById = orderRows.stream()
    .collect(Collectors.groupingBy(
        OrderRow::getOrderId,
        Collectors.collectingAndThen(
            Collectors.mapping(
                Function.identity(),
                Collectors.toCollection(() -> new TreeSet<>(Comparator.comparing(OrderRow::getTimestamp)))
            ),
            set -> set.stream().map(OrderRow::getAction).collect(Collectors.toCollection(LinkedHashSet::new))
        )
    ));

Or instead of sorting via Red-Black Tree (TreeSet is backed by the an implementation of this data structure) each set can be sorted via Timsort (which is an algorithm used while sorting an array of elements, and sorted() operation is invoked in the stream, its elements are being dumped into an array).

Maintaining a Red-Black Tree is constful, and theoretically Timsort should perform a bit better.

var actionsById = orderRows.stream()
    .collect(Collectors.groupingBy(
        OrderRow::getOrderId,
        Collectors.collectingAndThen(
            Collectors.mapping(
                Function.identity(), Collectors.toList()
            ),
            list -> list.stream().sorted(Comparator.comparing(OrderRow::getTimestamp))
                .map(OrderRow::getAction)
                .collect(Collectors.toCollection(LinkedHashSet::new))
        )
    ));

To eliminate allocating a new array in memory (which happens when sorted() operation is being called) we can sort the lists produced by mapping() directly, and then create a stream out of sorted lists in the finisher function of collectingAndThen().

That would require writing a multiline lambda, but in this case is justifiable by the performance gain.

var actionsById = orderRows.stream()
    .collect(Collectors.groupingBy(
        OrderRow::getOrderId,
        Collectors.collectingAndThen(
            Collectors.mapping(
                Function.identity(), Collectors.toList()
            ),
            list -> {
                list.sort(Comparator.comparing(OrderRow::getTimestamp));
                return list.stream().map(OrderRow::getAction)
                    .collect(Collectors.toCollection(LinkedHashSet::new));
            }
        )
    ));
  • Related