How to collect data from a stream in different lists based on a condition?

I have a stream of data as shown below and I wish to collect the data based on a condition.

Stream of data:

452857;0;L100;csO;20220411;20220411;EUR;000101435; ; ;F;1;EUR;000100000; ;
452857;0;L120;csO;20220411;20220411;EUR;000101435; ; ;F;1;EUR;000100000; ;
452857;0;L121;csO;20220411;20220411;EUR;000101435; ; ;F;1;EUR;000100000; ;
452857;0;L126;csO;20220411;20220411;EUR;000101435; ; ;F;1;EUR;000100000; ;
452857;0;L100;csO;20220411;20220411;EUR;000101435; ; ;F;1;EUR;000100000; ;
452857;0;L122;csO;20220411;20220411;EUR;000101435; ; ;F;1;EUR;000100000; ;

I want to group the lines by the value at index 2 (L100, L120, L121, ...) and store them in separate lists, one per value (L120, L121, L122, etc.), using Java 8 streams. Any suggestions? Note: the splittedLine array below holds my stream of data.

I have tried the following, but I think there's a shorter way:

List<String> L100_ENTITY_NAMES = Arrays.asList("L100", "L120", "L121", "L122", "L126");


List<List<String>> list = L100_ENTITY_NAMES.stream()
        .map(entity -> Arrays.stream(splittedLine)
                .filter(line -> {
                    String[] values = line.split(String.valueOf(DELIMITER));
                    if (values.length > 0) {
                        return entity.equals(values[2]);
                    } else {
                        return false;
                    }
                })
                .collect(Collectors.toList()))
        .collect(Collectors.toList());

CodePudding user response：

I'd rather change the order and also collect the data into a Map<String, List<String>> where the key would be the entity name.

Assuming splittedLine is the array of lines, I'd probably do something like this:

Set<String> L100_ENTITY_NAMES = Set.of("L100", ...);
String delimiter = String.valueOf(DELIMITER);

Map<String, List<String>> result =
    Arrays.stream(splittedLine)
        .map(line -> {
            String[] values = line.split(delimiter);
            if (values.length < 3) {
                return null; // too few fields to contain an entity name
            }
            // key = entity name (index 2), value = the original line
            return new AbstractMap.SimpleEntry<>(values[2], line);
        })
        .filter(Objects::nonNull)
        .filter(entry -> L100_ENTITY_NAMES.contains(entry.getKey()))
        .collect(Collectors.groupingBy(Map.Entry::getKey,
            Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

Note that this isn't necessarily shorter but has a couple of other advantages:

- It's not O(n*m): each line is split once and checked against the set once. With a hash-based Set that is roughly O(n); with a sorted set it would be O(n * log(m)). Either way it should be faster for non-trivial stream sizes.
- You get an entity name for each list rather than having to rely on the indices in both lists.
- It's easier to understand because you use distinct steps:
  - split and map the line
  - filter null values, i.e. lines that aren't valid in the first place
  - filter lines that don't have any of the L100 entity names
  - collect the filtered lines by entity name so you can easily access the sub lists
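
The steps listed above can be put together in a small, self-contained sketch. The class name `EntityGrouping`, the helper names `groupByEntity` and `toEntry`, and the shortened sample lines are made up for illustration, and the delimiter is assumed to be `;` as in the question's data. It sticks to Java 8 APIs, so the set is built with `HashSet` rather than `Set.of`.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.stream.Collectors;

final class EntityGrouping {

    // Entity names taken from the question; a HashSet gives constant-time lookups on average.
    static final Set<String> ENTITY_NAMES =
            new HashSet<>(Arrays.asList("L100", "L120", "L121", "L122", "L126"));

    // Groups the raw lines by the field at index 2, keeping only known entity names.
    static Map<String, List<String>> groupByEntity(String[] lines, String delimiter) {
        return Arrays.stream(lines)
                .map(line -> toEntry(line, delimiter))
                .filter(Objects::nonNull)
                .filter(entry -> ENTITY_NAMES.contains(entry.getKey()))
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    // Returns (entity name, original line), or null when the line has fewer than three fields.
    static Map.Entry<String, String> toEntry(String line, String delimiter) {
        String[] values = line.split(delimiter);
        if (values.length < 3) {
            return null;
        }
        return new AbstractMap.SimpleEntry<>(values[2], line);
    }

    public static void main(String[] args) {
        String[] lines = {
            "452857;0;L100;csO", "452857;0;L120;csO", "452857;0;L100;csO", "broken"
        };
        // Each entry maps an entity name to the lines that carry it.
        groupByEntity(lines, ";").forEach((name, group) ->
                System.out.println(name + " -> " + group));
    }
}
```

Note that `String.split` treats its argument as a regular expression, so a delimiter such as `|` would need escaping before being passed in.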

CodePudding user response：

You're effectively asking for what languages like Scala provide on collections: groupBy. In Scala you could write:

splitLines.groupBy(_(2)) // Map[String, List[String]]

Of course, you want this in Java, and in my opinion, not using streams here makes sense due to Java's lack of a fold or groupBy function.

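// Assumes splitLines holds the already-split lines (one String[] per line).
HashMap<String, ArrayList<String>> map = new HashMap<>();
for (String[] line : splitLines) {
    if (line.length < 3) continue; // line[2] must exist
    ArrayList<String> xs = map.getOrDefault(line[2], new ArrayList<>());
    xs.addAll(Arrays.asList(line)); // appends every field of the line to the group's list
    map.put(line[2], xs);
}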

As you can see, it's very easy to understand, and actually shorter than the stream based solution.

I'm leveraging two key methods on a HashMap.

The first is getOrDefault; if no value is associated with our key yet, we can provide a default. In our case, that is an empty ArrayList.
The second is put, which acts like a putOrReplace because it overwrites any previous value associated with the key.
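
For comparison, the same loop can be sketched with `Map.computeIfAbsent`, which looks up the list and inserts a fresh one in a single call when the key is missing. This is a swapped-in alternative, not the getOrDefault/put pair described above. The class name `LoopGrouping`, the method name, and the sample lines are made up for illustration, the delimiter is assumed to be `;`, and the sketch stores whole lines rather than individual fields.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LoopGrouping {

    // Groups raw lines by their third field using a plain loop.
    static Map<String, List<String>> groupByThirdField(String[] lines) {
        Map<String, List<String>> map = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split(";");
            if (fields.length < 3) {
                continue; // no field at index 2, skip the line
            }
            // Creates the list on first sight of the key, then appends the line to it.
            map.computeIfAbsent(fields[2], key -> new ArrayList<>()).add(line);
        }
        return map;
    }

    public static void main(String[] args) {
        String[] lines = {"452857;0;L100;csO", "452857;0;L120;csO", "452857;0;L100;csO"};
        System.out.println(groupByThirdField(lines));
    }
}
```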

I hope that was helpful. :)

CodePudding user response：

You're asking for a shorter way to achieve the same result, but your code is actually fine. I guess the only part that makes it look lengthy is the if/else check in the stream:

    if (values.length > 0) {
        return entity.equals(values[2]);
    } else {
        return false;
    }

I would suggest introducing two small private methods to improve readability, like this:

    List<List<String>> list = L100_ENTITY_NAMES.stream()
            .map(entity -> getLinesByEntity(splittedLine, entity))
            .collect(Collectors.toList());

    private List<String> getLinesByEntity(String[] splittedLine, String entity) {
        return Arrays.stream(splittedLine)
                .filter(line -> isLineMatched(entity, line))
                .collect(Collectors.toList());
    }

    private boolean isLineMatched(String entity, String line) {
        // String.valueOf works whether DELIMITER is a char or a String
        String[] values = line.split(String.valueOf(DELIMITER));
        // values[2] is only safe to read when there are at least three fields
        return values.length > 2 && entity.equals(values[2]);
    }
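
As a runnable illustration of this refactoring, here is a minimal sketch in which the helpers are static so they can be called from `main`. The class name `AlignedLists`, the method names, the `;` value of `DELIMITER`, and the sample lines are assumptions for the example. It also makes visible that the resulting lists line up by position with the list of entity names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class AlignedLists {

    static final String DELIMITER = ";"; // assumed value; the question only shows the name

    // Produces one list of lines per entity, in the same order as the entity names.
    static List<List<String>> linesPerEntity(List<String> entities, String[] lines) {
        return entities.stream()
                .map(entity -> linesFor(entity, lines))
                .collect(Collectors.toList());
    }

    static List<String> linesFor(String entity, String[] lines) {
        return Arrays.stream(lines)
                .filter(line -> matches(entity, line))
                .collect(Collectors.toList());
    }

    static boolean matches(String entity, String line) {
        String[] values = line.split(DELIMITER);
        return values.length > 2 && entity.equals(values[2]);
    }

    public static void main(String[] args) {
        List<String> entities = Arrays.asList("L100", "L120", "L122");
        String[] lines = {"452857;0;L100;csO", "452857;0;L120;csO", "452857;0;L100;csO"};
        List<List<String>> result = linesPerEntity(entities, lines);
        // The list at position i belongs to the entity name at position i.
        for (int i = 0; i < entities.size(); i++) {
            System.out.println(entities.get(i) + " -> " + result.get(i));
        }
    }
}
```

An entity name with no matching lines still gets an (empty) list, which keeps the positions of the two lists aligned.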

CodePudding user response：

I would convert the semicolon-delimited lines to objects as soon as possible, instead of keeping them around as a serialized bunch of data.

First, I would create a model for our data:

public record LBasedEntity(long id, int zero, String lcode, …) { }

Then, create a method to parse a line. An external parsing library would work as well, since this looks like CSV with a semicolon as the delimiter.

private static LBasedEntity parse(String line) {
    String[] parts = line.split(";");
    if (parts.length < 3) {
        return null;
    }

    long id = Long.parseLong(parts[0]);
    int zero = Integer.parseInt(parts[1]);
    String lcode = parts[2];
    …
    return new LBasedEntity(id, zero, lcode, …);
}

Then the mapping is trivial:
```
Map<String, List<LBasedEntity>> result = Arrays.stream(lines)
    .map(line -> parse(line))
    .filter(Objects::nonNull)
    .filter(lBasedEntity -> L100_ENTITY_NAMES.contains(lBasedEntity.lcode()))
    .collect(Collectors.groupingBy(LBasedEntity::lcode));
```
- map(line -> parse(line)) parses the line into an LBasedEntity object (or whatever you call it);
- filter(Objects::nonNull) filters out all null values produced by the parse method;
- The next filter keeps all entities whose lcode property is contained in L100_ENTITY_NAMES (I would turn this into a Set to speed things up);
- Finally, the result is collected into a Map with key-value pairs of L100_ENTITY_NAME → List<LBasedEntity>.
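
The pipeline above can be tried out with the following minimal sketch. Since the question targets Java 8, a plain class stands in for the record here. The names `ParsedGrouping`, `LCodeLine`, and `group` are hypothetical, and only the first three fields are modelled because the remaining ones are elided in the answer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.stream.Collectors;

final class ParsedGrouping {

    // Hypothetical plain-class stand-in for the record; models only the first three fields.
    static final class LCodeLine {
        private final long id;
        private final int zero;
        private final String lcode;

        LCodeLine(long id, int zero, String lcode) {
            this.id = id;
            this.zero = zero;
            this.lcode = lcode;
        }

        long id() { return id; }
        int zero() { return zero; }
        String lcode() { return lcode; }
    }

    static final Set<String> ENTITY_NAMES =
            new HashSet<>(Arrays.asList("L100", "L120", "L121", "L122", "L126"));

    // Returns null for lines with fewer than three fields; malformed numbers
    // in the first two fields would throw NumberFormatException.
    static LCodeLine parse(String line) {
        String[] parts = line.split(";");
        if (parts.length < 3) {
            return null;
        }
        return new LCodeLine(Long.parseLong(parts[0]), Integer.parseInt(parts[1]), parts[2]);
    }

    static Map<String, List<LCodeLine>> group(String[] lines) {
        return Arrays.stream(lines)
                .map(ParsedGrouping::parse)
                .filter(Objects::nonNull)
                .filter(entity -> ENTITY_NAMES.contains(entity.lcode()))
                .collect(Collectors.groupingBy(LCodeLine::lcode));
    }

    public static void main(String[] args) {
        String[] lines = {
            "452857;0;L100;csO;20220411;20220411;EUR;000101435; ; ;F;1;EUR;000100000; ;",
            "452857;0;L120;csO;20220411;20220411;EUR;000101435; ; ;F;1;EUR;000100000; ;",
            "452857;0;L100;csO;20220411;20220411;EUR;000101435; ; ;F;1;EUR;000100000; ;"
        };
        // Print how many parsed entities ended up under each lcode.
        group(lines).forEach((lcode, entities) ->
                System.out.println(lcode + " -> " + entities.size() + " line(s)"));
    }
}
```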