Home > database >  How to best model an "Array of collection" for Apache Beam @JavaBeanSchema and BigQuery?
How to best model an "Array of collection" for Apache Beam @JavaBeanSchema and BigQuery?

Time:02-01

I have a Java class that models data meant for writing to both BigQuery and Elasticsearch. It looks something like this:

@DefaultSchema(JavaBeanSchema.class)
// also lombok annotations for getters, setters, builder, constructors, etc.
public class DataClass implements Serializable {
  String field1;
  List<String> field2;
  List<List<String>> field3; // this one gives the compiler error below
}

We try to always use JavaBeanSchema.class for its nice compatibility with org.apache.beam.sdk.values.Row, com.google.api.services.bigquery.model.TableRow, and the org.apache.beam.sdk.io.gcp.bigquery.ToTableRow format function with org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO. Combining these things means we can use BigQueryIO.Write without any explicit schema or coders, minimal boilerplate code.

Compiling the above class into a Beam template results in an IllegalArgumentException "Array of collection is not supported in BigQuery." OK - fine, but we still need this class to model the data for Elasticsearch where this schema works fine in JSON.

I am looking for the simplest way I can do some version of what I need to with minimal code. The solution I currently have is creating a separate class DataClassForBigQuery that is essentially a copy of this one except field3 is just a List<String> type, along with a PTransform<DataClass, DataClassForBigQuery> to serialize field3 as a JSON string when they are created. This is a relatively small amount of code, a fairly isolated, but:

  • I don't love the two classes to model the same data, just means more tests and maintenance
  • BQ has a native JSON field type, with some support for querying the values. Since technically the BQ field type of field3 is STRING, there's no good way to query/access that data without converting it to something else first. It would be nice if there was a way to annotate "this is a java String type, but interpret it as JSON".
  • I tried modeling the POJO with more complex types to refactor/wrap the nested Lists into a List of Objects and each Object contained a List of Strings. I think this works for Big Query, but Beam wasn't able to build the template. I believe the error was a stack overflow trying to build the schema.
  • If I could use the POJO class as-is, and somehow modify the schema on the fly to change the type of field3, and provide a function to format it but still leverage ToTableRow for the rest of it (the real class is large, 20 fields), I'd try that. But I don't really want to extend org.apache.beam code to this, that would end up being more work.

Any ideas?

CodePudding user response:

One approach is to use Jackson or Gson library to serialize the List<List<String>> field3 as a JSON string before writing it to BigQuery. You can create a custom serializer that can handle the nested List and write it as a JSON string to BigQuery. In Elasticsearch, the field3 can remain as is since it is already supported in JSON. To avoid duplicating the class, you can use inheritance or composition and create a subclass or an instance of the original class for BigQuery and override the serialization behavior for field3. Another alternative is to flatten the nested List<List<String>> field3 into a single List before writing it to BigQuery and then reverse the process when reading from BigQuery. This way you can use the same class for both BigQuery and Elasticsearch.

CodePudding user response:

Consider avoiding a list of lists by introducing an intermediate type to hold the inner list:

@DefaultSchema(JavaBeanSchema.class)
public class MyIntermediateType implements Serializable {
  List<String> innerField;
}

@DefaultSchema(JavaBeanSchema.class)
// also lombok annotations for getters, setters, builder, constructors, etc.
public class DataClass implements Serializable {
  String field1;
  List<String> field2;
  List<MyIntermediateType> field3;
}

BigQuery will be able to represent this without further processing (field3 becomes a repeated STRUCT that has one field that is a repeated STRING). I hope that Elasticsearch would also be able to interpret this structure, but I don't know details of Elasticsearch ingestion.

  • Related