I have a Dataset<Row> dataset
and want to perform some basic operations on it.
For example, suppose it has three columns, "Id", "Name", and "Age",
with data in each. I want to apply the following operations to the Name column:
[1] Remove whitespace from the Name column
[2] Remove numbers from the Name column
[3] Remove special characters from the Name column
I am using Java 8 with Apache Spark and the Apache Spark ML library.
Please suggest the best way to do this.
Answer:
Use regexp_replace()
to strip whitespace, digits, and special characters in a single pass by replacing every character that is not a letter with an empty string (in effect, only letters are retained).
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.*;

// Create sample data
List<Row> rows = Arrays.asList(
    RowFactory.create("validName"),
    RowFactory.create("name with whitespace "),
    RowFactory.create("name with numbers 1234"),
    RowFactory.create("name with special chars !@#$%"));
StructType schema = new StructType(new StructField[]{
    new StructField("Name", DataTypes.StringType, false, Metadata.empty())
});
// spark() is assumed to return the active SparkSession
Dataset<Row> input = spark().createDataFrame(rows, schema);
// Replace every run of non-letter characters with an empty string
input.withColumn("cleanedName",
        functions.regexp_replace(functions.col("Name"), "[^a-zA-Z]+", ""))
    .show(100, false);
The cleanedName column has the expected values:
+-----------------------------+--------------------+
|Name                         |cleanedName         |
+-----------------------------+--------------------+
|validName                    |validName           |
|name with whitespace         |namewithwhitespace  |
|name with numbers 1234       |namewithnumbers     |
|name with special chars !@#$%|namewithspecialchars|
+-----------------------------+--------------------+
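Spark's regexp_replace takes Java regular expressions, so the patterns can be tried in plain Java before they go into a Spark job. The sketch below is only an illustration: the NameCleaner class and its method names are hypothetical, and "special character" is assumed to mean anything that is not an ASCII letter, a digit, or whitespace. It splits the three operations from the question into separate patterns and also includes the combined letters-only pattern.

```java
import java.util.regex.Pattern;

// Hypothetical helper for trying the patterns outside Spark; not part of any Spark API.
final class NameCleaner {
    // [1] runs of whitespace
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");
    // [2] runs of ASCII digits
    private static final Pattern DIGITS = Pattern.compile("[0-9]+");
    // [3] assumption: "special" means not an ASCII letter, digit, or whitespace
    private static final Pattern SPECIAL = Pattern.compile("[^a-zA-Z0-9\\s]+");
    // Combined: everything that is not an ASCII letter
    private static final Pattern NON_LETTERS = Pattern.compile("[^a-zA-Z]+");

    static String removeWhitespace(String s) {
        return WHITESPACE.matcher(s).replaceAll("");
    }

    static String removeDigits(String s) {
        return DIGITS.matcher(s).replaceAll("");
    }

    static String removeSpecial(String s) {
        return SPECIAL.matcher(s).replaceAll("");
    }

    static String lettersOnly(String s) {
        return NON_LETTERS.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String raw = "name with special chars !@#$% 1234";
        // Applying the three steps one after another...
        String stepwise = removeWhitespace(removeDigits(removeSpecial(raw)));
        // ...gives the same result as the combined pattern for this input.
        System.out.println(stepwise.equals(lettersOnly(raw)));
        System.out.println(lettersOnly(raw));
    }
}
```

Any of these pattern strings can be passed as the second argument of functions.regexp_replace on the Name column if only one of the three operations is wanted. Note that [^a-zA-Z] also strips accented and other non-ASCII letters; if those should be kept, a Unicode class such as [^\\p{L}]+ may be a better fit.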