Spark ml basic operation in Java

Time:05-06

I have a Dataset<Row> dataset and want to perform some basic operations on it.
For example, suppose I have three columns, "Id", "Name", and "Age", with data in each. I want to perform the following operations on the Name column:
[1] Remove whitespace from the Name column
[2] Remove numbers from the Name column
[3] Remove special characters from the Name column

I am using Java 8, Apache Spark, and the Apache Spark ML library.

Please suggest the best way to do this.

CodePudding user response:

Use regexp_replace() to strip whitespace, numbers, and special characters in a single pass, which effectively keeps only the letters.

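import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Sample rows: a clean name, trailing whitespace, digits, and special characters
List<Row> rows = Arrays.asList(
    RowFactory.create("validName"),
    RowFactory.create("name with whitespace    "),
    RowFactory.create("name with numbers 1234"),
    RowFactory.create("name with special chars !@#$%"));

// Schema with a single non-nullable string column
StructField[] structFields = new StructField[]{
        new StructField("Name", DataTypes.StringType, false, Metadata.empty()),
};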

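// Create the sample data; spark() is assumed to return the active SparkSession
Dataset<Row> input = spark().createDataFrame(rows, new StructType(structFields));
// Replace every run of non-letter characters with the empty string
input.withColumn("cleanedName", functions.regexp_replace(functions.col("Name"), "[^a-zA-Z]+", "")).show(100, false);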

The cleanedName column has the expected values:

+-----------------------------+--------------------+
|Name                         |cleanedName         |
+-----------------------------+--------------------+
|validName                    |validName           |
|name with whitespace         |namewithwhitespace  |
|name with numbers 1234       |namewithnumbers     |
|name with special chars !@#$%|namewithspecialchars|
+-----------------------------+--------------------+
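
The question lists three separate clean-up steps, while the answer handles them with one pattern. Since Spark's regexp_replace uses Java regular expressions, each step can be tried on plain strings first. The following is only a sketch: the class name NameCleaner, its method names, and the definition of "special character" (anything that is not an ASCII letter, digit, or whitespace) are assumptions made for illustration, not part of the original answer.

```java
import java.util.regex.Pattern;

public class NameCleaner {
    // One precompiled pattern per clean-up step from the question.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");
    private static final Pattern DIGITS = Pattern.compile("[0-9]+");
    // Assumption: "special" means not an ASCII letter, digit, or whitespace.
    private static final Pattern SPECIAL = Pattern.compile("[^a-zA-Z0-9\\s]+");

    // [1] Remove whitespace.
    static String removeWhitespace(String s) {
        return WHITESPACE.matcher(s).replaceAll("");
    }

    // [2] Remove digits.
    static String removeDigits(String s) {
        return DIGITS.matcher(s).replaceAll("");
    }

    // [3] Remove special characters.
    static String removeSpecial(String s) {
        return SPECIAL.matcher(s).replaceAll("");
    }

    // All three steps chained; for ASCII input this matches "[^a-zA-Z]+".
    static String keepLettersOnly(String s) {
        return removeSpecial(removeDigits(removeWhitespace(s)));
    }

    public static void main(String[] args) {
        String[] samples = {
            "validName",
            "name with whitespace    ",
            "name with numbers 1234",
            "name with special chars !@#$%"
        };
        for (String s : samples) {
            System.out.println("[" + s + "] -> [" + keepLettersOnly(s) + "]");
        }
    }
}
```

In Spark, the same pattern strings could be passed one at a time to functions.regexp_replace(functions.col("Name"), pattern, "") when only one of the three steps is wanted. Note that these ASCII-only patterns also strip accented letters; if those should be kept, a Unicode class such as \p{L} may be a better fit.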