using regex with group by in Apache spark


I have two tables: the first contains car brands, and the second contains specific models of those brands.

For example, this is part of the first table:

Car Brand:

Abarth

Alfa Romeo

Aston Martin

Audi

Bentley

BMW

and the second table:

Make/Model:

Chevrolet Pickup (Full Size)

Ford Pickup (Full Size)

Toyota Camry

Nissan Altima

Chevrolet Impala

Honda Accord

GMC Pickup (Full Size)

I need to join these two tables. I want to use a regex to take the first part (the brand) of each entry in the second table and join on it.

For example:

> Honda Accord joins with Honda in the first table

I did something like this :

Dataset<Row> updatedCars = carsTable.join(carsTheftsTable, expr("Car Brand rlike Make/Model")).cache();

but it's not working: in Java Spark, expr is not found.

Any help?

CodePudding user response:

You could try a cross join and check, for each brand/model pair, whether the brand appears in the model string.

For example, in PySpark:

import pyspark.sql.functions as F  # noqa
from pyspark.sql.types import StructType, StructField, StringType

df1 = spark.createDataFrame(data=[
    ["Honda"],
    ["Alfa Romeo"],
    ["Aston Martin"],
    ["Toyota"],
    ["Nissan"],
    ["BMW"]
], schema=StructType([StructField("car_brand", StringType())])).cache()

df2 = spark.createDataFrame(data=[
    ["Chevrolet Pickup (Full Size)"],
    ["Ford Pickup (Full Size)"],
    ["Toyota Camry"],
    ["Nissan Altima"],
    ["Chevrolet Impala"],
    ["Honda Accord"],
    ["GMC Pickup (Full Size)"]
], schema=StructType([StructField("car_model", StringType())])).cache()

df1.crossJoin(df2).filter(F.col("car_model").contains(F.col("car_brand"))).show(100, False)

+---------+-------------+
|car_brand|car_model    |
+---------+-------------+
|Honda    |Honda Accord |
|Toyota   |Toyota Camry |
|Nissan   |Nissan Altima|
+---------+-------------+
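Note that `contains` is a plain substring match, so a short brand name could in principle match in the middle of an unrelated model string; anchoring the match at the start of the model string is safer (in PySpark, `Column.startswith` also accepts another column, so the filter could be `F.col("car_model").startswith(F.col("car_brand"))`). Here is the same cross-join-and-match logic sketched in plain Python, using the sample data from the question, just to illustrate the idea outside Spark:

```python
# Plain-Python sketch of the cross join + anchored match above.
brands = ["Honda", "Alfa Romeo", "Aston Martin", "Toyota", "Nissan", "BMW"]
models = [
    "Chevrolet Pickup (Full Size)",
    "Ford Pickup (Full Size)",
    "Toyota Camry",
    "Nissan Altima",
    "Chevrolet Impala",
    "Honda Accord",
    "GMC Pickup (Full Size)",
]

matches = [
    (brand, model)
    for brand in brands              # cross join: every brand ...
    for model in models              # ... paired with every model,
    if model.startswith(brand)       # keeping pairs where the model begins with the brand
]
# matches: Honda/Honda Accord, Toyota/Toyota Camry, Nissan/Nissan Altima
```

This handles multi-word brands such as "Alfa Romeo" correctly, since the whole brand string is matched against the start of the model string rather than splitting the model on spaces.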