I'm trying to concatenate two columns from two Spark DataFrames using SparkR, e.g.
Frame1
ID Data1 Data2 ...
1 0 0
2 1 0
3 1 1
4 0 1
5 1 1
Frame2
OtherID Data1 Data2 ...
2 0 0
3 1 0
4 1 1
5 0 1
6 1 1
After concatenation:
Frame1
ID Data1 Data2 ...
12 0 0
23 1 0
34 1 1
45 0 1
56 1 1
As these are Spark DataFrames, each column is an S4 Column object rather than a vector, so a simple paste(Frame1$ID, Frame2$OtherID)
doesn't work: R can't coerce the S4 type into a vector. My current solution is to collect the two frames into regular R data frames and then run the paste command:
r_frame1 <- collect(Frame1)
r_frame2 <- collect(Frame2)
r_frame1$id <- paste(r_frame1$ID, r_frame2$OtherID, sep = "")
This works for some of my data, but I need to do this for several tables, all of which are very large (several million rows), and I'm getting java.lang.OutOfMemoryError in my Databricks environment. Is there a more memory-efficient way of doing this, perhaps one that doesn't involve collecting into an R data frame? I have also tried running it as a SQL command using CONCAT, but ran into some other issues (see my previous related question: Databricks SQL CONCAT function adding characters where not requested?)
CodePudding user response:
You can use the function concat:
Frame1$id = concat(Frame1$ID, Frame2$OtherID)
If you want to concatenate with a separator, you can use concat_ws:
Frame1$id = concat_ws('_', Frame1$ID, Frame2$OtherID)
Note that if your columns are integers, you may need to cast them to string before the concatenation, like this:
Frame1$id = concat(cast(Frame1$ID, "string"), cast(Frame2$OtherID, "string"))
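For completeness, here is a minimal end-to-end sketch. A Spark expression can only reference columns of a single DataFrame, so Frame1 and Frame2 first have to be combined into one frame. The join key below (a row number computed with row_number() over a window and stored in a hypothetical rn column) assumes the rows of the two frames correspond by position, as in the example above:
library(SparkR)
# Assumption: rows of Frame1 and Frame2 line up by position.
# Add a positional key to each frame so they can be joined into one DataFrame.
f1 <- withColumn(Frame1, "rn", over(row_number(), windowOrderBy(Frame1$ID)))
f2 <- withColumn(Frame2, "rn", over(row_number(), windowOrderBy(Frame2$OtherID)))
# Join on the positional key, then concatenate inside Spark -- no collect() needed.
joined <- join(f1, f2, f1$rn == f2$rn, "inner")
joined$id <- concat(joined$ID, joined$OtherID)
Bear in mind that a window without a partition clause pulls all rows into a single partition to compute the row numbers, so on tables with millions of rows it is much cheaper to join on a real shared key if one exists.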