Home > Blockchain >  Lists in dataframe columns - need to obtain an index in one column and use to to pull data from anot
Lists in dataframe columns - need to obtain an index in one column and use to to pull data from anot

Time:01-06

I have a dataframe with several columns. What I need to do is match data in a list in column Name, obtain the index, and use that index to search a list in column Date and grab the corresponding date to filter on.

Name Date
[n1, n2, n3] [date1, date2, date3]

e.g. if Name === n2, return index

then: Date[index] return date

then: filter on date

The table has thousands of rows with other columns (containing unique IDs and other data) so I'm looking for a solution to work on that scale.

This needs to be in Spark Scala, which I am very new in learning and I'm struggling to even print a value from the initial if statement. Any pointers or potential things to try would be massively appreciated.

I can access the list within the column using df.select($"name") and access a specific piece of data at an index using df.select($"name"(0))

However, if I try:

for (x <- df.select($"name"){
   println("print anything")
}

the code looks like it executes (there are no errors) but I get nothing printed at all, not even just a string like I have above.

I've been trying to get even this simple(!) first bit to work for a good few days and have tried a number of solutions found here, but most solutions seem to throw a type error (e.g., I have a row and not a list). I don't understand this as I can pull data by specifying an index, so why is it unable to read the list and thinks it is a row?

CodePudding user response:

I am not sure what you meant by then: filter on date, but this might help you achieve what you want to do.

If df was this dataset:

 ------------ --------------------- 
|Name        |Date                 |
 ------------ --------------------- 
|[n1, n2, n3]|[date1, date2, date3]|
 ------------ --------------------- 

Then this:

df.withColumn("newDate", expr("Date[find_in_set('n2', array_join(Name, ',')) - 1]"))

Gives this:

 ------------ --------------------- ------- 
|Name        |Date                 |newDate|
 ------------ --------------------- ------- 
|[n1, n2, n3]|[date1, date2, date3]|date2  |
 ------------ --------------------- ------- 

Which you can then filter or do what ever you like. Hope it helps, good luck!

  • Related