I have an RDD file that has two columns, O and D. Each row is an edge between the values in the two columns. For example:
O | D |
---|---|
a | b |
b | g |
c | t |
g | a |
That means a is related to b, and so on. I need a file like this one, but with every row removed whose D value does not appear in column O. Here that means dropping the row c -- t, because t does not appear in column O. I tried something that seems to work: I build a list of all the values in column O and filter out every row whose D value is not in that list.
list_O = df.select('O').rdd.flatMap(lambda x: x).collect()
df1 = df.filter(df.D.isin(list_O)).show()
But when I try to look at the head of this new RDD, I get an error:
df1.head(5)
I don't understand why this raises an error.
Any ideas?
CodePudding user response:
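Yes, I have an idea. The function .show() only prints the DataFrame and returns None, so in your code df1 is set to None. Calling df1.head(5) on None then fails with an AttributeError, because None has no head method. Remove the .show() from the assignment: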
list_O = df.select('O').rdd.flatMap(lambda x: x).collect()
df1 = df.filter(df.D.isin(list_O))