Here is some example data turned into an RDD:
my_data = [{'id': '001', 'name': 'Sam', 'class': "classA", 'age': 15, 'exam_score': '90'},
{'id': '002', 'name': 'Tom', 'class': "classA", 'age': 15, 'exam_score': '78'},
{'id': '003', 'name': 'Ben', 'class': "classB", 'age': 16, 'exam_score': '91'},
{'id': '004', 'name': 'Max', 'class': "classB", 'age': 16, 'exam_score': '76'},
{'id': '005', 'name': 'Ana', 'class': "classA", 'age': 15, 'exam_score': '88'},
{'id': '006', 'name': 'Ivy', 'class': "classA", 'age': 16, 'exam_score': '77'},
{'id': '007', 'name': 'Eva', 'class': "classB", 'age': 15, 'exam_score': '86'},
{'id': '008', 'name': 'Zoe', 'class': "classB", 'age': 16, 'exam_score': '89'}]
my_rdd = sc.parallelize(my_data)
Running my_rdd
by itself returns:
#>>> ParallelCollectionRDD[117] at readRDDFromFile at PythonRDD.scala:274
And I know you can display the RDD with my_rdd.collect()
which returns:
#[{'age': 15,
# 'class': 'classA',
# 'exam_score': '90',
# 'id': '001',
# 'name': 'Sam'},
# {'age': 15,
# 'class': 'classA',
# 'exam_score': '78',
# 'id': '002',
# 'name': 'Tom'}, ...]
I have found that I can access the keys by running my_rdd.keys()
, but this returns:
#>>> PythonRDD[121] at RDD at PythonRDD.scala:53
I want to return a list of all the distinct keys (I know the keys are the same for each line but for a scenario where they aren't I would like to to know) in the RDD - so something that looks like this:
#>>> ['id', 'name', 'class', 'age', 'exam_score']
So with this I assumed I could get this by running my_rdd.keys().distinct.collect()
but I get an error instead.
I'm still learning pyspark so I would really appreciate if someone could provide some input :)
CodePudding user response:
my_data = [{'id': '001', 'name': 'Sam', 'class': "classA", 'age': 15, 'exam_score': '90'},
{'id': '008', 'name': 'Zoe', 'xxxxx': "classB", 'age': 16, 'exam_score': '89'},
{'id': '007', 'name': 'Eva', 'class': "classB", 'age': 15, 'exam_score': '86'},
]
my_rdd = sc.parallelize(my_data)
key_rdd = my_rdd.flatMap(lambda x: x) # flatMap
print( key_rdd.collect() )
# ['id', 'name', 'class', 'age', 'exam_score', 'id', 'name', 'xxxxx', 'age', 'exam_score', 'id', 'name', 'class', 'age', 'exam_score']
print( key_rdd.distinct().collect() )
# ['class', 'id', 'exam_score', 'age', 'name', 'xxxxx']