I am trying to find all the Pokémon with the highest defense value using Spark RDD operations, but I only get one of the three Pokémon that share the highest defense value. Is there a way to get all three using only RDD operations? The Pokémon dataset can be downloaded from Pokemon data.
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("trial").setMaster("local")
sc = SparkContext(conf=conf)
input = "Pokemon.csv"
lineRDD = sc.textFile(input)
# Keep (name, defense) for data rows; map the header row to a placeholder.
poke_def = lineRDD.map(lambda line: line.split(',')).map(lambda f: (f[1], f[7]) if f[0].isdigit() else ('', '0'))
poke_def.reduce(lambda x,y: x if int(x[1]) >= int(y[1]) else y)
I have also tried using the max function directly instead of reduce, but that also returns only a single Pokémon.
print(poke_def.max(key=lambda x: int(x[1])))
CodePudding user response:
You can use the .top method:
>>> poke_def.top(3, key=lambda x: int(x[1]))
[('SteelixMega Steelix', '230'), ('Shuckle', '230'), ('AggronMega Aggron', '230')]
The key parameter specifies how the RDD will be sorted. In your case, you want to sort by defense (x[1]), and since that value is a string by default, you have to cast it to a numeric value to get a correct ordering: int(x[1]).
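To see why the cast matters, here is a minimal plain-Python sketch that runs without Spark. The sample rows are hypothetical, and heapq.nlargest only stands in for the ranking that .top performs; it is not Spark's implementation.

```python
import heapq

# Hypothetical sample rows in the same (name, defense-as-string) shape as poke_def.
rows = [("Shuckle", "230"), ("Onix", "160"), ("Chansey", "5"), ("Cloyster", "180")]

# Comparing the raw strings is lexicographic, so "5" ranks above "230".
by_string = heapq.nlargest(2, rows, key=lambda x: x[1])

# Casting to int compares the numbers themselves.
by_int = heapq.nlargest(2, rows, key=lambda x: int(x[1]))

print(by_string)
print(by_int)
```

With the string key, the single-digit value wins the ranking; with the int key, the rows come back in true numeric order.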
CodePudding user response:
I think I did not really understand your question in my other answer. I am leaving it up because it may still be useful.
If you want all the Pokémon with the highest defense without knowing in advance how many there are, you can do this:
>>> poke_def_int = poke_def.mapValues(int)
>>> max_defense = poke_def_int.values().max()
>>> best_defense_pokemonRDD = poke_def_int.filter(lambda x: x[1] == max_defense)
>>> best_defense_pokemonRDD.collect()
[('SteelixMega Steelix', 230), ('Shuckle', 230), ('AggronMega Aggron', 230)]
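If a single pass over the data is preferred to the max-then-filter approach, one possible alternative is a reduce whose merge function keeps ties. The sketch below is plain Python so it runs without Spark; keep_ties and the sample rows are hypothetical names introduced for illustration. With an RDD, the same function could presumably be used as poke_def_int.map(lambda kv: (kv[1], [kv[0]])).reduce(keep_ties), in which case the order of the collected names is not guaranteed.

```python
from functools import reduce

def keep_ties(a, b):
    """Merge two (defense, names) pairs, keeping every name tied for the max."""
    if a[0] > b[0]:
        return a
    if b[0] > a[0]:
        return b
    # Equal defense: concatenate the name lists so no tied entry is lost.
    return (a[0], a[1] + b[1])

# Hypothetical sample rows in the same (name, defense) shape as poke_def_int.
rows = [("Shuckle", 230), ("Onix", 160), ("AggronMega Aggron", 230)]

# Wrap each name in a list so that ties can accumulate during the reduce.
pairs = [(defense, [name]) for name, defense in rows]

max_defense, names = reduce(keep_ties, pairs)
print(max_defense, names)
```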