I want to understand Spark by reading the source code from the Apache Spark GitHub repository:
https://github.com/apache/spark
I have some experience in Scala, but most of my experience has been in PySpark. I understand the Spark architecture and various optimization techniques too, but I am curious how they are implemented internally. For example, what happens when I call the repartition() method? Could anyone from the community guide me on how I should go about it?
CodePudding user response:
Use IntelliJ IDEA to open the Apache Spark sources. You can import them as either a Maven or an sbt project (just pick whichever build configuration you prefer; the repo ships with both).
Once the above's done, Cmd Option o to find a symbol of your interest, e.g. repartition()
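For example, jumping to repartition(numPartitions: Int) on the Dataset side lands you in sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala, on something like the following (paraphrased from the Spark sources; the exact shape varies by version):

```scala
// Dataset.scala (approximate excerpt, may differ by Spark version).
// Note that repartition() does no work itself: it just wraps the current
// logical plan in a Repartition node with shuffle = true, which Catalyst
// plans into an actual exchange later.
def repartition(numPartitions: Int): Dataset[T] = withTypedPlan {
  Repartition(numPartitions, shuffle = true, logicalPlan)
}
```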
Then use Cmd+B (Go to Declaration) to drill down until you're at the very bottom of the call chain, and work your way back up (to take a breath... not a break!). Rinse and repeat.
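As a taste of what the drill-down turns up: on the RDD side, repartition() is a one-liner over coalesce() with shuffling forced on (again paraphrased; check your version's core/src/main/scala/org/apache/spark/rdd/RDD.scala):

```scala
// RDD.scala (approximate excerpt).
// repartition() simply delegates to coalesce() with shuffle = true;
// coalesce(shuffle = true) in turn distributes elements with a
// HashPartitioner and builds a ShuffledRDD, which is where the
// actual data movement happens.
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}
```

One more Cmd+B into coalesce() takes you to ShuffledRDD, and from there into the shuffle machinery itself.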