Spark-shell RDD operators
Given the movie dataset movie_metadata.csv, which contains 28 columns separated by commas; the meaning of each column can be inferred from its name.
Read the file with Spark as follows:

// Read the file into an RDD and filter out the header (metadata) line
val rdd = sc.textFile("movie_metadata.csv").filter(!_.startsWith("color,director_name"))
// Split each line on ","
val movieRdd = rdd.map(_.split(","))
Based on the code above, use the RDD API in spark-shell (or an IDE) to implement the following:
1. Output the names of all distinct countries that appear in the dataset (using the country column, at index 20).
2. Output the number of Chinese films in the dataset (using the country column).
3. Output the title, director, and release year of the three Chinese films with the most votes (using the movie_title, director_name, num_voted_users, country, and title_year columns).
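One possible approach to the three tasks can be sketched with plain Scala collections, which support the same map / filter / distinct / sortBy / take operators as an RDD, so the logic carries over to spark-shell almost verbatim (the RDD-specific variants are noted in comments). The sample rows, film names, vote counts, and the column indices other than 20 (here assumed: director_name = 1, movie_title = 11, num_voted_users = 12, title_year = 23) are illustrative assumptions, not values from the real dataset.

```scala
// Build one CSV line with only the fields we need filled in;
// a real movie_metadata.csv row has 28 comma-separated fields.
// Assumed indices: 1 = director_name, 11 = movie_title,
// 12 = num_voted_users, 20 = country, 23 = title_year.
def row(director: String, title: String, votes: Int, country: String, year: Int): String = {
  val f = Array.fill(28)("")
  f(1) = director; f(11) = title; f(12) = votes.toString
  f(20) = country; f(23) = year.toString
  f.mkString(",")
}

// Hypothetical sample standing in for the real file
val sample = Seq(
  row("Director A", "Film A", 300, "China", 2002),
  row("Director B", "Film B", 500, "China", 2000),
  row("Director C", "Film C", 100, "China", 2010),
  row("Director D", "Film D", 900, "USA", 2005)
)

// split(",", -1) keeps trailing empty fields so index 23 stays valid
val movieRdd = sample.map(_.split(",", -1))

// Task 1: all distinct country names (column 20)
val countries = movieRdd.map(_(20)).distinct

// Task 2: number of Chinese films
// (on a real RDD: movieRdd.filter(_(20) == "China").count())
val chineseCount = movieRdd.count(_(20) == "China")

// Task 3: title, director, and year of the three most-voted Chinese films
// (on a real RDD: .sortBy(_(12).toInt, ascending = false) instead of negating)
val top3 = movieRdd
  .filter(_(20) == "China")
  .sortBy(r => -r(12).toInt)
  .take(3)
  .map(r => (r(11), r(1), r(23)))
```

To run this against the actual dataset in spark-shell, replace `sample.map(...)` with the `movieRdd` built from `sc.textFile` and print the results with `collect()` / `foreach(println)`.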