I'm a beginner with Spark and have the following question:
I've always assumed that distributed computing frameworks are well suited to statistics-style computations. But for machine learning model training, if the data is spread across multiple nodes and I want to train a model on all of it, how does Spark actually run this? Does it (1) train a separate model on the data held by each node, or (2) collect the data together first and then train a single model?
If it's approach (1), the multiple models trained that way wouldn't match a model built from the complete dataset; if it's approach (2), how does Spark carry out those operations internally? Any guidance would be much appreciated.
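For reference, here is a minimal sketch of the scenario the question describes, using Spark MLlib's `LogisticRegression` (the tiny dataset and parameter values are made up purely for illustration). The point it shows: `fit()` is called once on the driver against a DataFrame whose partitions live on the worker nodes, and MLlib runs the iterative optimization by computing partial gradients per partition and aggregating them, so one global model is produced without collecting the raw data centrally.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

object DistributedTrainingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("distributed-training-sketch")
      .getOrCreate()
    import spark.implicits._

    // A small labeled dataset for illustration; in practice this DataFrame
    // would be backed by partitions spread across the cluster's worker nodes.
    val training = Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    ).toDF("label", "features")

    // fit() is invoked on the driver; each iteration computes partial
    // gradients on every partition and aggregates them into one update,
    // yielding a single global model rather than one model per node.
    val lr = new LogisticRegression().setMaxIter(10)
    val model = lr.fit(training)

    println(s"Coefficients: ${model.coefficients}  Intercept: ${model.intercept}")
    spark.stop()
  }
}
```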