In modern scientific research, massive datasets with huge numbers of observations are frequently encountered. To ease the computational burden, a divide-and-conquer scheme is often used for the analysis of big data. In such a strategy, the full dataset is first split into several manageable segments; the final output is then aggregated from the individual outputs of the segments. Despite its popularity in practice, it remains largely unknown whether such a distributed strategy yields theoretically valid inference for the original data and, if so, how efficiently it works. In this talk, I address these fundamental issues for non-parametric distributed kernel regression, where accurate prediction is the main learning task. I will begin with the naive simple averaging algorithm and then discuss an improved approach based on the alternating direction method of multipliers (ADMM). The promising performance of these methods is supported by both simulation and real data examples.
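To make the naive simple averaging step concrete, the following is a minimal sketch, not the method presented in the talk. It assumes kernel ridge regression with a Gaussian kernel as the per-segment estimator; the function names (`rbf_kernel`, `fit_krr`, `averaged_krr_predict`), the parameters `lam`, `gamma`, and `n_segments`, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian kernel matrix between rows of A and rows of B (assumed kernel choice).
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_krr(X, y, lam=1e-3, gamma=1.0):
    # Kernel ridge regression on one segment: solve (K + n*lam*I) alpha = y.
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return X, alpha

def predict_krr(model, X_new, gamma=1.0):
    # Predict with a fitted segment model at new inputs.
    X_train, alpha = model
    return rbf_kernel(X_new, X_train, gamma) @ alpha

def averaged_krr_predict(X, y, X_new, n_segments=4, lam=1e-3, gamma=1.0, seed=0):
    # Divide: randomly split the data into roughly equal segments.
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(X.shape[0]), n_segments)
    # Conquer: fit each segment independently, then average the predictions.
    preds = [predict_krr(fit_krr(X[p], y[p], lam, gamma), X_new, gamma) for p in parts]
    return np.mean(preds, axis=0)

# Synthetic example (illustrative data only): noisy sine curve.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(400)
X_new = np.linspace(-2.5, 2.5, 50)[:, None]
y_hat = averaged_krr_predict(X, y, X_new)
```

Each segment requires solving a linear system of size roughly n / m instead of n, which is where the computational saving comes from; whether the averaged predictor retains the accuracy of the full-data estimator is exactly the question the talk addresses.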