MapReduce: Processing Data in a Large Cluster
Abstract
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
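To make the map and reduce roles described in the abstract concrete, here is a minimal single-process sketch in Python that counts words. It is an illustration only, not the implementation the abstract refers to: the names run_mapreduce, map_word_count, and reduce_word_count are hypothetical, and the sketch leaves out everything the run-time system handles (input partitioning, scheduling across machines, failure handling, and inter-machine communication).

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_word_count(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """Map: for a (document name, text) pair, emit an intermediate (word, 1) pair per word."""
    for word in value.split():
        yield word.lower(), 1


def reduce_word_count(key: str, values: Iterable[int]) -> int:
    """Reduce: merge all intermediate values that share the same intermediate key."""
    return sum(values)


def run_mapreduce(
    inputs: Iterable[Tuple[str, str]],
    map_fn: Callable[[str, str], Iterator[Tuple[str, int]]],
    reduce_fn: Callable[[str, Iterable[int]], int],
) -> Dict[str, int]:
    """Hypothetical in-memory driver: run map, group by intermediate key, then run reduce."""
    intermediate: Dict[str, List[int]] = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    # Keys are visited in sorted order so the result is deterministic.
    return {k: reduce_fn(k, vs) for k, vs in sorted(intermediate.items())}


# Made-up input documents, used only for illustration.
documents = [
    ("doc1", "the quick brown fox"),
    ("doc2", "the lazy dog and the fox"),
]
counts = run_mapreduce(documents, map_word_count, reduce_word_count)
print(counts)
```

In this sketch the user writes only the two small functions; the grouping step in run_mapreduce stands in for the work that a distributed run-time would spread across many machines.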