Implementation of the K-means Algorithm for Big Data Analysis
Abstract
In recent years, with the rise of big data, a growing number of enterprises and individuals have turned their attention to the field. Compared with ordinary data, big data is characterized by its volume, variety, and velocity. Volume refers to the amount of data: big data comes from a great many sources, including social media and the Internet of Things, and its sheer size makes it difficult for traditional databases and data-processing models to cope with the explosive growth of information. To handle big data better, corresponding technologies for data storage, data processing, and data mining have emerged. This thesis approaches the topic from two directions, the underlying storage technology and the upper-layer computing technology. Taking three currently popular big data computing platforms, Hadoop, Storm, and Spark, as research objects, it examines the similarities and differences in their storage technology and explores the characteristics and limitations of the MapReduce distributed computing architecture. It then compares the similarities, differences, advantages, and disadvantages of the three platforms. Finally, it takes the clustering algorithms among artificial intelligence algorithms as the research target, with a focus on the K-means algorithm, and implements and simulates the algorithm to explore its defects and the possibilities and methods for improving it.
Keywords: Big data; Clustering algorithm; K-means algorithm
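Contents
Chapter 1 Introduction
1.1 Research Background and Significance
1.2 Current State of Research at Home and Abroad
1.3 Research Content of This Thesis
Chapter 2 Big Data Technologies
2.1 The Big Data Ecosystem
2.2 Overview of Distributed Computing Technology
2.3 Hadoop
2.4 Storm and Spark
2.5 Summary
Chapter 3 Artificial Intelligence Algorithms
3.1 Overview of Big Data Algorithms
3.2 Clustering Algorithms
3.2.1 The K-means Algorithm
3.2.2 Mean Shift Clustering
3.2.3 The DBSCAN Algorithm
3.2.4 Brief Introduction to Other Clustering Algorithms
3.3 Summary
Chapter 4 Implementation of the K-means Algorithm
4.1 Key Points of K-means
4.2 K-means Simulation
4.3 Some Improvements to the K-means Algorithm
4.4 Summary
Chapter 5 Conclusion and Outlook
5.1 Summary of This Thesis
5.2 Outlook
References
Acknowledgements
Appendix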