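Book description

This book opens with basic statistical analysis of big data using Java, then moves on to other data analysis topics such as classification, regression, clustering, and ensembling. It also covers advanced topics such as recommendation engines, large-scale graph analytics, real-time analytics, and deep learning.

The book includes a range of case studies, for example sentiment analysis on a tweet dataset, recommendations on the MovieLens dataset, customer segmentation on an e-commerce dataset, and graph analysis on a real flight dataset. It is an end-to-end guide to implementing big data analytics in Java. Java is now the de facto language of mainstream big data environments, including Hadoop, and this book teaches you how to analyze big data with production-friendly Java.

The content falls broadly into two parts. The first part is an introduction that helps readers get familiar with the big data environment; the second contains the core discussion of all the concepts in big data analytics. It covers data analysis and data visualization, the core concepts and advantages of machine learning, real-world use of regression and of classification with Naive Bayes, an in-depth discussion of clustering, and a review of implementing simple neural networks on big data using Deeplearning4j or plain Java Spark code. For Java developers who want to start learning big data analytics and apply it in the real world, this is an essential book.

About the author

Rajat Mehta is a VP (technical architect) in technology at JP Morgan Chase in New York. He is a Sun-certified Java developer and has worked on Java-related technologies for more than 16 years. For the past few years his role has relied heavily on a big data stack and on running analytics on it. He also contributes to various open source projects, which are available on his GitHub repository, and writes frequently for developer magazines.

Table of contents

Preface

Chapter 1: Big Data Analytics with Java
  Why data analytics on big data?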
  Big data for analytics
  Big data - a bigger pay package for Java developers
  Basics of Hadoop - a Java sub-project
  Distributed computing on Hadoop
  HDFS concepts
  Design and architecture of HDFS
  Main components of HDFS
  HDFS simple commands
  Apache Spark
  Concepts
  Transformations
  Actions
  Spark Java API
  Spark samples using Java 8
  Loading data
  Data operations - cleansing and munging
  Analyzing data - count, projection, grouping, aggregation, and max/min
  Actions on RDDs
  Paired RDDs
  Saving data
  Collecting and printing results
  Executing Spark programs on Hadoop
  Apache Spark sub-projects
  Spark machine learning modules
  Mahout - a popular Java ML library
  Deeplearning4j - a deep learning library
  Summary

Chapter 2: First Steps in Data Analysis
  Datasets
  Data cleaning and munging
  Basic analysis of data with Spark SQL
  Building SparkConf and context
  Dataframe and datasets
  Load and parse data
  Analyzing data - the Spark-SQL way
  Spark SQL for data exploration and analytics
  Market basket analysis - Apriori algorithm
  Implementation of the Apriori algorithm in Apache Spark
  Efficient market basket analysis using FP-Growth algorithm
  Running FP-Growth on Apache Spark
  Summary

Chapter 3: Data Visualization
  Data visualization with Java JFreeChart
  Using charts in big data analytics
  Time Series chart
  All India seasonal and annual average temperature series dataset
  Simple single Time Series chart
  Multiple Time Series on a single chart window
  Bar charts
  Histograms
  When would you use a histogram?
  How to make histograms using JFreeChart?
  Line charts
  Scatter plots
  Box plots
  Advanced visualization technique
  Prefuse
  IVTK Graph toolkit
  Other libraries
  Summary

Chapter 4: Basics of Machine Learning
  What is machine learning?
  Real-life examples of machine learning
  Type of machine learning
  A small sample case study of supervised and unsupervised learning
  Steps for machine learning problems
  Choosing the machine learning model
  What are the feature types that can be extracted from the datasets?
  How do you select the best features to train your models?
  How do you run machine learning analytics on big data?
  Getting and preparing data in Hadoop
  Training and storing models on big data
  Apache Spark machine learning API
  Summary

Chapter 5: Regression on Big Data
  Linear regression
  What is simple linear regression?
  Where is linear regression used?
  Logistic regression
  Which mathematical functions does logistic regression use?
  Where is logistic regression used?
  Predicting heart disease using logistic regression
  Summary

Chapter 6: Naive Bayes and Sentiment Analysis
  Conditional probability
  Bayes theorem
  Naive Bayes algorithm
  Advantages of Naive Bayes
  Disadvantages of Naive Bayes
  Sentimental analysis
  Concepts for sentimental analysis
  Tokenization
  Stop words removal
  Stemming
  N-grams
  Term presence and Term Frequency
  TF-IDF
  Bag of words
  Dataset
  Data exploration of text data
  Sentimental analysis on this dataset
  SVM or Support Vector Machine
  Summary

Chapter 7: Decision Trees
  What is a decision tree?
  Building a decision tree
  Choosing the best features for splitting the datasets
  Dataset
  Data exploration
  Cleaning and munging the data
  Training and testing the model
  Summary

Chapter 8: Ensembling on Big Data
  Ensembling
  Types of ensembling
  Bagging
  Boosting
  Advantages and disadvantages of ensembling
  Random forests
  Gradient boosted trees (GBTs)
  Classification problem and dataset used
  Data exploration
  Training and testing our random forest model
  Training and testing our gradient boosted tree model
  Summary

Chapter 9: Recommendation Systems
  Recommendation systems and |