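Title: Mining the Web: Discovering Knowledge from Hypertext Data (English edition) / Turing Original Edition Computer Science Series
Author: Chakrabarti (India)
Publisher: Posts & Telecom Press (人民邮电出版社)
Overview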
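Editor's Recommendation

This book is a classic in the field of Web mining and search engines. It has been well received since publication and has been adopted as a textbook by leading universities including Stanford, Princeton, and Carnegie Mellon. It first introduces foundational topics such as Web crawling and search, then builds on them to explain in depth the machine learning techniques used to solve Web mining problems. It presents many applications of machine learning to systems that acquire, store, and analyze data, and discusses the strengths, weaknesses, and prospects of those applications.

The analysis is thorough and forward-looking, and it lays a theoretical and practical foundation for building innovative Web mining applications. The book suits researchers, faculty, and students in information retrieval and machine learning, and it is also an excellent reference for Web developers.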

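Content Summary

This book is a well-known work in information retrieval that explains in depth the techniques for extracting and producing knowledge from large volumes of unstructured Web data. It first covers the foundations of the Web (Web crawling, Web indexing, and keyword-based or similarity-based search), then systematically describes the basics of Web mining, with emphasis on hypertext-based machine learning and data mining methods such as clustering, collaborative filtering, supervised learning, and semisupervised learning. It closes by showing how these principles are applied in Web mining. The book gives readers a solid technical background and up-to-date knowledge.

It is an ideal reference for professionals engaged in data mining research and development, and it is also suitable as a textbook for graduate students in computer science and related fields.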

Contents

1   INTRODUCTION

 1.1 Crawling and Indexing

 1.2 Topic Directories

 1.3 Clustering and Classification

 1.4 Hyperlink Analysis

 1.5 Resource Discovery and Vertical Portals

 1.6 Structured vs. Unstructured Data Mining

 1.7 Bibliographic Notes

PART Ⅰ INFRASTRUCTURE

2   CRAWLING THE WEB

 2.1 HTML and HTTP Basics

 2.2 Crawling Basics

 2.3 Engineering Large-Scale Crawlers

    2.3.1 DNS Caching, Prefetching, and Resolution

    2.3.2 Multiple Concurrent Fetches

    2.3.3 Link Extraction and Normalization

    2.3.4 Robot Exclusion

    2.3.5 Eliminating Already-Visited URLs

    2.3.6 Spider Traps

    2.3.7 Avoiding Repeated Expansion of Links on Duplicate Pages

    2.3.8 Load Monitor and Manager

    2.3.9 Per-Server Work-Queues

    2.3.10 Text Repository

    2.3.11 Refreshing Crawled Pages

 2.4 Putting Together a Crawler

    2.4.1 Design of the Core Components

    2.4.2 Case Study: Using w3c-libwww

 2.5 Bibliographic Notes

3   WEB SEARCH AND INFORMATION RETRIEVAL

 3.1 Boolean Queries and the Inverted Index

    3.1.1 Stopwords and Stemming

    3.1.2 Batch Indexing and Updates

    3.1.3 Index Compression Techniques

 3.2 Relevance Ranking

    3.2.1 Recall and Precision

    3.2.2 The Vector-Space Model

    3.2.3 Relevance Feedback and Rocchio's Method

    3.2.4 Probabilistic Relevance Feedback Models

    3.2.5 Advanced Issues

 3.3 Similarity Search

    3.3.1 Handling "Find-Similar" Queries

    3.3.2 Eliminating Near Duplicates via Shingling

    3.3.3 Detecting Locally Similar Subgraphs of the Web

 3.4 Bibliographic Notes

PART Ⅱ LEARNING

4   SIMILARITY AND CLUSTERING

 4.1 Formulations and Approaches

    4.1.1 Partitioning Approaches

    4.1.2 Geometric Embedding Approaches

    4.1.3 Generative Models and Probabilistic Approaches

 4.2 Bottom-Up and Top-Down Partitioning Paradigms

    4.2.1 Agglomerative Clustering

    4.2.2 The k-Means Algorithm

 4.3 Clustering and Visualization via Embeddings

    4.3.1 Self-Organizing Maps (SOMs)

    4.3.2 Multidimensional Scaling (MDS) and FastMap

    4.3.3 Projections and Subspaces

    4.3.4 Latent Semantic Indexing (LSI)

 4.4 Probabilistic Approaches to Clustering

    4.4.1 Generative Distributions for Documents

    4.4.2 Mixture Models and Expectation Maximization (EM)

    4.4.3 Multiple Cause Mixture Model (MCMM)

    4.4.4 Aspect Models and Probabilistic LSI

    4.4.5 Model and Feature Selection

 4.5 Collaborative Filtering

    4.5.1 Probabilistic Models

    4.5.2 Combining Content-Based and Collaborative Features

 4.6 Bibliographic Notes

5   SUPERVISED LEARNING

 5.1 The Supervised Learning Scenario

 5.2 Overview of Classification Strategies

 5.3 Evaluating Text Classifiers

    5.3.1 Benchmarks

    5.3.2 Measures of Accuracy

 5.4 Nearest Neighbor Learners

    5.4.1 Pros and Cons

    5.4.2 Is TFIDF Appropriate?

 5.5 Feature Selection

    5.5.1 Greedy Inclusion Algorithms

    5.5.2 Truncation Algorithms

    5.5.3 Comparison and Discussion

 5.6 Bayesian Learners

    5.6.1 Naive Bayes Learners

    5.6.2 Small-Degree Bayesian Networks

 5.7 Exploiting Hierarchy among Topics

    5.7.1 Feature Selection

    5.7.2 Enhanced Parameter Estimation

    5.7.3 Training and Search Strategies

 5.8 Maximum Entropy Learners

 5.9 Discriminative Classification

    5.9.1 Linear Least-Square Regression

    5.9.2 Support Vector Machines

 5.10 Hypertext Classification

    5.10.1 Representing Hypertext for Supervised Learning

    5.10.2 Rule Induction

 5.11 Bibliographic Notes

6   SEMISUPERVISED LEARNING

 6.1 Expectation Maximization

    6.1.1 Experimental Results

    6.1.2 Reducing the Belief in Unlabeled Documents

    6.1.3 Modeling Labels Using Many Mixture Components

 6.2 Labeling Hypertext Graphs

    6.2.1 Absorbing Features from Neighboring Pages

    6.2.2 A Relaxation Labeling Algorithm

    6.2.3 A Metric Graph-Labeling Problem

 6.3 Co-training

 6.4 Bibliographic Notes

PART Ⅲ APPLICATIONS

7   SOCIAL NETWORK ANALYSIS

 7.1 Social Sciences and Bibliometry

    7.1.1 Prestige

    7.1.2 Centrality

    7.1.3 Co-citation

 7.2 PageRank and HITS

    7.2.1 PageRank

    7.2.2 HITS

    7.2.3 Stochastic HITS and Other Variants

 7.3 Shortcomings of the Coarse-Grained Graph Model

    7.3.1 Artifacts of Web Authorship

    7.3.2 Topic Contamination and Drift

 7.4 Enhanced Models and Techniques

    7.4.1 Avoiding Two-Party Nepotism

    7.4.2 Outlier Elimination

    7.4.3 Exploiting Anchor Text

    7.4.4 Exploiting Document Markup Structure

 7.5 Evaluation of Topic Distillation

    7.5.1 HITS and Related Algorithms

    7.5.2 Effect of Exploiting Other Hypertext Features

 7.6 Measuring and Modeling the Web

    7.6.1 Power-Law Degree Distributions

    7.6.2 The "Bow Tie" Structure and Bipartite Cores

    7.6.3 Sampling Web Pages at Random

 7.7 Bibliographic Notes

8   RESOURCE DISCOVERY

 8.1 Collecting Important Pages Preferentially

    8.1.1 Crawling as Guided Search in a Graph

    8.1.2 Keyword-Based Graph Search

 8.2 Similarity Search Using Link Topology

 8.3 Topical Locality and Focused Crawling

    8.3.1 Focused Crawling

    8.3.2 Identifying and Exploiting Hubs

    8.3.3 Learning Context Graphs

    8.3.4 Reinforcement Learning

 8.4 Discovering Communities

    8.4.1 Bipartite Cores as Communities

    8.4.2 Network Flow/Cut-Based Notions of Communities

 8.5 Bibliographic Notes

9   THE FUTURE OF WEB MINING

 9.1 Information Extraction

 9.2 Natural Language Processing

    9.2.1 Lexical Networks and Ontologies

    9.2.2 Part-of-Speech and Sense Tagging

    9.2.3 Parsing and Knowledge Representation

 9.3 Question Answering

 9.4 Profiles, Personalization, and Collaboration

References

Index

Copyright © 2002-2024 101bt.net All Rights Reserved
Last updated: 2025/3/1 20:11:23