Mining the Web: Discovering Knowledge from Hypertext Data (English edition) / Turing Original Computer Science Series, Soumen Chakrabarti (India), Posts & Telecom Press

1 INTRODUCTION

1.1 Crawling and Indexing

1.2 Topic Directories

1.3 Clustering and Classification

1.4 Hyperlink Analysis

1.5 Resource Discovery and Vertical Portals

1.6 Structured vs. Unstructured Data Mining

1.7 Bibliographic Notes

PART Ⅰ INFRASTRUCTURE

2 CRAWLING THE WEB

2.1 HTML and HTTP Basics

2.2 Crawling Basics

2.3 Engineering Large-Scale Crawlers

2.3.1 DNS Caching, Prefetching, and Resolution

2.3.2 Multiple Concurrent Fetches

2.3.3 Link Extraction and Normalization

2.3.4 Robot Exclusion

2.3.5 Eliminating Already-Visited URLs

2.3.6 Spider Traps

2.3.7 Avoiding Repeated Expansion of Links on Duplicate Pages

2.3.8 Load Monitor and Manager

2.3.9 Per-Server Work-Queues

2.3.10 Text Repository

2.3.11 Refreshing Crawled Pages

2.4 Putting Together a Crawler

2.4.1 Design of the Core Components

2.4.2 Case Study: Using w3c-libwww

2.5 Bibliographic Notes

3 WEB SEARCH AND INFORMATION RETRIEVAL

3.1 Boolean Queries and the Inverted Index

3.1.1 Stopwords and Stemming

3.1.2 Batch Indexing and Updates

3.1.3 Index Compression Techniques

3.2 Relevance Ranking

3.2.1 Recall and Precision

3.2.2 The Vector-Space Model

3.2.3 Relevance Feedback and Rocchio's Method

3.2.4 Probabilistic Relevance Feedback Models

3.2.5 Advanced Issues

3.3 Similarity Search

3.3.1 Handling "Find-Similar" Queries

3.3.2 Eliminating Near Duplicates via Shingling

3.3.3 Detecting Locally Similar Subgraphs of the Web

3.4 Bibliographic Notes

PART Ⅱ LEARNING

4 SIMILARITY AND CLUSTERING

4.1 Formulations and Approaches

4.1.1 Partitioning Approaches

4.1.2 Geometric Embedding Approaches

4.1.3 Generative Models and Probabilistic Approaches

4.2 Bottom-Up and Top-Down Partitioning Paradigms

4.2.1 Agglomerative Clustering

4.2.2 The k-Means Algorithm

4.3 Clustering and Visualization via Embeddings

4.3.1 Self-Organizing Maps (SOMs)

4.3.2 Multidimensional Scaling (MDS) and FastMap

4.3.3 Projections and Subspaces

4.3.4 Latent Semantic Indexing (LSI)

4.4 Probabilistic Approaches to Clustering

4.4.1 Generative Distributions for Documents

4.4.2 Mixture Models and Expectation Maximization (EM)

4.4.3 Multiple Cause Mixture Model (MCMM)

4.4.4 Aspect Models and Probabilistic LSI

4.4.5 Model and Feature Selection

4.5 Collaborative Filtering

4.5.1 Probabilistic Models

4.5.2 Combining Content-Based and Collaborative Features

4.6 Bibliographic Notes

5 SUPERVISED LEARNING

5.1 The Supervised Learning Scenario

5.2 Overview of Classification Strategies

5.3 Evaluating Text Classifiers

5.3.1 Benchmarks

5.3.2 Measures of Accuracy

5.4 Nearest Neighbor Learners

5.4.1 Pros and Cons

5.4.2 Is TFIDF Appropriate?

5.5 Feature Selection

5.5.1 Greedy Inclusion Algorithms

5.5.2 Truncation Algorithms

5.5.3 Comparison and Discussion

5.6 Bayesian Learners

5.6.1 Naive Bayes Learners

5.6.2 Small-Degree Bayesian Networks

5.7 Exploiting Hierarchy among Topics

5.7.1 Feature Selection

5.7.2 Enhanced Parameter Estimation

5.7.3 Training and Search Strategies

5.8 Maximum Entropy Learners

5.9 Discriminative Classification

5.9.1 Linear Least-Square Regression

5.9.2 Support Vector Machines

5.10 Hypertext Classification

5.10.1 Representing Hypertext for Supervised Learning

5.10.2 Rule Induction

5.11 Bibliographic Notes

6 SEMISUPERVISED LEARNING

6.1 Expectation Maximization

6.1.1 Experimental Results

6.1.2 Reducing the Belief in Unlabeled Documents

6.1.3 Modeling Labels Using Many Mixture Components

6.2 Labeling Hypertext Graphs

6.2.1 Absorbing Features from Neighboring Pages

6.2.2 A Relaxation Labeling Algorithm

6.2.3 A Metric Graph-Labeling Problem

6.3 Co-training

6.4 Bibliographic Notes

PART Ⅲ APPLICATIONS

7 SOCIAL NETWORK ANALYSIS

7.1 Social Sciences and Bibliometry

7.1.1 Prestige

7.1.2 Centrality

7.1.3 Co-citation

7.2 PageRank and HITS

7.2.1 PageRank

7.2.2 HITS

7.2.3 Stochastic HITS and Other Variants

7.3 Shortcomings of the Coarse-Grained Graph Model

7.3.1 Artifacts of Web Authorship

7.3.2 Topic Contamination and Drift

7.4 Enhanced Models and Techniques

7.4.1 Avoiding Two-Party Nepotism

7.4.2 Outlier Elimination

7.4.3 Exploiting Anchor Text

7.4.4 Exploiting Document Markup Structure

7.5 Evaluation of Topic Distillation

7.5.1 HITS and Related Algorithms

7.5.2 Effect of Exploiting Other Hypertext Features

7.6 Measuring and Modeling the Web

7.6.1 Power-Law Degree Distributions

7.6.2 The "Bow Tie" Structure and Bipartite Cores

7.6.3 Sampling Web Pages at Random

7.7 Bibliographic Notes

8 RESOURCE DISCOVERY

8.1 Collecting Important Pages Preferentially

8.1.1 Crawling as Guided Search in a Graph

8.1.2 Keyword-Based Graph Search

8.2 Similarity Search Using Link Topology

8.3 Topical Locality and Focused Crawling

8.3.1 Focused Crawling

8.3.2 Identifying and Exploiting Hubs

8.3.3 Learning Context Graphs

8.3.4 Reinforcement Learning

8.4 Discovering Communities

8.4.1 Bipartite Cores as Communities

8.4.2 Network Flow/Cut-Based Notions of Communities

8.5 Bibliographic Notes

9 THE FUTURE OF WEB MINING

9.1 Information Extraction

9.2 Natural Language Processing

9.2.1 Lexical Networks and Ontologies

9.2.2 Part-of-Speech and Sense Tagging

9.2.3 Parsing and Knowledge Representation

9.3 Question Answering

9.4 Profiles, Personalization, and Collaboration

References

Index

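Title	Mining the Web: Discovering Knowledge from Hypertext Data (English edition) / Turing Original Computer Science Series
Author	Soumen Chakrabarti (India)
Publisher	Posts & Telecom Press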
Description	Editor's recommendation: This book is a classic in the field of Web mining and search engines. It has been well received since publication and has been adopted as a textbook by leading universities such as Stanford, Princeton, and Carnegie Mellon. It first introduces many foundational topics such as Web crawling and search, and on that basis explains in depth the machine learning techniques involved in solving the various problems of Web mining. It presents many applications of machine learning in systems that acquire, store, and analyze data, and discusses the strengths, weaknesses, and prospects of those applications. The analysis throughout is thorough and forward-looking, laying a theoretical and practical foundation for building innovative Web mining applications. The book is suitable for researchers, faculty, and students in information retrieval and machine learning, and is also an excellent reference for Web developers.

Content summary: This book is a well-known work in information retrieval that explains in depth the techniques for extracting and producing knowledge from large volumes of unstructured Web data. It first discusses the foundations of the Web (including Web crawling mechanisms, Web indexing mechanisms, and keyword-based or similarity-based search mechanisms), then systematically describes the fundamentals of Web mining, focusing on hypertext-based machine learning and data mining methods such as clustering, collaborative filtering, supervised learning, and semisupervised learning, and finally covers the application of these principles to Web mining. It gives readers a solid technical background and up-to-date knowledge. It is an ideal reference for professionals engaged in data mining research and development, and is also suitable as a textbook for graduate students in computer science and related disciplines.