Description

If programming is magic, then web scraping is surely a form of wizardry. By writing a simple automated program, you can query web servers, request data, and parse it to extract the information you need. This expanded edition of the practical book not only introduces web scraping but also serves as a comprehensive guide to scraping almost every type of data from the modern web.

Part I of Web Scraping with Python (2nd Edition, English-language reprint) by Ryan Mitchell focuses on the mechanics of web scraping: using Python to request information from a web server, performing basic handling of the server's response, and interacting with sites in an automated fashion. Part II explores a variety of more specific tools and applications to fit any web scraping scenario you are likely to encounter.

About the Author

Ryan Mitchell is a senior software engineer at HedgeServ in Boston, where she develops the company's APIs and data analysis tools. She graduated from Olin College of Engineering and holds a master's degree in software engineering and a certificate in data science from Harvard University Extension School. Before joining HedgeServ, she worked at Abine, developing web scraping and automation tools in Python. She regularly consults on web scraping projects in the retail, finance, and pharmaceutical industries, and has served as a curriculum consultant and adjunct faculty member at Northeastern University and Olin College of Engineering.

Contents

Preface

Part I. Building Scrapers

1. Your First Web Scraper
   Connecting; An Introduction to BeautifulSoup; Installing BeautifulSoup; Running BeautifulSoup; Connecting Reliably and Handling Exceptions
2. Advanced HTML Parsing
   You Don't Always Need a Hammer; Another Serving of BeautifulSoup; find() and find_all() with BeautifulSoup; Other BeautifulSoup Objects; Navigating Trees; Regular Expressions; Regular Expressions and BeautifulSoup; Accessing Attributes; Lambda Expressions
3. Writing Web Crawlers
   Traversing a Single Domain; Crawling an Entire Site; Collecting Data Across an Entire Site; Crawling Across the Internet
4. Web Crawling Models
   Planning and Defining Objects; Dealing with Different Website Layouts; Structuring Crawlers; Crawling Sites Through Search; Crawling Sites Through Links; Crawling Multiple Page Types; Thinking About Web Crawler Models
5. Scrapy
   Installing Scrapy; Initializing a New Spider; Writing a Simple Scraper; Spidering with Rules; Creating Items; Outputting Items; The Item Pipeline; Logging with Scrapy; More Resources
6. Storing Data
   Media Files; Storing Data to CSV; MySQL; Installing MySQL; Some Basic Commands; Integrating with Python; Database Techniques and Good Practice; "Six Degrees" in MySQL; Email

Part II. Advanced Scraping

7. Reading Documents
   Document Encoding; Text; Text Encoding and the Global Internet; CSV; Reading CSV Files; PDF; Microsoft Word and .docx
8. Cleaning Your Dirty Data
   Cleaning in Code; Data Normalization; Cleaning After the Fact; OpenRefine
9. Reading and Writing Natural Languages
   Summarizing Data; Markov Models; Six Degrees of Wikipedia: Conclusion; Natural Language Toolkit; Installation and Setup; Statistical Analysis with NLTK; Lexicographical Analysis with NLTK; Additional Resources
10. Crawling Through Forms and Logins
   Python Requests Library; Submitting a Basic Form; Radio Buttons, Checkboxes, and Other Inputs; Submitting Files and Images; Handling Logins and Cookies; HTTP Basic Access Authentication; Other Form Problems
11. Scraping JavaScript
   A Brief Introduction to JavaScript; Common JavaScript Libraries; Ajax and Dynamic HTML; Executing JavaScript in Python with Selenium; Additional Selenium Webdrivers; Handling Redirects; A Final Note on JavaScript
12. Crawling Through APIs
   A Brief Introduction to APIs; HTTP Methods and APIs; More About API Responses; Parsing JSON; Undocumented APIs; Finding Undocumented APIs; Documenting Undocumented APIs; Finding and Documenting APIs Automatically; Combining APIs with Other Data Sources; More About APIs
13. Image Processing and Text Recognition
   Overview of Libraries; Pillow; Tesseract; NumPy; Processing Well-Formatted Text; Adjusting Images Automatically; Scraping Text from Images on Websites; Reading CAPTCHAs and Training Tesseract; Training Tesseract; Retrieving CAPTCHAs and Submitting Solutions
14. Avoiding Scraping Traps
   A Note on Ethics; Looking Like a Human; Adjust Your Headers; Handling Cookies with JavaScript; Timing Is Everything; Common Form Security Features; Hidden Input Field Values; Avoiding Honeypots; The Human Checklist
15. Testing Your Website with Scrapers
   An Introduction to Testing; What Are Unit Tests?; Python unittest; Testing Wikipedia; Testing with Selenium; Interacting with the Site; unittest or Selenium?
16. Web Crawling in Parallel
   Processes versus Threads; Multithreaded Crawling; Race Conditions and Queues; The threading Module; Multiprocess Crawling; Multiprocess Crawling; Communicating Between Processes; Multiprocess Crawling--Another Approach
17. Scraping Remotely
   Why Use Remote Servers?; Avoiding IP Address Blocking; Portability and Extensibility; Tor; PySocks; Remote Hosting; Running from a Website-Hosting Account; Running from the Cloud; Additional Resources
18. The Legalities and Ethics of Web Scraping
   Trademarks, Copyrights, Pa |