How to Scrape Data on 1 Million Google Play Apps in 30 Lines of Code

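Groundwork:

Built-in element selectors

Serializing and storing data

Middleware for handling cookies, HTTP headers, and the like

Crawling Sitemaps or RSS feeds

And so on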

My goal is to collect the page link and the download count of every app on the Google Play store.
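To make the link-collecting half of that goal concrete, here is a minimal sketch that uses only the Python standard library rather than Scrapy. It pulls app detail links out of a chunk of HTML and discards off-site and duplicate links. The `/store/apps/details` path, the `AppLinkParser` and `extract_app_links` names, and the sample HTML are all assumptions made for illustration; they are not part of the spider built in this article.

```python
try:  # Python 3
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
except ImportError:  # Python 2.7, which this article targets
    from HTMLParser import HTMLParser
    from urlparse import urljoin, urlparse


class AppLinkParser(HTMLParser):
    """Collect absolute links to app detail pages found in <a href="..."> tags."""

    def __init__(self, base_url):
        HTMLParser.__init__(self)
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page they were found on.
        url = urljoin(self.base_url, href)
        parts = urlparse(url)
        # Assumed URL shape of an app detail page (hypothetical filter).
        on_site = parts.netloc == "play.google.com"
        if on_site and parts.path == "/store/apps/details":
            if url not in self.links:  # skip duplicates, keep first-seen order
                self.links.append(url)


def extract_app_links(html, base_url="https://play.google.com/store/apps"):
    """Return the de-duplicated app detail links contained in `html`."""
    parser = AppLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Made-up sample markup: one app link, one off-site link, one duplicate.
sample = (
    '<a href="/store/apps/details?id=com.example.one">One</a>'
    '<a href="https://example.org/elsewhere">Off-site</a>'
    '<a href="/store/apps/details?id=com.example.one">Duplicate</a>'
)
print(extract_app_links(sample))
```

A crawling framework does this same filtering and de-duplication for you, along with scheduling the requests, which is why the real spider can stay so short.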

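First, make sure that Python 2.7, a MongoDB database, and Python's pip package manager are installed and configured.

Then install the required Python packages and generate the project template: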

pip install scrapy scrapy-mongodb

scrapy startproject app

cd app

scrapy genspider google play.google.com

Then replace the contents of app/spiders/google.py with the following:

`# -*- coding: utf-8 -*-

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.linkextractors import LinkExtractor
from app.items import GoogleItem


class GoogleSpider(CrawlSpider):
    name = "google"
    allowed_domains = ["play.google.com"]
    start_urls = [