Scraping a Million Lianjia Transaction Records with scrapy-redis

Just as the New Year's bells rang in, I also wrapped up the crawler code for Lianjia's transaction records, starting with Beijing: nearly a million closed deals in total. The whole thing took two days — the core code was done on the 31st, and on New Year's Day I finished the rest and patched a few small bugs. First, a look at the scraped fields:

{
	"_id" : "101100313085",
	"index_url" : "https://bj.lianjia.com/chengjiao/dongcheng/pg1p2l2a2/",
	"detail_url" : "https://bj.lianjia.com/chengjiao/101100313085.html",
	"house_code" : "101100313085",
	"detail_title" : "定安里 2室1厅 54.34平米",
	"detail_dirFurnish" : "南 北 | 简装",
	"detail_dealDate" : "2016.07.21",
	"detail_floor" : "中楼层(共6层) 1980年建板楼",
	"detail_totalPrice" : "240",
	"detail_unitPrice" : "44167",
	"detail_listPrice" : "挂牌240万",
	"aprt_dealCycle" : "成交周期7天",
	"detail_agentName" : "李玉军",
	"detail_agentId" : "1000000010080328",
	"detail_dealInfo" : "定安里 2室1厅 54.34平米2016.07.21 成交",
	"detail_dealBread" : "北京房产网北京二手房成交东城二手房成交永定门二手房成交定安里二手房成交",
	"detail_priceChangeTimes" : "0",
	"detail_visitTimes" : "",
	"detail_followers" : "7",
	"detail_viewTimes" : "66",
	"detail_basic_info" : {
		"房屋户型" : "2室1厅1厨1卫",
		"所在楼层" : "中楼层(共6层)",
		"建筑面积" : "54.34㎡",
		"户型结构" : "平层",
		"套内面积" : "暂无数据",
		"建筑类型" : "板楼",
		"房屋朝向" : "南 北",
		"建成年代" : "1980",
		"装修情况" : "简装",
		"建筑结构" : "混合结构",
		"供暖方式" : "集中供暖",
		"梯户比例" : "一梯两户",
		"配备电梯" : "无"
	},
	"detail_transaction_info" : {
		"链家编号" : "101100313085",
		"交易权属" : "商品房",
		"挂牌时间" : "2016-07-15",
		"房屋用途" : "普通住宅",
		"房屋年限" : "满五年",
		"房权所属" : "非共有"
	},
	"detail_transaction_history" : "240万单价44167元/平,2016-07成交",
	"community_name" : "定安里",
	"community_url" : "https://bj.lianjia.com/chengjiao/c1111027376735",
	"community_info" : {
		
	},
	"detail_features" : {
		"房源标签" : "房本满五年"
	},
	"resblockPosition" : "116.418443,39.866651",
	"city_id" : "110000",
	"city_name" : "city_name: '北京'",
	"resblockId" : "1111027376735"
}

In a real production project you would save the raw HTML files as well, but my server has a mere 50 GB of disk with only a little over 10 GB free, which just can't take it, so instead I extract as many fields as possible at crawl time.

Lianjia's transaction pages do have anti-scraping measures, so you need a large pool of proxy IPs; otherwise the crawl rate is throttled.

  • https://bj.lianjia.com/chengjiao/ is the entry point. Each page shows only 30 listings and results are capped at 100 pages, so complete coverage is only possible by slicing the data into narrower categories. I wrote a separate script that combines filters for district, sale price, and layout to generate every listing URL (a sketch of the idea appears after the spider code below). For example, https://bj.lianjia.com/chengjiao/haidian/l3a3p4/ is Haidian, 3–4 million RMB, three bedrooms, 70–90 ㎡; all of these URLs can be constructed by parsing the page shown below.
(Figure: Lianjia transaction listings page)
  • Disable cookies and redirects, set timeout=3 and retry_times=5, allow 16 concurrent requests per IP, and cap per-machine concurrency — I used 80. With six ADSL dial-up servers as proxies, re-dialing every 20 seconds, Scrapy fetched about 10 listings per second, so the whole Beijing transaction history took little more than a day to collect.
  • I've covered ADSL VPS on this blog many times: even with only a few, try to use servers in different regions or even provinces to ensure IP diversity and high availability. In my experience the Jiangsu/Zhejiang and Guangdong areas have the largest IP pools — network infrastructure tends to track how developed the local internet economy is — so cities like Hangzhou, Jingdezhen, and Zhongshan are good picks.
  • The key spider code follows, after a short sketch of the settings and proxy middleware. In practice the middleware is what matters most: request headers, the referer, and proxies all have to be configured properly and tested thoroughly.
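
The post does not include the settings or middleware files themselves, so here is a minimal sketch of what the setup described above could look like. The numbers come from the bullet points; the proxy addresses, middleware module path, and Redis URL are placeholders, not the author's actual configuration.

# settings.py (excerpt) -- a sketch matching the numbers quoted above
COOKIES_ENABLED = False          # disable cookies
REDIRECT_ENABLED = False         # disable redirects
DOWNLOAD_TIMEOUT = 3
RETRY_TIMES = 5
CONCURRENT_REQUESTS = 80         # per-machine concurrency
CONCURRENT_REQUESTS_PER_IP = 16  # per-IP concurrency

# scrapy-redis scheduler and dupefilter so several machines share one queue
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = "redis://:password@127.0.0.1:6379/0"          # placeholder

DOWNLOADER_MIDDLEWARES = {
    "lianjia.middlewares.ProxyHeadersMiddleware": 543,    # hypothetical module path
}

# middlewares.py -- hypothetical ProxyHeadersMiddleware
import random

class ProxyHeadersMiddleware:
    """Attach a random proxy, a referer and a desktop User-Agent to every request."""

    # In the real setup these would be the current IPs of the ADSL dial-up
    # servers, refreshed as they re-dial; hard-coded here only for illustration.
    PROXIES = ["http://proxy1:3128", "http://proxy2:3128"]
    USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36")

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(self.PROXIES)
        request.headers["User-Agent"] = self.USER_AGENT
        # the spider passes the listing page through request.meta as the referer
        referer = request.meta.get("referer")
        if referer:
            request.headers["Referer"] = referer

And the spider itself: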
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import json
import logging
import uuid
import pickle
import scrapy
from scrapy_redis import spiders
from scrapy.utils.project import get_project_settings
from scrapy_redis.utils import bytes_to_str
import redis
import random
from scrapy_redis.spiders import RedisSpider
from lianjia.items import LianjiaItem
from lianjia.log import logger
import re
import sys

class DealsSpider(RedisSpider):
    name = 'deals'
    allowed_domains = ['lianjia.com']
    # start_urls = ['http://lianjia.com/']
    redis_key = 'lianjia:start_urls'
    
    def __init__(self, *args, **kwargs):
        super(DealsSpider, self).__init__(*args, **kwargs)
            
    def parse(self, response):
        index_url = response.url
        num_found = int(response.xpath('//div[@class="total fl"]/span/text()').extract_first())
        logger.info(f'num of apartments found in {index_url}: {num_found}')        
        if num_found > 0:
            try:
                logger.debug(f'index request.meta: {response.request.meta} {index_url}')
                logger.debug(f'index request.headers: {response.request.headers} {index_url}')            
                total_pages = int(num_found/30) + 1
                aprt_list = response.xpath('//ul[@class="listContent"]/li')
                logger.info(f'num of apartments in the current_pgNum: {len(aprt_list)}')
                pattern = re.compile(r'"curPage":\d+')
                curPage_ = re.search(pattern, response.text)[0]
                patternd = re.compile(r'\d+')
                current_pgNum = int(re.search(patternd, curPage_)[0])
                logger.info(f'curPage matched: {current_pgNum}')
                logger.debug(f'debug index_url: {index_url}')
                # current_pgNum = int(response.xpath('//div[@class="contentBottom clear"]/div[@class="page-box fr"]/div[@class="page-box house-lst-page-box"]/a[@class="on"]/text()').extract_first())            
                for li in aprt_list:

                    aprt_link = self.eleMissing(li.xpath('./a/@href').extract_first())
                   
                    aprt_title = self.eleMissing(self.strJoin(li.xpath('./div[@class="info"]/div[@class="title"]/a/text()').extract()))
                    aprt_dirFurnish = self.eleMissing(self.strJoin(li.xpath('./div[@class="info"]/div[@class="address"]/div[@class="houseInfo"]/text()').extract()))
                    aprt_dealDate = self.eleMissing(self.strJoin(li.xpath('./div[@class="info"]//div[@class="dealDate"]/text()').extract()))
                    aprt_floor = self.eleMissing(self.strJoin(li.xpath('./div[@class="info"]/div[@class="flood"]/div[@class="positionInfo"]/text()').extract()))                
                    aprt_totalPrice =  self.eleMissing(li.xpath('./div[@class="info"]/div[@class="address"]/div[@class="totalPrice"]/span[@class="number"]/text()').extract_first())
                    aprt_unitPrice = self.eleMissing(li.xpath('./div[@class="info"]/div[@class="flood"]/div[@class="unitPrice"]/span[@class="number"]/text()').extract_first())
                    aprt_features = self.eleMissing(li.xpath('./div[@class="info"]/div[@class="dealHouseInfo"]/span[@class="dealHouseTxt"]/span/text()').extract_first())                
                    aprt_listPrice = self.eleMissing(self.strJoin(li.xpath('./div[@class="info"]/div[@class="dealCycleeInfo"]/span[@class="dealCycleTxt"]/span[1]/text()').extract()))
                    aprt_dealCycle = self.eleMissing(li.xpath('./div[@class="info"]/div[@class="dealCycleeInfo"]/span[@class="dealCycleTxt"]/span[2]/text()').extract_first())
                    aprt_agent_name = self.eleMissing(li.xpath('./div[@class="info"]/div[@class="agentInfoList"]/a/text()').extract_first())
                    aprt_agent_id = self.eleMissing(li.xpath('./div[@class="info"]/div[@class="agentInfoList"]/div[@class="agent_chat_btn im-talk LOGCLICKDATA"]/@data-lj_action_agent_id').extract_first())                    
                    yield scrapy.Request(url=aprt_link, meta={'detail_url': aprt_link, 'detail_title': aprt_title, 'detail_dirFurnish': aprt_dirFurnish,
                    'detail_dealDate': aprt_dealDate, 'detail_floor': aprt_floor, 'detail_totalPrice': aprt_totalPrice, 'detail_unitPrice': aprt_unitPrice,
                    'detail_sellpoint': aprt_features, 'detail_listPrice': aprt_listPrice, 'aprt_dealCycle': aprt_dealCycle, 'index_url': index_url,
                    'detail_agent_name': aprt_agent_name, 'detail_agent_id': aprt_agent_id, 'dont_redirect': True, 'referer': index_url}, callback=self.parse_item, dont_filter=False)
                if current_pgNum < total_pages:
                    pg = 'pg' + str(current_pgNum)
                    next_url = re.sub(f'/{pg}', f'/pg{current_pgNum + 1}', index_url)
                    logger.debug(f'next_url: {next_url}')
                    yield scrapy.Request(url=next_url, callback=self.parse, dont_filter=False, meta={'dont_redirect': True, 'referer': index_url})
            except Exception as e:
                logger.info(e)
                # logger.info(response.text)
                # sys.exit()
    def parse_item(self, response):    
        logger.debug(f'request.meta: {response.request.meta} {response.url}')
        logger.debug(f'request.headers: {response.request.headers} {response.url}')     
        item = LianjiaItem()
        item['index_url'] = response.meta['index_url']
        item['detail_url'] = response.meta['detail_url']
        item['house_code'] = response.meta['detail_url'].split('/')[-1].split('.')[0]
        item['_id'] = item['house_code']
        item['detail_title'] = response.meta['detail_title']
        item['detail_dirFurnish'] = response.meta['detail_dirFurnish'] 
        item['detail_dealDate'] = response.meta['detail_dealDate']
        item['detail_floor'] = response.meta['detail_floor']
        item['detail_totalPrice'] = response.meta['detail_totalPrice']
        item['detail_unitPrice'] = response.meta['detail_unitPrice']
        # item['detail_sellpoint'] = response.meta['detail_sellpoint']
        item['detail_listPrice'] = response.meta['detail_listPrice']
        if len(item['detail_listPrice']) == 0:
            item['detail_listPrice'] = self.eleMissing(response.xpath('//section[@class="wrapper"]//div[@class="msg"]/span[1]/label/text()').extract_first())
        item['aprt_dealCycle'] = response.meta['aprt_dealCycle']
        # Not all aprt_agent_id exist
        item['detail_agentName'] = response.meta['detail_agent_name']
        item['detail_agentId'] = response.meta['detail_agent_id']        
        item['detail_dealInfo'] = self.eleMissing(response.xpath('//div[@class="wrapper"]/text()').extract_first()) + self.eleMissing(response.xpath('//div[@class="wrapper"]/span/text()').extract_first())
        item['detail_dealBread'] = self.eleMissing(self.strJoin(response.xpath('//section[@class="wrapper"]/div[@class="deal-bread"]/a/text()').extract()))
        item['detail_priceChangeTimes'] = self.eleMissing(response.xpath('//section[@class="wrapper"]//div[@class="msg"]/span[3]/label/text()').extract_first())
        item['detail_visitTimes'] = self.eleMissing(response.xpath('//section[@class="wrapper"]//div[@class="msg"]/span[4]/label/text()').extract_first())
        item['detail_followers'] = self.eleMissing(response.xpath('//section[@class="wrapper"]//div[@class="msg"]/span[5]/label/text()').extract_first())
        item['detail_viewTimes'] = self.eleMissing(response.xpath('//section[@class="wrapper"]//div[@class="msg"]/span[6]/label/text()').extract_first())
        basic_info_names = self.stripList(response.xpath('//section[@class="houseContentBox"]//div[@class="base"]/div[@class="content"]/ul/li/span/text()').extract())
        basic_info_values = self.stripList(response.xpath('//section[@class="houseContentBox"]//div[@class="base"]/div[@class="content"]/ul/li/text()').extract())
        item['detail_basic_info'] = dict(zip(basic_info_names, basic_info_values))
        transaction_info_names = self.stripList(response.xpath('//div[@class="transaction"]//div[@class="content"]/ul/li/span/text()').extract())
        transaction_info_values = self.stripList(response.xpath('//div[@class="transaction"]//div[@class="content"]/ul/li/text()').extract())        
        item['detail_transaction_info'] = dict(zip(transaction_info_names, transaction_info_values))   
        item['detail_transaction_history'] = self.eleMissing(self.strJoin(response.xpath('//*[@id="chengjiao_record"]/ul/li//text()').extract()))       
        # item['community_name'] = self.eleMissing(response.xpath('//*[@id="resblockCardContainer"]/div[@class="newwrap"]/div[@class="xiaoquCard"]/div[@class="xiaoqu_header clear"]/h3/span/text()').extract_first())[:-2]
        item['community_name'] = item['detail_title'].split(' ')[0]
        # item['community_url'] = response.xpath('//*[@id="resblockCardContainer"]/div[@class="newwrap"]/div[@class="xiaoquCard"]/div[@class="xiaoqu_header clear"]/a/@href').extract_first()
        pattern_url = re.compile(r'https://bj.lianjia.com/chengjiao/c\d+')
        item['community_url'] = self.eleMissing(re.search(pattern_url, response.text)[0])
        community_info_label = response.xpath('//*[@id="resblockCardContainer"]/div[@class="newwrap"]/div[@class="xiaoquCard"]/div[@class="xiaoqu_content clear"]/div[@class="xiaoqu_main fl"]/div/label/text()').extract()
        community_info_value = response.xpath('//*[@id="resblockCardContainer"]/div[@class="newwrap"]/div[@class="xiaoquCard"]/div[@class="xiaoqu_content clear"]/div[@class="xiaoqu_main fl"]/div/span/text()').extract()
        item['community_info'] = dict(zip(self.stripList(community_info_label), self.stripList(community_info_value)))
        feature_label = self.eleMissing(response.xpath('//*[@id="house_feature"]/div[@class="introContent showbasemore"]/div/div[@class="name"]/text()').extract())
        feature_value = self.eleMissing(response.xpath('//*[@id="house_feature"]/div[@class="introContent showbasemore"]/div/div[@class="content"]/a/text()').extract())
        item['detail_features'] = dict(zip(self.stripList(feature_label), self.stripList(feature_value)))
        # positionInfo: 
        pattern_pos = re.compile(r"resblockPosition:'\d+.\d+,\d+.\d+")
        pos_ = re.search(pattern_pos, response.text)[0]
        item['resblockPosition'] = self.eleMissing(re.search(r'\d+.\d+,\d+.\d+', pos_)[0])
        # city_id:
        pattern_cityId = re.compile(r"city_id: '\d+")
        cityId_ = re.search(pattern_cityId, response.text)[0]
        item['city_id'] = self.eleMissing(re.search(r'\d+', cityId_)[0])
        # city_name
        pattern_cityName = re.compile(r"city_name: '.*'")
        item['city_name'] = self.eleMissing((re.search(pattern_cityName, response.text)[0]))
        # resblockId
        pattern_resblockId = re.compile(r"resblockId:'\d+'")
        resblockId_ = re.search(pattern_resblockId, response.text)[0]
        item['resblockId'] = self.eleMissing(re.search(r'\d+', resblockId_)[0])
        yield item
    def strJoin(self, element_list):
        return ''.join(element_list)
    def eleMissing(self, element):
        if element is None:
            return ""
        else:
            return element
    def stripList(self, eleList):
        return [i.strip() for i in eleList]
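
The URL-construction script mentioned in the first bullet point isn't shown in the post; below is a minimal sketch of the idea. The district slugs and the exact meaning of the p/l/a filter codes are assumptions inferred from the example URLs above (l3a3p4 for Haidian, 3–4 million, three bedrooms, 70–90 ㎡; pg1p2l2a2 in the sample record), so verify them against the live filter pages before use.

# build_start_urls.py -- hypothetical helper, not the author's original script
import itertools
import redis

DISTRICTS = ['dongcheng', 'xicheng', 'chaoyang', 'haidian']   # partial list, for illustration
PRICE_CODES = [f'p{i}' for i in range(1, 9)]     # price bands (assumed range)
LAYOUT_CODES = [f'l{i}' for i in range(1, 7)]    # number of bedrooms (assumed range)
AREA_CODES = [f'a{i}' for i in range(1, 9)]      # floor-area bands (assumed range)

start_urls = [
    f'https://bj.lianjia.com/chengjiao/{d}/pg1{p}{l}{a}/'
    for d, p, l, a in itertools.product(DISTRICTS, PRICE_CODES, LAYOUT_CODES, AREA_CODES)
]

r = redis.Redis(host='localhost', port=6379, db=0)   # adjust to your Redis instance
r.lpush('lianjia:start_urls', *start_urls)           # matches the spider's redis_key
print(f'pushed {len(start_urls)} start urls')

The point of slicing by district, price, layout, and area is that each combination stays well under the 100-page cap, so the spider's own pagination can cover everything inside a slice.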

Full-Site Crawler for Suning.com (Distributed Scraping with scrapy-redis, Stored in MongoDB)

Capture the mobile API with Fiddler and crawl every product on Suning.com at high speed without putting much strain on their servers.

Over the past few days I crawled both JD.com and Suning.com in full: roughly 4.5 million SKUs for JD and over 2.8 million for Suning. A few items inevitably slip through, but certainly not many. Both crawls use the mobile app APIs, which Fiddler can capture; the only thing to watch is whether the URLs you construct are sensible, otherwise it's easy to miss a lot of SKUs.

I remember a JD vice president saying JD carries over 5 million SKUs, which roughly matches my crawl once you account for program error rates and anti-scraping losses. JD bans IPs aggressively — crawling its whole site would probably take on the order of a hundred thousand IPs — whereas Suning is comparatively easy: every product can be crawled in a single day.

Let me start by writing up the Suning crawl.

Goal: crawl every product on the site through the app API.

Fiddler makes it easy to capture app APIs. Root the phone, put it on the same Wi-Fi as the PC, and set a proxy in the phone's Wi-Fi settings: the host is the PC's LAN IP (look it up with ipconfig /all) and the port is 8888, Fiddler's default. One caveat: if Jupyter Notebook is running on the PC, make sure it is not occupying port 8888 — jupyter notebook --port=7777 starts it on a different port.

For the details of using Fiddler, just Google it; I've covered it on this blog before as well. The endpoint captured here:

https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword=空调免息&st=0&ci=&cf=&sc=&cityId=459&cp=26&iv=-1&ct=-1&sp=&spf=&prune=1&cat1_大家电_cat2_空调_cat3_空调免息
Note that this example searches by keyword, but you can also filter with the ci category parameter. Not every third-level category has a category code, though — some can only be reached by keyword.

The key to the Suning crawler is capturing the API and constructing the URLs. Here's the code:

import json
import pprint
import re
import redis
from urllib import parse

cat3_urls = []
pp = pprint.PrettyPrinter(indent=4)
with open(r'E:\splash\suning\suning_cat.json', 'r') as f:
    json_dict = json.load(f)    
#     pp.pprint(json_dict['rs'])
    print('num of cat1 categories:', len(json_dict['rs']))
    
    for i in range(1, len(json_dict['rs'])-3):
        category_1_name = json_dict['rs'][i]['dirName']
        print('category_1: ', category_1_name)
        print('category_1_id: ', json_dict['rs'][i]['id'])
        print(f'num of cat2 categories under {category_1_name}:', len(json_dict['rs'][i]['children']))
        for j in range(len(json_dict['rs'][i]['children'])):
            try:
#                 print('name of category_2:', json_dict['rs'][i]['children'][j])
                category_2_name = json_dict['rs'][i]['children'][j]['dirName']
                print('category_2_id:', json_dict['rs'][i]['children'][j]['id'])
                num_cat3 = len(json_dict['rs'][i]['children'][j]['children'])
                print(f'num of categories in category_2_{category_2_name}:', num_cat3)  
                print(f'num_{j} cat2_{category_2_name} under category_1_{category_1_name}: ')
                pp.pprint(json_dict['rs'][i]['children'][j]['children'])
                # pp.pprint(json_dict['rs'][i]['children'][j]['children'])
                print('\n\ncategory_3:')
                for k in range(0, len(json_dict['rs'][i]['children'][j]['children'])):
                    print(f'num{k} cat3 under cat1_{category_1_name}_cat2_{category_2_name}')
                    cat_3 = json_dict['rs'][i]['children'][j]['children'][k]
                    category_3_name = cat_3['dirName']
                    category_3_id = cat_3['id']
                    try:
                        category_3_pcCi = cat_3['pcCi']
                        if '图书音像' not in category_1_name:
                            
                            cat3_url = 'https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword=&st=0&ci=' + category_3_pcCi + '&cf=&sc=&cityId=459&cp=0&iv=-1&ct=-1&sp=&spf=&prune=1'+ f'&cat1_{parse.quote(category_1_name)}_cat2_{parse.quote(category_2_name)}_cat3_{parse.quote(category_3_name)}'
                        else:
                            # the '图书音像' (Books & Media) top-level category is special and cannot use pcCi
                            cat3_url = 'https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword=' + parse.quote(category_3_name) + '&st=0&ci=&cf=&sc=&cityId=459&cp=0&iv=-1&ct=-1&sp=&spf=&prune=1' + f'&cat1_{parse.quote(category_1_name)}_cat2_{parse.quote(category_2_name)}_cat3_{parse.quote(category_3_name)}' 
                    except KeyError:
                        category_3__gotoApp = cat_3['gotoApp']
                        if 'search' in category_3__gotoApp:
                            print(f'cat3_{category_3_name} search by keyword')
                            cat3_url = 'https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword=' + parse.quote(category_3_name) + '&st=0&ci=&cf=&sc=&cityId=459&cp=0&iv=-1&ct=-1&sp=&spf=&prune=1' + f'&cat1_{parse.quote(category_1_name)}_cat2_{parse.quote(category_2_name)}_cat3_{parse.quote(category_3_name)}' 
                        elif '100020_' in category_3__gotoApp:
                            pattern = r'100020_\d+'
                            adId = re.search(pattern, category_3__gotoApp)[0]
                            print(adId)
                            if adId:
                                ci = adId.split('_')[1]
                                cat3_url = 'https://ebuy.suning.com/mobile/clientSearch?ch=' + '100020' + '&iv=-1&keyword=&cityId=459&ci=' + ci + '&cp=0&ps=120&st=0&cf=&ct=-1&sp=&v=1.6'  + '&cf=&sc=&cityId=459&cp=0&iv=-1&ct=-1&sp=&spf=&prune=1' + f'&cat1_{parse.quote(category_1_name)}_cat2_{parse.quote(category_2_name)}_cat3_{parse.quote(category_3_name)}'
                            else:
                                cat3_url = 'https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword=' + parse.quote(category_3_name) + '&cf=&sc=0&cityId=459&ci=' + '&cf=&sc=0&cityId=459&cp=0&iv=-1&ct=-1&sp=&spf=&prune=1' + f'&cat1_{parse.quote(category_1_name)}_cat2_{parse.quote(category_2_name)}_cat3_{parse.quote(category_3_name)}'                       
                                
                        else:
                            print('#'*100, '----Warning!')
                            cat3_url = 'https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword=' + parse.quote(category_3_name) + '&st=0&ci=&cf=&sc=&cityId=459&cp=0&iv=-1&ct=-1&sp=&spf=&prune=1' + f'&cat1_{parse.quote(category_1_name)}_cat2_{parse.quote(category_2_name)}_cat3_{parse.quote(category_3_name)}' 
                    finally:
                        if 'cat2_品牌出版社_cat3_' not in parse.unquote(cat3_url):
                            print(cat3_url)
                            cat3_urls.append(cat3_url)
#                     else:
#                         print('#'*100, cat_3)
                    pp.pprint(cat_3)
                    print('\n\n')
                
            except Exception as e:
                print(e, i, j, category_2_name )

r = redis.Redis(host=REDIS_HOST, port=7379, db=0, password=REDIS_PASSWD, encoding='utf-8', decode_responses=True)
set_redis = set(cat3_urls)   # deduplicate before pushing to Redis
r.lpush('suning:start_urls', *set_redis)                
print(f'num of cat3_urls: {len(cat3_urls)}')            
for i in cat3_urls:
    if 'cat2_品牌出版社_cat3_' not in parse.unquote(i):
        print(parse.unquote(i))

The category JSON file can be grabbed from another endpoint found with the same packet capture; the code is below:

import requests
from requests.exceptions import ConnectionError, ReadTimeout
import pprint
import json

url = 'https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword=11513369305'
cat_url = 'https://ds.suning.com/ds/terminal/categoryInfo/v1/99999998-.jsonp'
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
pp = pprint.PrettyPrinter(indent=4)
# try:
#     response = requests.get(url, timeout=10, headers=header)
#     if response.status_code == 200:
#         print('yes connected')
# #         print(response.text)
#         json_dict = response.json()
#         pp.pprint(json_dict['goods'])
        
# except (ConnectionError, ReadTimeout):
#     print('connect failed')
    
try:
    response = requests.get(cat_url, timeout=10, headers=header)
    if response.status_code == 200:
        print('yes connected')
#         print(response.text)
        json_dict = response.json()
        pp.pprint(json_dict['rs'])
        with open(r'E:\splash\suning\suning_cat.json', 'w') as f:
            json.dump(json_dict, f, ensure_ascii=False, indent=4, sort_keys=True)        
except (ConnectionError, ReadTimeout):
    print('connect failed')    

Once the URLs are built, lpush them into Redis and the distributed crawl is ready to go.

Tools: scrapy-redis, MongoDB, a self-built proxy pool on ADSL VPS, and Gerapy for distributed crawler management and monitoring.
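
The post doesn't show the MongoDB side, so here is a minimal sketch of what the item pipeline could look like, assuming pymongo and the suning.product collection visible in the sample output further down; the class name, settings keys, and URI are placeholders rather than the author's actual code.

# pipelines.py -- hypothetical MongoDB pipeline
import pymongo

class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'suning'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # _id is the SKU id, so upserting avoids the E11000 duplicate-key
        # errors that plain inserts produce (visible in the sample output below)
        self.db['product'].replace_one({'_id': item['_id']}, dict(item), upsert=True)
        return item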

Without further ado, here's the key code:

# -*- coding: utf-8 -*-
import scrapy
import json
from scrapy_redis.spiders import RedisSpider
try:
    from urllib.parse import urlparse
except ImportError:
    from urlparse import urlparse
from suning.log import logger
from suning.items import SuningItem
import re
from urllib import parse

class SuningspdSpider(RedisSpider):
    name = 'suningspd'
    allowed_domains = ['suning.com']
    redis_key = 'suning:start_urls'
    
    def __init__(self, *args, **kwargs):
        super(SuningspdSpider, self).__init__(*args, **kwargs)
        
    def parse(self, response):
        url = response.url
        items = []
        if len(response.text) < 50 or response.status==402:
            logger.info(f'返回response为空,可能是最后一页或者超时了: {response.url}')
        else:    
            goods = json.loads(response.text)['goods']
            logger.info(f'goods: {parse.unquote(url)}')
            goodsCount = json.loads(response.text)["goodsCount"]
            logger.info(f'goodsCount: {goodsCount}')
            # this API returns at most 50 pages of 120 items each, the same limit as the PC site
            max_page = min(int(goodsCount/120) + 1, 50)
            for good in goods:
                item = SuningItem()
                # logger.info(f'good: {good}\n\n')
                good_id = good['partnumberVendorId']
                good_url = 'https://product.suning.com/' + good_id.split('_')[1] +'/' + good_id.split('_')[0]  + '.html'
                # logger.info(good_id)
                item['_id'] = good_id
                item['good_sku_id'] = good_id
                item['good_name'] = good["catentdesc"]
                item['url'] = good_url
                item['cat1'] = url.split('&')[-1].split('_')[1]
                item['cat2'] = url.split('&')[-1].split('_')[3]
                item['cat3'] = url.split('&')[-1].split('_')[5]
                item['good_info'] = json.dumps(good, ensure_ascii=False)
                # logger.info(f'check item: {item}') 
                items.append(item)
                logger.warning(f'check num of items: {len(items)}')
                yield item
            page = re.search(r'cp=\d+', url)[0].split('=')[1]
            if int(page) < max_page-1:            
                next_page = int(page) + 1
                logger.info(f'总共{max_page}页,还有下一页。继续采集第{next_page}页')
                next_url = re.sub(f'cp={page}', f'cp={next_page}', url)
                logger.info(f'next_url: {parse.unquote(next_url)}')
                # yield response.follow(next_url, self.parse)
                yield scrapy.Request(url=next_url, callback=self.parse, dont_filter=False)   
      

Even the mobile API has the same limits: 120 items per page and 50 pages per category, so a popular category yields at most 6,000 SKUs. I worked around this by using fine-grained categories, 2,397 of them in total, so not much slipped through. With one server plus six or seven ADSL VPS dial-up proxy servers, the whole site can be crawled in a day; in the end I captured 2,843,855 SKUs. A fair number were certainly still missed, since I didn't go to the PC site and enumerate every SKU by fine-grained category. If you need every single SKU, you have to walk the PC site by fine-grained category plus brand — that's how I crawled JD's 4.5 million SKUs.

The SKU information obtained looks like this:

 'good_name': '宏碁Acer 暗影骑士·擎 15.6英寸电竞游戏本RGB背光学生笔记本电脑(i7-10750H 32G 1TB+1TBSSD '
              'GTX1660Ti 6G 144Hz)定制',
 'good_sku_id': '11961839381_0070174045',
 'url': 'https://product.suning.com/0070174045/11961839381.html'}}, {'index': 33, 'code': 11000, 'keyPattern': {'_id': 1}, 'keyValue': {'_id': '12200322581_0070640451'}, 'errmsg': 'E11000 duplicate key error collection: suning.product index: _id_ dup key: { _id: "12200322581_0070640451" }', 'op': {'_id': '12200322581_0070640451',
 'cat1': '%E4%BA%8C%E6%89%8B%E4%BC%98%E5%93%81',
 'cat2': '%E7%94%B5%E8%84%91%E5%8A%9E%E5%85%AC',
 'cat3': '%E7%AC%94%E8%AE%B0%E6%9C%AC',
 'good_info': '{"inventory": "0", "auxdescription": "北京 上海 南京 郑州 '
              '武汉都有库房,就近安排,即送大礼包,详情请咨询客服", "catentdesc": "联想ThinkPad T14 15CD '
              '英特尔酷睿i5 14英寸轻薄笔记本电脑(i5-10210U 32G 1TSSD固态 Win10)高分屏 红外摄像头 定制", '
              '"catentryId": "12200322581", "countOfarticle": "0", '
              '"partnumber": "12200322581", "totalCount": "0", "price": '
              '"7699.0", "brandId": "000052450", "saleStatus": 0, '
              '"contractInfos": "", "snFlag": false, "author": "", '
              '"suningSale": false, "praiseRate": "", "salesCode": '
              '"0070640451", "salesName": "麦田电脑旗舰店", "salesCode10": '
              '"0070640451", "beancurdFlag": "0", "goodsType": "1", '
              '"priceType": "2", "filters": [{}, {}, {}], "filterAttr": true, '
              '"isFav": true, "hwgLable": false, "spsLable": false, '
              '"baoguangHwg": "0", "dynamicImg": '
              '"//imgservice4.suning.cn/uimg1/b2c/image/Ews79nBjIFzU-qZ2hLdLoA.jpg", '
              '"docType": 1, "specificUrl": "http://18181818.suning.com", '
              '"salesUrl": "", "threeGroupId": "258004@507196", '
              '"threeGroupName": "创意设计笔记本,笔记本", "priority": "2", "orderType": '
              '"0", "snpmDity": "2", "shortBrandId": "2450", "isNew": "1", '
              '"partnumberVendorId": "12200322581_0070640451", "msList": [], '
              '"salesCount": "0", "jlfDirGroupIdList": [], "xdGroupIDCopy": '
              '[], "mdGoodType": "", "catalog": '
              '"NPC,ALL,NO,XDALL,10051,SNZB,NGD,ftzm,fnaep,wx001,vedio,wx002,XDC,wx003,NH", '
              '"isSnPharmacy": false, "extenalFileds": {"lpg_activeId": '
              '"null", "brandNameZh": "ThinkPad", "detailsUrl": '
              '"https://m4.pptvyun.com/pvod/e11a0/cjKfRo_ggIlgD6XsDEgSWDXrrCs/eyJkbCI6MTU5MDQ3OTg4NywiZXMiOjYwNDgwMCwiaWQiOiIwYTJkb3FpZHBLZWRucS1MNEsyZG9hZmhvNmljb0thY29hayIsInYiOiIxLjAifQ/0a2doqidpKednq-L4K2doafho6icoKacoak.mp4", '
              '"ZYLY": "0", "goodType": "Z001", "attrShow": [{"attrAppDesc": '
              '"", "attrAppTrueValue": "集成显卡", "attrAppValue": "集成显卡", '
              '"attrId": "solr_1855_attrId", "attrName": "显卡类型", "attrValue": '
              '"集成显卡", "attrValueId": "24087", "sort": "92.0"}, '
              '{"attrAppDesc": "", "attrAppTrueValue": "Intel i5", '
              '"attrAppValue": "Intel i5", "attrId": "solr_6160_attrId", '
              '"attrName": "CPU类型", "attrValue": "Intel i5", "attrValueId": '
              '"attrAppValue": "Intel i5", "attrId": "solr_6160_attrId", '
              '"attrName": "CPU类型", "attrValue": "Intel i5", "attrValueId": '
              '"45969", "sort": "91.0"}, {"attrAppDesc": "", '
              '"attrAppTrueValue": "1TB", "attrAppValue": "1TB", "attrId": '
              '"solr_2088_attrId", "attrName": "硬盘容量", "attrValue": "1TB", '
              '"attrValueId": "2280366", "sort": "0.0"}], "mdmGroupId": '
              '"R1502001", "appAttrTitle": ["商务办公", "轻薄便捷", "红外摄像头"], '
              '"groupIDCopy": ["157122:258003:258004", '
              '"157122:258003:507196"], "specificUrl": '
              '"http://18181818.suning.com", "groupIDCombination": '
              '["157122@A@电脑/办公/外设:258003@A@电脑整机:258004@A@笔记本", '
              '"157122@A@电脑/办公/外设:258003@A@电脑整机:507196@A@创意设计笔记本"], '
              '"activationFlag": "1", "commentShow": "0", "auxdescription": '
              '"北京 上海 南京 郑州 武汉都有库房,就近安排,即送大礼包,详情请咨询客服", "paramValue": "T14"}}',

If you also want review data, that's simple too — two steps:

  • Capture the review API from the mobile app (or even the PC site) — very straightforward.
  • Walk every SKU on the PC site to fetch its clusterId. This parameter is required and is generated by JS; if you have the time to study their JS you could probably work out how it's generated, but Suning's anti-scraping is weak, so it's easier to just request each product page — the returned response.text contains a script with this keyword, and a regex pulls it out (see the sketch below). My server's disk is nearly full, so the review crawl will have to wait a while; reviews are an enormous amount of data and take up a lot of space.
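
A minimal sketch of that extraction step, assuming the value is embedded in the inline script as clusterId followed by a number — the exact formatting on a live product page may differ, so adjust the regex after inspecting one:

# get_cluster_id.py -- hypothetical helper, not the author's code
import re
import requests

HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}

def get_cluster_id(product_url):
    """Fetch a PC product page and pull clusterId out of its inline script."""
    html = requests.get(product_url, headers=HEADERS, timeout=10).text
    # assumed pattern: clusterId: "123456" or clusterId = 123456 -- verify on a real page
    match = re.search(r'clusterId["\']?\s*[:=]\s*["\']?(\d+)', html)
    return match.group(1) if match else None

# example: a product URL built from partnumberVendorId, as in the spider above
print(get_cluster_id('https://product.suning.com/0070640451/12200322581.html'))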