Suning.com (苏宁易购) full-site crawler: distributed scraping with scrapy-redis, stored in MongoDB

Capture the mobile app API with Fiddler and crawl every product on Suning.com at high speed without putting much load on their servers.

Over the past few days I crawled all of JD.com and Suning.com: roughly 4.5 million SKUs on JD and over 2.8 million on Suning. A few items inevitably slipped through, but certainly not many. Both sites were scraped through their mobile app APIs, which Fiddler can capture; the only thing to watch is whether the URLs you construct are reasonable, otherwise it is easy to miss a lot of SKUs.

I remember a JD vice president saying JD carries more than 5 million SKUs, which roughly matches my crawl once you allow for program errors and anti-scraping. JD bans IPs aggressively; crawling its whole site probably needs on the order of 100,000 IPs. Suning is comparatively easy, and all of its products can be crawled in a day.

Let me first write down the Suning crawling process.

Goal: crawl every product on the site through the app API.

Fiddler makes it easy to capture app APIs. With the phone rooted and on the same Wi-Fi as the PC, set a proxy in the phone's Wi-Fi settings: the host is the PC's LAN IP (check it with ipconfig /all) and the port is 8888, Fiddler's default. One caveat: if Jupyter Notebook is running on the PC, make sure it is not occupying port 8888; start it with jupyter notebook --port=7777 to move it to another port.
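
Before configuring the phone, you can quickly confirm that Fiddler's proxy is reachable from the LAN by routing a desktop request through it. A minimal sketch, assuming the PC's LAN IP is 192.168.1.100 (replace with your own):

import requests

# Send a request through Fiddler's proxy; it should show up in Fiddler's session list.
# 192.168.1.100 is a placeholder for the PC's LAN IP reported by ipconfig /all.
proxies = {
    'http': 'http://192.168.1.100:8888',
    'https': 'http://192.168.1.100:8888',
}
resp = requests.get('https://www.suning.com', proxies=proxies, timeout=10, verify=False)
print(resp.status_code)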

How to use Fiddler in detail is easy to Google, and I have covered it on this blog before. The API captured here is:

https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword=空调免息&st=0&ci=&cf=&sc=&cityId=459&cp=26&iv=-1&ct=-1&sp=&spf=&prune=1&cat1_大家电_cat2_空调_cat3_空调免息
Note that this request searches by keyword; you can also filter by the ci category parameter instead. Not every third-level category has a category code, though, so some can only be queried by keyword.

The key to the Suning crawler is capturing the API and constructing the URLs. Here is the code:

import json
import pprint
import re
import redis
from urllib import parse

cat3_urls = []
pp = pprint.PrettyPrinter(indent=4)
with open(r'E:\splash\suning\suning_cat.json', 'r') as f:
    json_dict = json.load(f)    
#     pp.pprint(json_dict['rs'])
    print('num of cat1 categories:', len(json_dict['rs']))
    
    # Skip the first entry and the last three (presumably non-product categories)
    for i in range(1, len(json_dict['rs'])-3):
        category_1_name = json_dict['rs'][i]['dirName']
        print('category_1: ', category_1_name)
        print('category_1_id: ', json_dict['rs'][i]['id'])
        print(f'num of cat2 categories under {category_1_name}:', len(json_dict['rs'][i]['children']))
        for j in range(len(json_dict['rs'][i]['children'])):
            try:
#                 print('name of category_2:', json_dict['rs'][i]['children'][j])
                category_2_name = json_dict['rs'][i]['children'][j]['dirName']
                print('category_2_id:', json_dict['rs'][i]['children'][j]['id'])
                num_cat3 = len(json_dict['rs'][i]['children'][j]['children'])
                print(f'num of categories in category_2_{category_2_name}:', num_cat3)  
                print(f'num_{j} cat2_{category_2_name} under category_1_{category_1_name}: ')
                pp.pprint(json_dict['rs'][i]['children'][j]['children'])
                # pp.pprint(json_dict['rs'][i]['children'][j]['children'])
                print('\n\ncategory_3:')
                for k in range(0, len(json_dict['rs'][i]['children'][j]['children'])):
                    print(f'num{k} cat3 under cat1_{category_1_name}_cat2_{category_2_name}')
                    cat_3 = json_dict['rs'][i]['children'][j]['children'][k]
                    category_3_name = cat_3['dirName']
                    category_3_id = cat_3['id']
                    try:
                        category_3_pcCi = cat_3['pcCi']
                        if '图书音像' not in category_1_name:
                            
                            cat3_url = 'https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword=&st=0&ci=' + category_3_pcCi + '&cf=&sc=&cityId=459&cp=0&iv=-1&ct=-1&sp=&spf=&prune=1'+ f'&cat1_{parse.quote(category_1_name)}_cat2_{parse.quote(category_2_name)}_cat3_{parse.quote(category_3_name)}'
                        else:
                            # The '图书音像' (books & media) top-level category is special and cannot use pcCi
                            cat3_url = 'https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword=' + parse.quote(category_3_name) + '&st=0&ci=&cf=&sc=&cityId=459&cp=0&iv=-1&ct=-1&sp=&spf=&prune=1' + f'&cat1_{parse.quote(category_1_name)}_cat2_{parse.quote(category_2_name)}_cat3_{parse.quote(category_3_name)}' 
                    except KeyError:
                        category_3__gotoApp = cat_3['gotoApp']
                        if 'search' in category_3__gotoApp:
                            print(f'cat3_{category_3_name} search by keyword')
                            cat3_url = 'https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword=' + parse.quote(category_3_name) + '&st=0&ci=&cf=&sc=&cityId=459&cp=0&iv=-1&ct=-1&sp=&spf=&prune=1' + f'&cat1_{parse.quote(category_1_name)}_cat2_{parse.quote(category_2_name)}_cat3_{parse.quote(category_3_name)}' 
                        elif '100020_' in category_3__gotoApp:
                            pattern = r'100020_\d+'
                            adId = re.search(pattern, category_3__gotoApp)[0]
                            print(adId)
                            if adId:
                                ci = adId.split('_')[1]
                                cat3_url = 'https://ebuy.suning.com/mobile/clientSearch?ch=' + '100020' + '&iv=-1&keyword=&cityId=459&ci=' + ci + '&cp=0&ps=120&st=0&cf=&ct=-1&sp=&v=1.6'  + '&cf=&sc=&cityId=459&cp=0&iv=-1&ct=-1&sp=&spf=&prune=1' + f'&cat1_{parse.quote(category_1_name)}_cat2_{parse.quote(category_2_name)}_cat3_{parse.quote(category_3_name)}'
                            else:
                                cat3_url = 'https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword=' + parse.quote(category_3_name) + '&cf=&sc=0&cityId=459&ci=' + '&cf=&sc=0&cityId=459&cp=0&iv=-1&ct=-1&sp=&spf=&prune=1' + f'&cat1_{parse.quote(category_1_name)}_cat2_{parse.quote(category_2_name)}_cat3_{parse.quote(category_3_name)}'                       
                                
                        else:
                            print('#'*100, '----Warning!')
                            cat3_url = 'https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword=' + parse.quote(category_3_name) + '&st=0&ci=&cf=&sc=&cityId=459&cp=0&iv=-1&ct=-1&sp=&spf=&prune=1' + f'&cat1_{parse.quote(category_1_name)}_cat2_{parse.quote(category_2_name)}_cat3_{parse.quote(category_3_name)}' 
                    finally:
                        if 'cat2_品牌出版社_cat3_' not in parse.unquote(cat3_url):
                            print(cat3_url)
                            cat3_urls.append(cat3_url)
#                     else:
#                         print('#'*100, cat_3)
                    pp.pprint(cat_3)
                    print('\n\n')
                
            except Exception as e:
                print(e, i, j, category_2_name )

# REDIS_HOST and REDIS_PASSWD are defined elsewhere (not shown here)
r = redis.Redis(host=REDIS_HOST, port=7379, db=0, password=REDIS_PASSWD, encoding='utf-8', decode_responses=True)
set_redis = set(cat3_urls)  # dedupe before pushing to Redis
r.lpush('suning:start_urls', *set_redis)                
print(f'num of cat3_urls: {len(cat3_urls)}')            
for i in cat3_urls:
    if 'cat2_品牌出版社_cat3_' not in parse.unquote(i):
        print(parse.unquote(i))
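
Once the URLs are pushed, you can sanity-check the Redis list with the same connection object before starting the spiders:

# Peek at the queue that was just pushed (same connection as above)
print(r.llen('suning:start_urls'))          # number of start URLs waiting in the list
print(r.lrange('suning:start_urls', 0, 2))  # first few entries, still URL-encoded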

The category JSON file itself can also be obtained by capturing the API; the code is as follows:

import requests
from requests.exceptions import ConnectionError, ReadTimeout
import pprint
import json

url = 'https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword=11513369305'
cat_url = 'https://ds.suning.com/ds/terminal/categoryInfo/v1/99999998-.jsonp'
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
pp = pprint.PrettyPrinter(indent=4)
# try:
#     response = requests.get(url, timeout=10, headers=header)
#     if response.status_code == 200:
#         print('yes connected')
# #         print(response.text)
#         json_dict = response.json()
#         pp.pprint(json_dict['goods'])
        
# except (ConnectionError, ReadTimeout):
#     print('connect failed')
    
try:
    response = requests.get(cat_url, timeout=10, headers=header)
    if response.status_code == 200:
        print('yes connected')
#         print(response.text)
        json_dict = response.json()
        pp.pprint(json_dict['rs'])
        with open(r'E:\splash\suning\suning_cat.json', 'w') as f:
            json.dump(json_dict, f, ensure_ascii=False, indent=4, sort_keys=True)        
except (ConnectionError, ReadTimeout):
    print('connect failed')    

Once the URLs are constructed, lpush them to Redis and the distributed crawl can start.
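
On the scrapy-redis side only a few settings are needed so that every worker pulls its start URLs and requests from that Redis list. A minimal settings.py sketch; the Redis URL and the pipeline name are placeholders of mine, not from the original project:

# settings.py (excerpt) -- minimal scrapy-redis wiring
SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # schedule requests through Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # share request dedup across workers
SCHEDULER_PERSIST = True                                    # keep the queue between runs
REDIS_URL = 'redis://:your_password@your-redis-host:7379/0' # placeholder credentials

ITEM_PIPELINES = {
    'suning.pipelines.MongoPipeline': 300,  # hypothetical pipeline, sketched further below
}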

Tools: scrapy-redis, MongoDB, a self-built proxy pool on ADSL VPSes, and Gerapy for distributed crawler management and monitoring.

Without further ado, here is the key code:

# -*- coding: utf-8 -*-
import scrapy
import json
from scrapy_redis.spiders import RedisSpider
try:
    from urllib.parse import urlparse
except ImportError:
    from urlparse import urlparse
from suning.log import logger
from suning.items import SuningItem
import re
from urllib import parse

class SuningspdSpider(RedisSpider):
    name = 'suningspd'
    allowed_domains = ['suning.com']
    redis_key = 'suning:start_urls'
    
    def __init__(self, *args, **kwargs):
        super(SuningspdSpider, self).__init__(*args, **kwargs)
        
    def parse(self, response):
        url = response.url
        items = []
        if len(response.text) < 50 or response.status == 402:
            logger.info(f'Empty response; probably the last page or a timeout: {response.url}')
        else:    
            goods = json.loads(response.text)['goods']
            logger.info(f'goods: {parse.unquote(url)}')
            goodsCount = json.loads(response.text)["goodsCount"]
            logger.info(f'goodsCount: {goodsCount}')
            # This API returns at most 50 pages of 120 products each, the same limit as the PC site
            max_page = min(int(goodsCount/120) + 1, 50)
            for good in goods:
                item = SuningItem()
                # logger.info(f'good: {good}\n\n')
                good_id = good['partnumberVendorId']
                good_url = 'https://product.suning.com/' + good_id.split('_')[1] +'/' + good_id.split('_')[0]  + '.html'
                # logger.info(good_id)
                item['_id'] = good_id
                item['good_sku_id'] = good_id
                item['good_name'] = good["catentdesc"]
                item['url'] = good_url
                item['cat1'] = url.split('&')[-1].split('_')[1]
                item['cat2'] = url.split('&')[-1].split('_')[3]
                item['cat3'] = url.split('&')[-1].split('_')[5]
                item['good_info'] = json.dumps(good, ensure_ascii=False)
                # logger.info(f'check item: {item}') 
                items.append(item)
                logger.warning(f'check num of items: {len(items)}')  # scrapy Items are generally unhashable, so avoid set(items)
                yield item
            page = re.search(r'cp=\d+', url)[0].split('=')[1]
            if int(page) < max_page-1:            
                next_page = int(page) + 1
                logger.info(f'{max_page} pages in total, there is a next page; continuing with page {next_page}')
                next_url = re.sub(f'cp={page}', f'cp={next_page}', url)
                logger.info(f'next_url: {parse.unquote(next_url)}')
                # yield response.follow(next_url, self.parse)
                yield scrapy.Request(url=next_url, callback=self.parse, dont_filter=False)   
      

Even the mobile API has the same limits as the PC site: 120 products per page and 50 pages per category, so a hot category yields at most 6,000 SKUs. I worked around this by using fine-grained third-level categories, 2,397 in total, so not much slipped through. With one server plus six or seven ADSL VPS dial-up proxy servers, the whole site can be crawled in a day; in the end I collected 2,843,855 SKUs. Plenty were still missed, because I did not go to the PC site and walk every SKU by sub-category. If you need every SKU, you have to traverse the PC site by sub-category plus brand, which is how I crawled JD's 4.5 million SKUs.
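
Storage in MongoDB is handled by a Scrapy item pipeline, which is not shown here; the E11000 duplicate-key errors visible in the dump below come from using the SKU id as the unique _id. A minimal pymongo sketch, with illustrative class and setting names:

# pipelines.py -- minimal MongoDB storage sketch (class/setting names are illustrative)
import pymongo
from pymongo.errors import DuplicateKeyError

class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'),
            mongo_db=crawler.settings.get('MONGO_DB', 'suning'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        try:
            # _id is the SKU id, so re-crawled products raise a duplicate-key error
            # instead of creating duplicate documents
            self.db['product'].insert_one(dict(item))
        except DuplicateKeyError:
            pass
        return item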

The scraped SKU data looks like this (the dump below also captures a pymongo E11000 duplicate-key error from the MongoDB write):

 'good_name': '宏碁Acer 暗影骑士·擎 15.6英寸电竞游戏本RGB背光学生笔记本电脑(i7-10750H 32G 1TB+1TBSSD '
              'GTX1660Ti 6G 144Hz)定制',
 'good_sku_id': '11961839381_0070174045',
 'url': 'https://product.suning.com/0070174045/11961839381.html'}}, {'index': 33, 'code': 11000, 'keyPattern': {'_id': 1}, 'keyValue': {'_id': '12200322581_0070640451'}, 'errmsg': 'E11000 duplicate key error collection: suning.product index: _id_ dup key: { _id: "12200322581_0070640451" }', 'op': {'_id': '12200322581_0070640451',
 'cat1': '%E4%BA%8C%E6%89%8B%E4%BC%98%E5%93%81',
 'cat2': '%E7%94%B5%E8%84%91%E5%8A%9E%E5%85%AC',
 'cat3': '%E7%AC%94%E8%AE%B0%E6%9C%AC',
 'good_info': '{"inventory": "0", "auxdescription": "北京 上海 南京 郑州 '
              '武汉都有库房,就近安排,即送大礼包,详情请咨询客服", "catentdesc": "联想ThinkPad T14 15CD '
              '英特尔酷睿i5 14英寸轻薄笔记本电脑(i5-10210U 32G 1TSSD固态 Win10)高分屏 红外摄像头 定制", '
              '"catentryId": "12200322581", "countOfarticle": "0", '
              '"partnumber": "12200322581", "totalCount": "0", "price": '
              '"7699.0", "brandId": "000052450", "saleStatus": 0, '
              '"contractInfos": "", "snFlag": false, "author": "", '
              '"suningSale": false, "praiseRate": "", "salesCode": '
              '"0070640451", "salesName": "麦田电脑旗舰店", "salesCode10": '
              '"0070640451", "beancurdFlag": "0", "goodsType": "1", '
              '"priceType": "2", "filters": [{}, {}, {}], "filterAttr": true, '
              '"isFav": true, "hwgLable": false, "spsLable": false, '
              '"baoguangHwg": "0", "dynamicImg": '
              '"//imgservice4.suning.cn/uimg1/b2c/image/Ews79nBjIFzU-qZ2hLdLoA.jpg", '
              '"docType": 1, "specificUrl": "http://18181818.suning.com", '
              '"salesUrl": "", "threeGroupId": "258004@507196", '
              '"threeGroupName": "创意设计笔记本,笔记本", "priority": "2", "orderType": '
              '"0", "snpmDity": "2", "shortBrandId": "2450", "isNew": "1", '
              '"partnumberVendorId": "12200322581_0070640451", "msList": [], '
              '"salesCount": "0", "jlfDirGroupIdList": [], "xdGroupIDCopy": '
              '[], "mdGoodType": "", "catalog": '
              '"NPC,ALL,NO,XDALL,10051,SNZB,NGD,ftzm,fnaep,wx001,vedio,wx002,XDC,wx003,NH", '
              '"isSnPharmacy": false, "extenalFileds": {"lpg_activeId": '
              '"null", "brandNameZh": "ThinkPad", "detailsUrl": '
              '"https://m4.pptvyun.com/pvod/e11a0/cjKfRo_ggIlgD6XsDEgSWDXrrCs/eyJkbCI6MTU5MDQ3OTg4NywiZXMiOjYwNDgwMCwiaWQiOiIwYTJkb3FpZHBLZWRucS1MNEsyZG9hZmhvNmljb0thY29hayIsInYiOiIxLjAifQ/0a2doqidpKednq-L4K2doafho6icoKacoak.mp4", '
              '"ZYLY": "0", "goodType": "Z001", "attrShow": [{"attrAppDesc": '
              '"", "attrAppTrueValue": "集成显卡", "attrAppValue": "集成显卡", '
              '"attrId": "solr_1855_attrId", "attrName": "显卡类型", "attrValue": '
              '"集成显卡", "attrValueId": "24087", "sort": "92.0"}, '
              '{"attrAppDesc": "", "attrAppTrueValue": "Intel i5", '
              '"attrAppValue": "Intel i5", "attrId": "solr_6160_attrId", '
              '"attrName": "CPU类型", "attrValue": "Intel i5", "attrValueId": '
              '"attrAppValue": "Intel i5", "attrId": "solr_6160_attrId", '
              '"attrName": "CPU类型", "attrValue": "Intel i5", "attrValueId": '
              '"45969", "sort": "91.0"}, {"attrAppDesc": "", '
              '"attrAppTrueValue": "1TB", "attrAppValue": "1TB", "attrId": '
              '"solr_2088_attrId", "attrName": "硬盘容量", "attrValue": "1TB", '
              '"attrValueId": "2280366", "sort": "0.0"}], "mdmGroupId": '
              '"R1502001", "appAttrTitle": ["商务办公", "轻薄便捷", "红外摄像头"], '
              '"groupIDCopy": ["157122:258003:258004", '
              '"157122:258003:507196"], "specificUrl": '
              '"http://18181818.suning.com", "groupIDCombination": '
              '["157122@A@电脑/办公/外设:258003@A@电脑整机:258004@A@笔记本", '
              '"157122@A@电脑/办公/外设:258003@A@电脑整机:507196@A@创意设计笔记本"], '
              '"activationFlag": "1", "commentShow": "0", "auxdescription": '
              '"北京 上海 南京 郑州 武汉都有库房,就近安排,即送大礼包,详情请咨询客服", "paramValue": "T14"}}',

If you also want review data, that is simple too and takes two steps:

  • Capture the review API on mobile (or even on the PC site); this part is trivial.
  • Traverse every SKU on the PC site to obtain its clusterId. This parameter is required and is generated by JavaScript; with enough time you could probably dig the generation algorithm out of the JS, but Suning's anti-scraping is weak, so it is easier to just request each product page: the returned response.text contains a script with this keyword, which a regex can pull out (see the sketch after this list). My server's disk is almost full, so collecting reviews will have to wait; review data is huge and eats a lot of space.
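
A rough sketch of that extraction, assuming the product page embeds the id in an inline script as something like clusterId:"123456" (the exact pattern should be checked against a real response):

import re
import requests

def get_cluster_id(product_url):
    # Hypothetical helper: fetch the PC product page and pull clusterId out of the inline script.
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    resp = requests.get(product_url, headers=headers, timeout=10)
    match = re.search(r'clusterId["\']?\s*[:=]\s*["\']?(\w+)', resp.text)
    return match.group(1) if match else None

print(get_cluster_id('https://product.suning.com/0070174045/11961839381.html'))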
