Suning.com full-site crawler (distributed scraping with scrapy-redis, stored in MongoDB)

Capture the mobile-app API with Fiddler, then crawl every product on Suning.com at high speed without putting much load on their servers.

Over the past few days I crawled the full catalogs of JD.com and Suning.com: roughly 4.5 million SKUs on JD and over 2.8 million on Suning. A few items certainly slipped through, but not many. Both crawls hit the mobile APP APIs, which Fiddler captures easily; the only tricky part is constructing the URLs correctly, otherwise it is easy to miss a lot of SKUs.

I remember a JD vice president saying JD carries a bit over 5 million SKUs, which roughly matches my crawl once you account for error rate and anti-scraping losses. JD bans IPs aggressively, so crawling its full site probably needs on the order of a hundred thousand IPs; Suning is much friendlier, and its whole catalog can be crawled in a day.

Let me first write up the Suning crawl.

Goal: crawl every product on the site through the APP API.

Fiddler makes it easy to capture app traffic. Root the phone and put it on the same Wi-Fi as the PC, then set a manual proxy in the phone's Wi-Fi settings: the hostname is the PC's LAN IP (ipconfig /all will show it) and the port is 8888, Fiddler's default. One thing to watch out for: if Jupyter Notebook is running on the PC, make sure it is not occupying port 8888; starting it with jupyter notebook --port=7777 moves it to a different port.

How to use Fiddler itself is easy to google, and I have covered it on this blog before. The endpoint captured here is:

https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword=空调免息&st=0&ci=&cf=&sc=&cityId=459&cp=26&iv=-1&ct=-1&sp=&spf=&prune=1&cat1_大家电_cat2_空调_cat3_空调免息
Note that this particular request searches by keyword; you can also filter by the ci category code instead. Not every third-level category has such a code, though, so some of them can only be reached through keyword search.
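
To sanity-check the endpoint before building the full crawler, a minimal requests sketch like the one below works; the parameter values (set, ps, cityId and so on) are simply copied from my Fiddler capture, so treat them as assumptions and adjust them to your own session:

import requests

# Query one page of the captured clientSearch endpoint by keyword.
# Parameter values come from the Fiddler capture and may need adjusting.
base = 'https://search.suning.com/emall/mobile/clientSearch.jsonp'
params = {
    'set': 5, 'ps': 120, 'channelId': 'MOBILE',
    'keyword': '空调', 'st': 0, 'ci': '', 'cityId': 459,
    'cp': 0, 'iv': -1, 'ct': -1, 'prune': 1,
}
header = {'user-agent': 'Mozilla/5.0'}
response = requests.get(base, params=params, timeout=10, headers=header)
data = response.json()    # despite the .jsonp name, the body parsed as plain JSON in my capture
print('goodsCount:', data.get('goodsCount'))
print('goods on this page:', len(data.get('goods', [])))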

The key to the Suning crawler is capturing the traffic and constructing the URLs from the category tree. Here is the code:

import json
import pprint
import re
import redis
from urllib import parse

cat3_urls = []
pp = pprint.PrettyPrinter(indent=4)
with open(r'E:\splash\suning\suning_cat.json', 'r', encoding='utf-8') as f:
    json_dict = json.load(f)    
#     pp.pprint(json_dict['rs'])
    print('num of cat1 categories:', len(json_dict['rs']))
    
    for i in range(1, len(json_dict['rs'])-3):
        category_1_name = json_dict['rs'][i]['dirName']
        print('category_1: ', category_1_name)
        print('category_1_id: ', json_dict['rs'][i]['id'])
        print(f'num of cat2 categories under {category_1_name}:', len(json_dict['rs'][i]['children']))
        for j in range(len(json_dict['rs'][i]['children'])):
            try:
#                 print('name of category_2:', json_dict['rs'][i]['children'][j])
                category_2_name = json_dict['rs'][i]['children'][j]['dirName']
                print('category_2_id:', json_dict['rs'][i]['children'][j]['id'])
                num_cat3 = len(json_dict['rs'][i]['children'][j]['children'])
                print(f'num of categories in category_2_{category_2_name}:', num_cat3)  
                print(f'num_{j} cat2_{category_2_name} under category_1_{category_1_name}: ')
                pp.pprint(json_dict['rs'][i]['children'][j]['children'])
                print('\n\ncategory_3:')
                for k in range(0, len(json_dict['rs'][i]['children'][j]['children'])):
                    print(f'num{k} cat3 under cat1_{category_1_name}_cat2_{category_2_name}')
                    cat_3 = json_dict['rs'][i]['children'][j]['children'][k]
                    category_3_name = cat_3['dirName']
                    category_3_id = cat_3['id']
                    try:
                        category_3_pcCi = cat_3['pcCi']
                        if '图书音像' not in category_1_name:
                            
                            cat3_url = 'https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword=&st=0&ci=' + category_3_pcCi + '&cf=&sc=&cityId=459&cp=0&iv=-1&ct=-1&sp=&spf=&prune=1'+ f'&cat1_{parse.quote(category_1_name)}_cat2_{parse.quote(category_2_name)}_cat3_{parse.quote(category_3_name)}'
                        else:
                            # the '图书音像' (books & media) top category is special and cannot use pcCi
                            cat3_url = 'https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword=' + parse.quote(category_3_name) + '&st=0&ci=&cf=&sc=&cityId=459&cp=0&iv=-1&ct=-1&sp=&spf=&prune=1' + f'&cat1_{parse.quote(category_1_name)}_cat2_{parse.quote(category_2_name)}_cat3_{parse.quote(category_3_name)}' 
                    except KeyError:
                        category_3__gotoApp = cat_3['gotoApp']
                        if 'search' in category_3__gotoApp:
                            print(f'cat3_{category_3_name} search by keyword')
                            cat3_url = 'https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword=' + parse.quote(category_3_name) + '&st=0&ci=&cf=&sc=&cityId=459&cp=0&iv=-1&ct=-1&sp=&spf=&prune=1' + f'&cat1_{parse.quote(category_1_name)}_cat2_{parse.quote(category_2_name)}_cat3_{parse.quote(category_3_name)}' 
                        elif '100020_' in category_3__gotoApp:
                            pattern = r'100020_\d+'
                            adId = re.search(pattern, category_3__gotoApp)[0]
                            print(adId)
                            if adId:
                                ci = adId.split('_')[1]
                                cat3_url = 'https://ebuy.suning.com/mobile/clientSearch?ch=' + '100020' + '&iv=-1&keyword=&cityId=459&ci=' + ci + '&cp=0&ps=120&st=0&cf=&ct=-1&sp=&v=1.6'  + '&cf=&sc=&cityId=459&cp=0&iv=-1&ct=-1&sp=&spf=&prune=1' + f'&cat1_{parse.quote(category_1_name)}_cat2_{parse.quote(category_2_name)}_cat3_{parse.quote(category_3_name)}'
                            else:
                                cat3_url = 'https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword=' + parse.quote(category_3_name) + '&cf=&sc=0&cityId=459&ci=' + '&cf=&sc=0&cityId=459&cp=0&iv=-1&ct=-1&sp=&spf=&prune=1' + f'&cat1_{parse.quote(category_1_name)}_cat2_{parse.quote(category_2_name)}_cat3_{parse.quote(category_3_name)}'                       
                                
                        else:
                            print('#'*100, '----Warning!')
                            cat3_url = 'https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword=' + parse.quote(category_3_name) + '&st=0&ci=&cf=&sc=&cityId=459&cp=0&iv=-1&ct=-1&sp=&spf=&prune=1' + f'&cat1_{parse.quote(category_1_name)}_cat2_{parse.quote(category_2_name)}_cat3_{parse.quote(category_3_name)}' 
                    finally:
                        if 'cat2_品牌出版社_cat3_' not in parse.unquote(cat3_url):
                            print(cat3_url)
                            cat3_urls.append(cat3_url)
#                     else:
#                         print('#'*100, cat_3)
                    pp.pprint(cat_3)
                    print('\n\n')
                
            except Exception as e:
                print(e, i, j, category_2_name )

# REDIS_HOST and REDIS_PASSWD are defined elsewhere (redacted here)
r = redis.Redis(host=REDIS_HOST, port=7379, db=0, password=REDIS_PASSWD, encoding='utf-8', decode_responses=True)
set_redis = set(cat3_urls)    # de-duplicate before pushing to Redis
r.lpush('suning:start_urls', *set_redis)
print(f'num of cat3_urls: {len(cat3_urls)}')            
for i in cat3_urls:
    if 'cat2_品牌出版社_cat3_' not in parse.unquote(i):
        print(parse.unquote(i))

The category JSON file itself can be obtained by capturing another endpoint; the code below fetches and saves it:

import requests
from requests.exceptions import ConnectionError, ReadTimeout
import pprint
import json

url = 'https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword=11513369305'
cat_url = 'https://ds.suning.com/ds/terminal/categoryInfo/v1/99999998-.jsonp'
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
pp = pprint.PrettyPrinter(indent=4)
# try:
#     response = requests.get(url, timeout=10, headers=header)
#     if response.status_code == 200:
#         print('yes connected')
# #         print(response.text)
#         json_dict = response.json()
#         pp.pprint(json_dict['goods'])
        
# except (ConnectionError, ReadTimeout):
#     print('connect failed')
    
try:
    response = requests.get(cat_url, timeout=10, headers=header)
    if response.status_code == 200:
        print('yes connected')
#         print(response.text)
        json_dict = response.json()
        pp.pprint(json_dict['rs'])
        with open(r'E:\splash\suning\suning_cat.json', 'w', encoding='utf-8') as f:
            json.dump(json_dict, f, ensure_ascii=False, indent=4, sort_keys=True)        
except (ConnectionError, ReadTimeout):
    print('connect failed')    

Once the URLs are built and lpush-ed into Redis, the distributed crawl can start.

Tools: scrapy-redis, MongoDB, a self-built proxy pool of ADSL VPS dial-up servers, and Gerapy for managing and monitoring the distributed crawlers.
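
For reference, a minimal scrapy-redis wiring in settings.py might look like the sketch below; the Redis URL, the concurrency numbers and the pipeline path suning.pipelines.MongoPipeline are my own placeholders, not copied from the project:

# settings.py -- minimal scrapy-redis wiring (values are placeholders)

# Share one request queue and one dupefilter across all workers via Redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True    # keep the queue between runs

# The same Redis instance that suning:start_urls was lpush-ed into.
REDIS_URL = "redis://:REDIS_PASSWD@REDIS_HOST:7379/0"

# Write scraped items to MongoDB (pipeline sketch further below).
ITEM_PIPELINES = {
    "suning.pipelines.MongoPipeline": 300,
}

DOWNLOAD_TIMEOUT = 15
CONCURRENT_REQUESTS = 32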

Without further ado, the key spider code:

# -*- coding: utf-8 -*-
import scrapy
import json
from scrapy_redis.spiders import RedisSpider
try:
    from urllib.parse import urlparse
except ImportError:
    from urlparse import urlparse
from suning.log import logger
from suning.items import SuningItem
import re
from urllib import parse

class SuningspdSpider(RedisSpider):
    name = 'suningspd'
    allowed_domains = ['suning.com']
    redis_key = 'suning:start_urls'
    
    def __init__(self, *args, **kwargs):
        super(SuningspdSpider, self).__init__(*args, **kwargs)
        
    def parse(self, response):
        url = response.url
        items = []
        if len(response.text) < 50 or response.status==402:
            logger.info(f'Empty response, probably the last page or a timeout: {response.url}')
        else:    
            goods = json.loads(response.text)['goods']
            logger.info(f'goods: {parse.unquote(url)}')
            goodsCount = json.loads(response.text)["goodsCount"]
            logger.info(f'goodsCount: {goodsCount}')
            # this API returns at most 50 pages of 120 items each, the same limit as the PC site
            max_page = min(int(goodsCount/120) + 1, 50)
            for good in goods:
                item = SuningItem()
                # logger.info(f'good: {good}\n\n')
                good_id = good['partnumberVendorId']
                good_url = 'https://product.suning.com/' + good_id.split('_')[1] +'/' + good_id.split('_')[0]  + '.html'
                # logger.info(good_id)
                item['_id'] = good_id
                item['good_sku_id'] = good_id
                item['good_name'] = good["catentdesc"]
                item['url'] = good_url
                item['cat1'] = url.split('&')[-1].split('_')[1]
                item['cat2'] = url.split('&')[-1].split('_')[3]
                item['cat3'] = url.split('&')[-1].split('_')[5]
                item['good_info'] = json.dumps(good, ensure_ascii=False)
                # logger.info(f'check item: {item}') 
                items.append(item)
                logger.warning(f'check num of items: {len(items)}')
                yield item
            page = re.search(r'cp=\d+', url)[0].split('=')[1]
            if int(page) < max_page-1:            
                next_page = int(page) + 1
                logger.info(f'{max_page} pages in total, moving on to page {next_page}')
                next_url = re.sub(f'cp={page}', f'cp={next_page}', url)
                logger.info(f'next_url: {parse.unquote(next_url)}')
                # yield response.follow(next_url, self.parse)
                yield scrapy.Request(url=next_url, callback=self.parse, dont_filter=False)   
      

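The MongoDB side is an ordinary item pipeline. A minimal sketch is shown below; the database and collection names suning / product match the duplicate-key error quoted further down, while the rest (the MONGO_URI setting, the class name) are placeholders. Judging by the E11000 errors, the original run used plain inserts; an upsert keyed on _id avoids them:

# pipelines.py -- minimal MongoDB pipeline sketch (connection details are placeholders)
import pymongo


class MongoPipeline:
    def __init__(self, mongo_uri):
        self.mongo_uri = mongo_uri

    @classmethod
    def from_crawler(cls, crawler):
        return cls(mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"))

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        # db "suning", collection "product": the names visible in the
        # duplicate-key error message quoted below
        self.collection = self.client["suning"]["product"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # _id is the SKU id, so a re-crawled item simply overwrites the old document
        self.collection.replace_one({"_id": item["_id"]}, dict(item), upsert=True)
        return item
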
Even the mobile API has the 120-items-per-page, 50-pages-per-category limit, so a hot category yields at most 6,000 SKUs. Because I used the finest-grained categories (2,397 of them in total), not much slipped through. With one main server plus six or seven ADSL VPS dial-up proxy servers, the whole site was crawled in a day, 2,843,855 SKUs in the end. Plenty of items are still missing, since I did not go to the PC site and walk every fine-grained category; if you need the complete catalog, enumerate SKUs on the PC site by fine-grained category plus brand, which is how I crawled JD's 4.5 million SKUs.

A scraped SKU item looks like this (the pasted log also caught part of a MongoDB duplicate-key error from the pipeline):

 'good_name': '宏碁Acer 暗影骑士·擎 15.6英寸电竞游戏本RGB背光学生笔记本电脑(i7-10750H 32G 1TB+1TBSSD '
              'GTX1660Ti 6G 144Hz)定制',
 'good_sku_id': '11961839381_0070174045',
 'url': 'https://product.suning.com/0070174045/11961839381.html'}}, {'index': 33, 'code': 11000, 'keyPattern': {'_id': 1}, 'keyValue': {'_id': '12200322581_0070640451'}, 'errmsg': 'E11000 duplicate key error collection: suning.product index: _id_ dup key: { _id: "12200322581_0070640451" }', 'op': {'_id': '12200322581_0070640451',
 'cat1': '%E4%BA%8C%E6%89%8B%E4%BC%98%E5%93%81',
 'cat2': '%E7%94%B5%E8%84%91%E5%8A%9E%E5%85%AC',
 'cat3': '%E7%AC%94%E8%AE%B0%E6%9C%AC',
 'good_info': '{"inventory": "0", "auxdescription": "北京 上海 南京 郑州 '
              '武汉都有库房,就近安排,即送大礼包,详情请咨询客服", "catentdesc": "联想ThinkPad T14 15CD '
              '英特尔酷睿i5 14英寸轻薄笔记本电脑(i5-10210U 32G 1TSSD固态 Win10)高分屏 红外摄像头 定制", '
              '"catentryId": "12200322581", "countOfarticle": "0", '
              '"partnumber": "12200322581", "totalCount": "0", "price": '
              '"7699.0", "brandId": "000052450", "saleStatus": 0, '
              '"contractInfos": "", "snFlag": false, "author": "", '
              '"suningSale": false, "praiseRate": "", "salesCode": '
              '"0070640451", "salesName": "麦田电脑旗舰店", "salesCode10": '
              '"0070640451", "beancurdFlag": "0", "goodsType": "1", '
              '"priceType": "2", "filters": [{}, {}, {}], "filterAttr": true, '
              '"isFav": true, "hwgLable": false, "spsLable": false, '
              '"baoguangHwg": "0", "dynamicImg": '
              '"//imgservice4.suning.cn/uimg1/b2c/image/Ews79nBjIFzU-qZ2hLdLoA.jpg", '
              '"docType": 1, "specificUrl": "http://18181818.suning.com", '
              '"salesUrl": "", "threeGroupId": "258004@507196", '
              '"threeGroupName": "创意设计笔记本,笔记本", "priority": "2", "orderType": '
              '"0", "snpmDity": "2", "shortBrandId": "2450", "isNew": "1", '
              '"partnumberVendorId": "12200322581_0070640451", "msList": [], '
              '"salesCount": "0", "jlfDirGroupIdList": [], "xdGroupIDCopy": '
              '[], "mdGoodType": "", "catalog": '
              '"NPC,ALL,NO,XDALL,10051,SNZB,NGD,ftzm,fnaep,wx001,vedio,wx002,XDC,wx003,NH", '
              '"isSnPharmacy": false, "extenalFileds": {"lpg_activeId": '
              '"null", "brandNameZh": "ThinkPad", "detailsUrl": '
              '"https://m4.pptvyun.com/pvod/e11a0/cjKfRo_ggIlgD6XsDEgSWDXrrCs/eyJkbCI6MTU5MDQ3OTg4NywiZXMiOjYwNDgwMCwiaWQiOiIwYTJkb3FpZHBLZWRucS1MNEsyZG9hZmhvNmljb0thY29hayIsInYiOiIxLjAifQ/0a2doqidpKednq-L4K2doafho6icoKacoak.mp4", '
              '"ZYLY": "0", "goodType": "Z001", "attrShow": [{"attrAppDesc": '
              '"", "attrAppTrueValue": "集成显卡", "attrAppValue": "集成显卡", '
              '"attrId": "solr_1855_attrId", "attrName": "显卡类型", "attrValue": '
              '"集成显卡", "attrValueId": "24087", "sort": "92.0"}, '
              '{"attrAppDesc": "", "attrAppTrueValue": "Intel i5", '
              '"attrAppValue": "Intel i5", "attrId": "solr_6160_attrId", '
              '"attrName": "CPU类型", "attrValue": "Intel i5", "attrValueId": '
              '"45969", "sort": "91.0"}, {"attrAppDesc": "", '
              '"attrAppTrueValue": "1TB", "attrAppValue": "1TB", "attrId": '
              '"solr_2088_attrId", "attrName": "硬盘容量", "attrValue": "1TB", '
              '"attrValueId": "2280366", "sort": "0.0"}], "mdmGroupId": '
              '"R1502001", "appAttrTitle": ["商务办公", "轻薄便捷", "红外摄像头"], '
              '"groupIDCopy": ["157122:258003:258004", '
              '"157122:258003:507196"], "specificUrl": '
              '"http://18181818.suning.com", "groupIDCombination": '
              '["157122@A@电脑/办公/外设:258003@A@电脑整机:258004@A@笔记本", '
              '"157122@A@电脑/办公/外设:258003@A@电脑整机:507196@A@创意设计笔记本"], '
              '"activationFlag": "1", "commentShow": "0", "auxdescription": '
              '"北京 上海 南京 郑州 武汉都有库房,就近安排,即送大礼包,详情请咨询客服", "paramValue": "T14"}}',

If you also want review data, it takes two steps:

  • Capture the review API from the mobile app or even the PC site, which is straightforward
  • Walk every SKU on the PC site to get its clusterId. This parameter is required and is generated by JavaScript; given enough time you could probably reverse the JS and reproduce it, but Suning's anti-scraping is mild, so it is easier to just fetch each product page: the returned response.text contains a script block with that keyword, and a regex pulls it out (see the sketch after this list). My server's disk is nearly full, so the review crawl will have to wait; review data is huge and takes a lot of space.
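
A minimal sketch of that extraction, assuming the inline script embeds the value under the name clusterId (inspect response.text and adapt the regex to whatever the page actually contains):

import re
import requests

# Sketch: pull the clusterId needed by the review API out of a product page.
# The exact pattern is an assumption; check response.text and adjust it.
def get_cluster_id(product_url):
    header = {'user-agent': 'Mozilla/5.0'}
    response = requests.get(product_url, timeout=10, headers=header)
    match = re.search(r'clusterId["\']?\s*[:=]\s*["\'](\w+)["\']', response.text)
    return match.group(1) if match else None

print(get_cluster_id('https://product.suning.com/0070174045/11961839381.html'))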

MongoDB memory monitoring and management

Several ways to monitor and manage MongoDB and to see its resource usage and running state.

My 8 GB Tencent Cloud server is running out of memory; that is what crawling the full catalogs of two e-commerce sites gets you. The first culprit is mongodb. Here is the output of top -p $(pidof mongod):

(Screenshot: memory usage from top -p $(pidof mongod))

https://www.bmc.com/blogs/mongodb-memory-usage-and-management/

It shows mongodb using about half of the memory, with redis-server taking the other half. Both are real memory hogs!

You can also check from the mongodb shell (or from Python, as in the sketch after the field descriptions below):

> db.serverStatus().mem
{ "bits" : 64, "resident" : 2642, "virtual" : 4628, "supported" : true }
  • resident — the actual physical memory (RAM) used by the mongod process, in MiB
  • virtual — RAM plus memory extended into the file-system cache, i.e. virtual memory
  • mapped — since version 3.2 MongoDB no longer memory-maps data files; that belonged to the old MMAPv1 storage engine, and WiredTiger is now the default
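
The same numbers can be read programmatically with pymongo, which is handy for a cron job or a quick check from a script; a minimal sketch, assuming a local unauthenticated mongod:

import pymongo

# Read the same mem section via the serverStatus command.
# Assumes a local mongod without auth; adjust the URI otherwise.
client = pymongo.MongoClient('mongodb://localhost:27017')
mem = client.admin.command('serverStatus')['mem']
print('resident (MiB):', mem['resident'], 'virtual (MiB):', mem['virtual'])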

You can also run these two commands in the mongodb shell:

> var mem = db.serverStatus().tcmalloc;
> mem.tcmalloc.formattedString
------------------------------------------------
MALLOC:     2760578144 ( 2632.7 MiB) Bytes in use by application
MALLOC: +    416550912 (  397.3 MiB) Bytes in page heap freelist
MALLOC: +     17226192 (   16.4 MiB) Bytes in central cache freelist
MALLOC: +        33792 (    0.0 MiB) Bytes in transfer cache freelist
MALLOC: +     12516816 (   11.9 MiB) Bytes in thread cache freelists
MALLOC: +     13762560 (   13.1 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   3220668416 ( 3071.5 MiB) Actual memory used (physical + swap)
MALLOC: +     58572800 (   55.9 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   3279241216 ( 3127.3 MiB) Virtual address space used
MALLOC:
MALLOC:          67767              Spans in use
MALLOC:             45              Thread heaps in use
MALLOC:           4096              Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.

Is there a fancier way? There is, and it is free. In the MongoDB shell run:

> db.enableFreeMonitoring()
{
	"state" : "enabled",
	"message" : "To see your monitoring data, navigate to the unique URL below. Anyone you share the URL with will also be able to view this page. You can disable monitoring at any time by running db.disableFreeMonitoring().",
	"url" : "https://cloud.mongodb.com/freemonitoring/cluster/xxxxxxxxxxxxxxxxxxx",
	"userReminder" : "",
	"ok" : 1
}

I replaced my monitoring URL with xxxxxx for safety's sake. A quick look:

(Screenshot: MongoDB's free monitoring web UI)

There are more metrics to explore; for my purposes it clearly shows the crawler's progress and mongodb's resource usage.