Over the past few days I crawled the full catalogs of JD.com and Suning.com: roughly 4.5 million SKUs on JD and over 2.8 million on Suning. A few items inevitably slipped through, but not many. Both crawls call the mobile APP interfaces, which Fiddler can capture; the main thing to watch is whether the constructed URLs are reasonable, otherwise it is easy to miss a lot of SKUs.
I remember a JD vice president saying JD carries over 5 million SKUs, which roughly matches my crawl once you allow for program error rates and anti-scraping. JD bans IPs aggressively, and crawling its whole site probably takes on the order of 100,000 IPs; Suning is comparatively easy, and every product can be crawled in a single day.
Let me start by documenting the Suning crawl.
Goal: crawl every product on the site through the APP interface.
Fiddler makes it easy to capture APP interfaces: root the phone, put it on the same Wi-Fi as the PC, and set a proxy in the phone's Wi-Fi settings, with the PC's LAN IP (query it with ipconfig /all) as the host and 8888 (Fiddler's default port) as the port. Note that if Jupyter Notebook is open on the PC, make sure it is not occupying port 8888; jupyter notebook --port=7777 starts it on a different port.
For the details of using Fiddler, just Google it; I have also covered it in an earlier post. The interface captured here:
https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword=空调免息&st=0&ci=&cf=&sc=&cityId=459&cp=26&iv=-1&ct=-1&sp=&spf=&prune=1&cat1_大家电_cat2_空调_cat3_空调免息
Note that this request searches by keyword; you can also use the ci category code instead. Not every third-level category has such a code, though, so some can only be queried by keyword (a quick check of the interface is sketched below).
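Before building the full URL list, it is worth sanity-checking the interface with a one-off request and looking at goodsCount. This is a minimal sketch based on the capture above; the parameters are copied from that URL (cityId=459, ps=120, and so on), cp is the zero-based page number, and if the endpoint ever returns a JSONP-wrapped payload for your parameters you would need to strip the wrapper before parsing:

import json
import requests

# the clientSearch interface captured with Fiddler, queried here by keyword;
# to query by category instead, leave keyword empty and fill in ci
base = 'https://search.suning.com/emall/mobile/clientSearch.jsonp'
params = {
    'set': 5, 'ps': 120, 'channelId': 'MOBILE',
    'keyword': '空调免息',   # keyword search ...
    'ci': '',                # ... or a third-level category code, when one exists
    'st': 0, 'cityId': 459, 'cp': 0, 'iv': -1, 'ct': -1, 'prune': 1,
}
resp = requests.get(base, params=params, timeout=10)
data = json.loads(resp.text)
print('goodsCount:', data.get('goodsCount'))
print('goods on this page:', len(data.get('goods', [])))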
The key to the Suning crawler is capturing the interface and constructing the URLs accordingly. Here is the code:
import json
import pprint
import re
import redis
from urllib import parse

cat3_urls = []
pp = pprint.PrettyPrinter(indent=4)

with open(r'E:\splash\suning\suning_cat.json', 'r') as f:
    json_dict = json.load(f)

# pp.pprint(json_dict['rs'])
print('num of cat1 categories:', len(json_dict['rs']))
# skip the first entry and the last three entries of the top-level category list
for i in range(1, len(json_dict['rs']) - 3):
    category_1_name = json_dict['rs'][i]['dirName']
    print('category_1: ', category_1_name)
    print('category_1_id: ', json_dict['rs'][i]['id'])
    print(f'num of cat2 categories under {category_1_name}:', len(json_dict['rs'][i]['children']))
    for j in range(len(json_dict['rs'][i]['children'])):
        try:
            # print('name of category_2:', json_dict['rs'][i]['children'][j])
            category_2_name = json_dict['rs'][i]['children'][j]['dirName']
            print('category_2_id:', json_dict['rs'][i]['children'][j]['id'])
            num_cat3 = len(json_dict['rs'][i]['children'][j]['children'])
            print(f'num of categories in category_2_{category_2_name}:', num_cat3)
            print(f'num_{j} cat2_{category_2_name} under category_1_{category_1_name}: ')
            pp.pprint(json_dict['rs'][i]['children'][j]['children'])
            print('\n\ncategory_3:')
            for k in range(0, len(json_dict['rs'][i]['children'][j]['children'])):
                print(f'num{k} cat3 under cat1_{category_1_name}_cat2_{category_2_name}')
                cat_3 = json_dict['rs'][i]['children'][j]['children'][k]
                category_3_name = cat_3['dirName']
                category_3_id = cat_3['id']
                try:
                    # preferred path: the third-level category carries a pcCi code usable as ci
                    category_3_pcCi = cat_3['pcCi']
                    if '图书音像' not in category_1_name:
                        cat3_url = ('https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword=&st=0&ci='
                                    + category_3_pcCi
                                    + '&cf=&sc=&cityId=459&cp=0&iv=-1&ct=-1&sp=&spf=&prune=1'
                                    + f'&cat1_{parse.quote(category_1_name)}_cat2_{parse.quote(category_2_name)}_cat3_{parse.quote(category_3_name)}')
                    else:
                        # the '图书音像' (books & media) top-level category is special and cannot use pcCi, so search by keyword
                        cat3_url = ('https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword='
                                    + parse.quote(category_3_name)
                                    + '&st=0&ci=&cf=&sc=&cityId=459&cp=0&iv=-1&ct=-1&sp=&spf=&prune=1'
                                    + f'&cat1_{parse.quote(category_1_name)}_cat2_{parse.quote(category_2_name)}_cat3_{parse.quote(category_3_name)}')
                except KeyError:
                    # no pcCi: fall back to the category's gotoApp field
                    category_3__gotoApp = cat_3['gotoApp']
                    if 'search' in category_3__gotoApp:
                        print(f'cat3_{category_3_name} search by keyword')
                        cat3_url = ('https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword='
                                    + parse.quote(category_3_name)
                                    + '&st=0&ci=&cf=&sc=&cityId=459&cp=0&iv=-1&ct=-1&sp=&spf=&prune=1'
                                    + f'&cat1_{parse.quote(category_1_name)}_cat2_{parse.quote(category_2_name)}_cat3_{parse.quote(category_3_name)}')
                    elif '100020_' in category_3__gotoApp:
                        pattern = r'100020_\d+'
                        adId = re.search(pattern, category_3__gotoApp)[0]
                        print(adId)
                        if adId:
                            ci = adId.split('_')[1]
                            cat3_url = ('https://ebuy.suning.com/mobile/clientSearch?ch=' + '100020'
                                        + '&iv=-1&keyword=&cityId=459&ci=' + ci
                                        + '&cp=0&ps=120&st=0&cf=&ct=-1&sp=&v=1.6'
                                        + '&cf=&sc=&cityId=459&cp=0&iv=-1&ct=-1&sp=&spf=&prune=1'
                                        + f'&cat1_{parse.quote(category_1_name)}_cat2_{parse.quote(category_2_name)}_cat3_{parse.quote(category_3_name)}')
                        else:
                            cat3_url = ('https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword='
                                        + parse.quote(category_3_name)
                                        + '&cf=&sc=0&cityId=459&ci=' + '&cf=&sc=0&cityId=459&cp=0&iv=-1&ct=-1&sp=&spf=&prune=1'
                                        + f'&cat1_{parse.quote(category_1_name)}_cat2_{parse.quote(category_2_name)}_cat3_{parse.quote(category_3_name)}')
                    else:
                        print('#' * 100, '----Warning!')
                        cat3_url = ('https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword='
                                    + parse.quote(category_3_name)
                                    + '&st=0&ci=&cf=&sc=&cityId=459&cp=0&iv=-1&ct=-1&sp=&spf=&prune=1'
                                    + f'&cat1_{parse.quote(category_1_name)}_cat2_{parse.quote(category_2_name)}_cat3_{parse.quote(category_3_name)}')
                finally:
                    # skip the '品牌出版社' (publisher brand) sub-categories
                    if 'cat2_品牌出版社_cat3_' not in parse.unquote(cat3_url):
                        print(cat3_url)
                        cat3_urls.append(cat3_url)
                    # else:
                    #     print('#' * 100, cat_3)
                    pp.pprint(cat_3)
                    print('\n\n')
        except Exception as e:
            print(e, i, j, category_2_name)

# push the de-duplicated start URLs to Redis; REDIS_HOST and REDIS_PASSWD are defined elsewhere
r = redis.Redis(host=REDIS_HOST, port=7379, db=0, password=REDIS_PASSWD,
                encoding='utf-8', decode_responses=True)
set_redis = set()
set_redis.update(cat3_urls)
r.lpush('suning:start_urls', *set_redis)
print(f'num of cat3_urls: {len(cat3_urls)}')
for i in cat3_urls:
    if 'cat2_品牌出版社_cat3_' not in parse.unquote(i):
        print(parse.unquote(i))
The category JSON file itself can also be obtained from a captured interface; the code is as follows:
import requests
from requests.exceptions import ConnectionError, ReadTimeout
import pprint
import json

url = 'https://search.suning.com/emall/mobile/clientSearch.jsonp?set=5&ps=120&channelId=MOBILE&keyword=11513369305'
cat_url = 'https://ds.suning.com/ds/terminal/categoryInfo/v1/99999998-.jsonp'
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
pp = pprint.PrettyPrinter(indent=4)

# quick test of the search interface itself:
# try:
#     response = requests.get(url, timeout=10, headers=header)
#     if response.status_code == 200:
#         print('yes connected')
#         json_dict = response.json()
#         pp.pprint(json_dict['goods'])
# except (ConnectionError, ReadTimeout):
#     print('connect failed')

try:
    response = requests.get(cat_url, timeout=10, headers=header)
    if response.status_code == 200:
        print('yes connected')
        json_dict = response.json()
        pp.pprint(json_dict['rs'])
        with open(r'E:\splash\suning\suning_cat.json', 'w') as f:
            json.dump(json_dict, f, ensure_ascii=False, indent=4, sort_keys=True)
except (ConnectionError, ReadTimeout):
    print('connect failed')
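For orientation, the shape of suning_cat.json as the URL-construction script above walks it looks roughly like the Python-dict sketch below. The field names come from the code; the concrete values are placeholders (the category names are taken from the captured URL earlier), so treat this as an assumed outline rather than the real file:

# rough shape of suning_cat.json; "..." marks values I am not reproducing here
category_tree = {
    'rs': [                                  # list of top-level (cat1) categories
        {
            'dirName': '大家电',              # placeholder cat1 name
            'id': '...',
            'children': [                    # cat2 categories
                {
                    'dirName': '空调',
                    'id': '...',
                    'children': [            # cat3 categories
                        {'dirName': '空调免息', 'id': '...', 'pcCi': '...'},          # usually carries pcCi
                        {'dirName': '...', 'id': '...', 'gotoApp': '...search...'},  # otherwise only gotoApp
                    ],
                },
            ],
        },
    ],
}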
Once the URLs are constructed, lpush them into Redis and the distributed crawl is ready to go.
Tools: scrapy-redis, MongoDB, a self-built proxy pool of ADSL VPS dial-up servers, and Gerapy for distributed spider management and monitoring.
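For completeness, the scrapy-redis side of settings.py looks roughly like this. REDIS_HOST/REDIS_PASSWD, the MongoPipeline name and the MongoDB settings are placeholders for my own setup rather than anything the framework dictates:

# settings.py (excerpt): scrapy-redis scheduling plus a MongoDB item pipeline
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # requests are scheduled through Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # shared fingerprint de-duplication
SCHEDULER_PERSIST = True                                     # keep the queue between runs
REDIS_URL = 'redis://:REDIS_PASSWD@REDIS_HOST:7379/0'        # the Redis that holds suning:start_urls

ITEM_PIPELINES = {
    'suning.pipelines.MongoPipeline': 300,                   # placeholder pipeline, sketched further below
}
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'suning'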
Without further ado, here is the key spider code:
# -*- coding: utf-8 -*-
import json
import re
from urllib import parse

import scrapy
from scrapy_redis.spiders import RedisSpider

try:
    from urllib.parse import urlparse
except ImportError:
    from urlparse import urlparse

from suning.log import logger
from suning.items import SuningItem


class SuningspdSpider(RedisSpider):
    name = 'suningspd'
    allowed_domains = ['suning.com']
    redis_key = 'suning:start_urls'

    def __init__(self, *args, **kwargs):
        super(SuningspdSpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        url = response.url
        items = []
        if len(response.text) < 50 or response.status == 402:
            logger.info(f'empty response, probably the last page or a timeout: {response.url}')
        else:
            data = json.loads(response.text)
            goods = data['goods']
            logger.info(f'goods: {parse.unquote(url)}')
            goodsCount = data['goodsCount']
            logger.info(f'goodsCount: {goodsCount}')
            # this interface shows at most 50 pages of 120 products each, the same limit as the PC site
            max_page = min(int(goodsCount / 120) + 1, 50)
            for good in goods:
                item = SuningItem()
                good_id = good['partnumberVendorId']
                good_url = 'https://product.suning.com/' + good_id.split('_')[1] + '/' + good_id.split('_')[0] + '.html'
                item['_id'] = good_id
                item['good_sku_id'] = good_id
                item['good_name'] = good['catentdesc']
                item['url'] = good_url
                # cat1/cat2/cat3 names were appended to the start URL when it was constructed
                item['cat1'] = url.split('&')[-1].split('_')[1]
                item['cat2'] = url.split('&')[-1].split('_')[3]
                item['cat3'] = url.split('&')[-1].split('_')[5]
                item['good_info'] = json.dumps(good, ensure_ascii=False)
                items.append(item)
                logger.warning(f'check num of items: {len(items)}')
                yield item
            page = re.search(r'cp=\d+', url)[0].split('=')[1]
            if int(page) < max_page - 1:
                next_page = int(page) + 1
                logger.info(f'{max_page} pages in total, more remaining; moving on to page {next_page}')
                next_url = re.sub(f'cp={page}', f'cp={next_page}', url)
                logger.info(f'next_url: {parse.unquote(next_url)}')
                # yield response.follow(next_url, self.parse)
                yield scrapy.Request(url=next_url, callback=self.parse, dont_filter=False)
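The SuningItem definition and the MongoDB pipeline are not shown above. Based on the fields the spider fills in and the suning.product collection the data ends up in, a sketch of both might look like this; the class name MongoPipeline and the upsert-on-_id behaviour are my assumptions, not necessarily what I actually ran:

# items.py -- the fields the spider above fills in
import scrapy

class SuningItem(scrapy.Item):
    _id = scrapy.Field()          # partnumberVendorId, doubles as the MongoDB primary key
    good_sku_id = scrapy.Field()
    good_name = scrapy.Field()
    url = scrapy.Field()
    cat1 = scrapy.Field()
    cat2 = scrapy.Field()
    cat3 = scrapy.Field()
    good_info = scrapy.Field()    # the raw good dict, dumped to a JSON string


# pipelines.py -- a minimal MongoDB pipeline sketch (class name is my placeholder)
import pymongo

class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.get('MONGO_URI'),
                   crawler.settings.get('MONGO_DATABASE', 'suning'))

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        data = dict(item)
        doc_id = data.pop('_id')
        # upsert on _id so re-crawled SKUs overwrite instead of raising duplicate-key errors
        self.db['product'].update_one({'_id': doc_id}, {'$set': data}, upsert=True)
        return item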
Even the mobile interface caps results at 120 products per page and 50 pages per category, so a popular category yields at most 6,000 SKUs. I worked around this by using the fine-grained third-level categories, 2,397 of them in total, so not much slipped through. With one server plus six or seven ADSL VPS dial-up proxy servers the whole site can be crawled in a day; in the end I collected 2,843,855 SKUs. Plenty is certainly still missing, since I did not go to the PC site and walk every fine category there; to get truly every SKU you would have to enumerate the PC site by fine category plus brand, which is how I crawled JD's 4.5 million SKUs.
The scraped SKU records look like this (two records, abridged; cat1/cat2/cat3 are stored URL-encoded):
{   'good_name': '宏碁Acer 暗影骑士·擎 15.6英寸电竞游戏本RGB背光学生笔记本电脑(i7-10750H 32G 1TB+1TBSSD GTX1660Ti 6G 144Hz)定制',
    'good_sku_id': '11961839381_0070174045',
    'url': 'https://product.suning.com/0070174045/11961839381.html'}

{   '_id': '12200322581_0070640451',
    'cat1': '%E4%BA%8C%E6%89%8B%E4%BC%98%E5%93%81',
    'cat2': '%E7%94%B5%E8%84%91%E5%8A%9E%E5%85%AC',
    'cat3': '%E7%AC%94%E8%AE%B0%E6%9C%AC',
    'good_info': '{"inventory": "0", "auxdescription": "北京 上海 南京 郑州 武汉都有库房,就近安排,即送大礼包,详情请咨询客服", '
                 '"catentdesc": "联想ThinkPad T14 15CD 英特尔酷睿i5 14英寸轻薄笔记本电脑(i5-10210U 32G 1TSSD固态 Win10)高分屏 红外摄像头 定制", '
                 '"catentryId": "12200322581", "partnumber": "12200322581", "price": "7699.0", "brandId": "000052450", '
                 '"salesCode": "0070640451", "salesName": "麦田电脑旗舰店", "goodsType": "1", '
                 '"dynamicImg": "//imgservice4.suning.cn/uimg1/b2c/image/Ews79nBjIFzU-qZ2hLdLoA.jpg", '
                 '"threeGroupId": "258004@507196", "threeGroupName": "创意设计笔记本,笔记本", '
                 '"partnumberVendorId": "12200322581_0070640451", '
                 '"extenalFileds": {"brandNameZh": "ThinkPad", '
                 '"attrShow": [{"attrName": "显卡类型", "attrValue": "集成显卡", ...}, '
                 '{"attrName": "CPU类型", "attrValue": "Intel i5", ...}, {"attrName": "硬盘容量", "attrValue": "1TB", ...}], '
                 '"appAttrTitle": ["商务办公", "轻薄便捷", "红外摄像头"], '
                 '"groupIDCombination": ["157122@A@电脑/办公/外设:258003@A@电脑整机:258004@A@笔记本", '
                 '"157122@A@电脑/办公/外设:258003@A@电脑整机:507196@A@创意设计笔记本"], "paramValue": "T14"}}'}
If you also need review data, that is simple too and takes two steps:
- Capture the review interface from the mobile APP, or even the PC site; this part is trivial.
- Visit every SKU page on the PC site to get its clusterId. This parameter is required and is generated by JS; given enough time you could probably dig the generation algorithm out of the JS code, but Suning's anti-scraping is weak, so it is easier to just fetch each product page: the returned response.text contains a script with this keyword, and a regex pulls it out (see the sketch below). My server's disk is nearly full, so the review crawl will have to wait; the review data is huge and takes a lot of space.
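A rough sketch of that second step: fetch a PC product page and pull clusterId out of the inline script with a regex. The pattern below is only a guess at how the keyword appears in the page source, so inspect a real response.text and adapt it before relying on it:

import re
import requests

HEADERS = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

def get_cluster_id(product_url, timeout=10):
    """Fetch a PC product page and try to extract clusterId from its inline scripts.

    The regex is an assumption about how the keyword appears; adjust it after
    checking the actual page source.
    """
    resp = requests.get(product_url, timeout=timeout, headers=HEADERS)
    match = re.search(r'clusterId["\']?\s*[:=]\s*["\']?(\d+)', resp.text, re.IGNORECASE)
    return match.group(1) if match else None

# e.g. one of the SKUs crawled above
print(get_cluster_id('https://product.suning.com/0070174045/11961839381.html'))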