Scrapy distributed performance optimization

Crawling is I/O-bound, so a crawler system has to be designed with that in mind. A few points to keep in mind:

  1. In Scrapy, higher concurrency is not automatically better; the sweet spot depends entirely on your server bandwidth and on the quality and quantity of your proxy IPs. This article covers it well: https://my.oschina.net/u/4402731/blog/3568083 (set CLOSESPIDER_ITEMCOUNT = 100 and simply time the crawl under different parameter combinations).
  2. Network traffic between nodes has to be taken into account. I have run Scrapy both on my local laptop and on remote servers, with the master node (Redis scheduling and deduplication) on a Tencent Cloud server in Beijing; running the spider locally was visibly slower, since most of the time went into network round trips.
  3. A small-scale crawl does not need a big cluster; the bottleneck is usually the proxy pool. For high-volume work, building your own pool is a must. Mine is very stable: two ADSL VPSes rotating, redialing every 20 seconds, with each dial taking about 6 seconds. Fresh IPs go into Redis with deduplication, and duplicates can simply be dropped. Each IP is stored with a timestamp so that callers know how much lifetime it has left. You can also keep a per-site IP blacklist, which is simple to implement: every time an IP gets banned, increment its value by 1 (bear in mind that every site has its own banning policy, so this takes some probing). The larger the value, the smaller the IP's weight (in effect each IP gets a score) and the lower its priority when proxies are handed out (this only pays off for a large pool; with only a few machines it is unnecessary). The hash can be reset periodically, since no site bans an IP forever; bans are usually lifted after a while, otherwise too many real users would be hit. A minimal sketch of this bookkeeping follows the list.
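Below is a minimal sketch of that Redis bookkeeping using redis-py. The key names (proxy:pool, proxy:penalty) and helper functions are made up for illustration; how a ban is detected, and how the 20-second dial cycle is driven, are left to the caller.

```python
import time
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

POOL_KEY = "proxy:pool"        # hash: proxy -> timestamp of when it was dialed (hypothetical key name)
PENALTY_KEY = "proxy:penalty"  # hash: proxy -> number of times this proxy got banned
DIAL_INTERVAL = 20             # the ADSL VPS redials every 20 seconds

def add_proxy(proxy: str) -> None:
    """Store a freshly dialed proxy with a timestamp; a duplicate simply overwrites the old entry."""
    r.hset(POOL_KEY, proxy, time.time())

def remaining_life(proxy: str) -> float:
    """Seconds left before the next redial invalidates this proxy."""
    dialed_at = float(r.hget(POOL_KEY, proxy) or 0)
    return max(0.0, DIAL_INTERVAL - (time.time() - dialed_at))

def punish(proxy: str) -> None:
    """Call this whenever a request through this proxy gets banned by the target site."""
    r.hincrby(PENALTY_KEY, proxy, 1)

def pick_proxy() -> str:
    """Prefer the proxy with the fewest bans (lowest penalty = highest priority)."""
    proxies = r.hkeys(POOL_KEY)  # raises ValueError below if the pool is empty
    return min(proxies, key=lambda p: int(r.hget(PENALTY_KEY, p) or 0)).decode()

def reset_penalties() -> None:
    """Run periodically: sites unban IPs after a while, so the blacklist should not be permanent."""
    r.delete(PENALTY_KEY)
```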
The tests below all start from DOWNLOAD_TIMEOUT = 2 seconds and DOWNLOAD_DELAY = 0.
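The knobs being swept are ordinary Scrapy settings. A minimal settings.py sketch for these runs might look like the following; the project's own middleware and proxy wiring is omitted, and the values shown are just the starting point of the sweep:

```python
# settings.py -- baseline for the benchmark runs below (sketch)
CONCURRENT_REQUESTS = 32      # the runs below sweep 8 / 16 / 20 / 24 / 32
DOWNLOAD_TIMEOUT = 2          # the runs below sweep 1 / 2 / 5 seconds
DOWNLOAD_DELAY = 0            # one run below tries 0.1
RETRY_TIMES = 3               # one run below raises this to 6
CLOSESPIDER_ITEMCOUNT = 100   # stop after ~100 items so the runs are comparable
HTTPCACHE_ENABLED = True      # source of the httpcache/* counters in the stats
```

The same values can also be overridden per run without touching the file, e.g. `scrapy crawl product1 -s CONCURRENT_REQUESTS=32 -s DOWNLOAD_TIMEOUT=5` (assuming the spider is named product1). The first run, with 32 concurrent requests, produced these stats: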
{'downloader/exception_count': 290,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 290,
 'downloader/request_bytes': 939629,
 'downloader/request_count': 2612,
 'downloader/request_method_count/GET': 2612,
 'downloader/response_bytes': 4586610,
 'downloader/response_count': 2322,
 'downloader/response_status_count/200': 2322,
 'elapsed_time_seconds': 228.855777, ######################
 'finish_reason': 'closespider_itemcount',
 'finish_time': datetime.datetime(2020, 12, 18, 5, 34, 52, 841964),
 'httpcache/firsthand': 2322,
 'httpcache/miss': 2612,
 'httpcache/store': 2322,
 'item_scraped_count': 104,
32 concurrent requests actually took about twice as long as 16 (which finished in roughly 120 seconds)! Next, the same settings with only 8 concurrent requests:
{'downloader/exception_count': 47,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 47,
 'downloader/request_bytes': 326088,
 'downloader/request_count': 917,
 'downloader/request_method_count/GET': 917,
 'downloader/response_bytes': 1413362,
 'downloader/response_count': 870,
 'downloader/response_status_count/200': 870,
 'elapsed_time_seconds': 94.203294,  ######################
 'finish_reason': 'closespider_itemcount',
 'finish_time': datetime.datetime(2020, 12, 18, 5, 49, 1, 169299),
 'httpcache/firsthand': 870,
 'httpcache/miss': 917,
 'httpcache/store': 870,
 'item_scraped_count': 100,
With only 8 concurrent requests the run took slightly less time than the 16-concurrency run (about 120 seconds). Note, though, that request_count dropped noticeably: it normally takes about 5 requests to produce one item (several JSON endpoints are called per item), so lowering concurrency clearly cut the number of retries. Slow is faster!
Next, concurrency back to 16, with DOWNLOAD_TIMEOUT = 5 and DOWNLOAD_DELAY = 0:
{'downloader/exception_count': 7,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 7,
 'downloader/request_bytes': 314651,
 'downloader/request_count': 896,
 'downloader/request_method_count/GET': 896,
 'downloader/response_bytes': 1474882,
 'downloader/response_count': 889,
 'downloader/response_status_count/200': 889,
 'elapsed_time_seconds': 69.643116,  ######################
 'finish_reason': 'closespider_itemcount',
 'finish_time': datetime.datetime(2020, 12, 18, 5, 56, 11, 42312),
 'httpcache/firsthand': 889,
 'httpcache/miss': 896,
 'httpcache/store': 889,
 'item_scraped_count': 106,
 'log_count/DEBUG': 1005,
 'log_count/INFO': 184,
 'memdebug/gc_garbage_count': 0,
 'memdebug/live_refs/Product1Spider': 1,
 'memdebug/live_refs/ProductItem': 7,
 'memdebug/live_refs/Request': 1,
 'memusage/max': 82305024,
 'memusage/startup': 55525376,
 'request_depth_max': 5,
 'response_received_count': 889,
 'retry/count': 7,
 'retry/reason_count/twisted.internet.error.TimeoutError': 7,
 'scheduler/dequeued/redis': 896,
 'scheduler/enqueued/redis': 1147,
 'start_time': datetime.datetime(2020, 12, 18, 5, 55, 1, 399196)}
This run finished in about 70 seconds, so raising DOWNLOAD_TIMEOUT clearly pays off: retries dropped to just 7. Next, increase concurrency to 32 while keeping DOWNLOAD_TIMEOUT at 5 seconds:
{'downloader/exception_count': 61,
 'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 13,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 48,
 'downloader/request_bytes': 454285,
 'downloader/request_count': 1329,
 'downloader/request_method_count/GET': 1329,
 'downloader/response_bytes': 2199392,
 'downloader/response_count': 1268,
 'downloader/response_status_count/200': 1268,
 'elapsed_time_seconds': 132.157984, ######################
 'finish_reason': 'closespider_itemcount',
 'finish_time': datetime.datetime(2020, 12, 18, 6, 4, 29, 718108),
 'httpcache/firsthand': 1268,
 'httpcache/miss': 1329,
 'httpcache/store': 1268,
 'item_scraped_count': 101,
 'log_count/DEBUG': 1433,
 'log_count/INFO': 229,
 'log_count/WARNING': 6,
 'memdebug/gc_garbage_count': 0,
 'memdebug/live_refs/Product1Spider': 1,
 'memdebug/live_refs/ProductItem': 52,
 'memdebug/live_refs/Request': 2,
 'memusage/max': 100614144,
 'memusage/startup': 55521280,
 'request_depth_max': 5,
 'response_received_count': 1268,
 'retry/count': 61,
 'retry/reason_count/twisted.internet.error.ConnectionRefusedError': 13,
 'retry/reason_count/twisted.internet.error.TimeoutError': 48,
 'scheduler/dequeued/redis': 1329,
 'scheduler/enqueued/redis': 1548,
 'start_time': datetime.datetime(2020, 12, 18, 6, 2, 17, 560124)}
Raising concurrency to 32 actually reduced efficiency: 61 retries and 132 seconds, slightly worse even than 16 concurrency with a 2-second timeout. Next, try concurrency 32 with a 1-second timeout and RETRY_TIMES raised from the earlier 3 to 6:
{'downloader/exception_count': 669,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 669,
 'downloader/request_bytes': 729406,
 'downloader/request_count': 2021,
 'downloader/request_method_count/GET': 2021,
 'downloader/response_bytes': 2514467,
 'downloader/response_count': 1352,
 'downloader/response_status_count/200': 1352,
 'elapsed_time_seconds': 181.738274, ######################
 'finish_reason': 'closespider_itemcount',
 'finish_time': datetime.datetime(2020, 12, 18, 6, 13, 34, 245444),
 'httpcache/firsthand': 1352,
 'httpcache/miss': 2021,
 'httpcache/store': 1352,
 'item_scraped_count': 101,
 'log_count/DEBUG': 2125,
 'log_count/ERROR': 2,
 'log_count/INFO': 296,
 'memdebug/gc_garbage_count': 0,
 'memdebug/live_refs/Product1Spider': 1,
 'memdebug/live_refs/ProductItem': 2,
 'memdebug/live_refs/Request': 1,
 'memusage/max': 112705536,
 'memusage/startup': 55529472,
 'request_depth_max': 5,
 'response_received_count': 1352,
 'retry/count': 669,
 'retry/reason_count/twisted.internet.error.TimeoutError': 669,
 'scheduler/dequeued/redis': 2021,
 'scheduler/enqueued/redis': 2576,
 'spider_exceptions/KeyError': 2,
 'start_time': datetime.datetime(2020, 12, 18, 6, 10, 32, 507170)}
That run was a total failure, mainly because I only have two dial-up VPSes; with a large IP pool the outcome would be different. So far, concurrency 16 with DOWNLOAD_TIMEOUT = 5 and DOWNLOAD_DELAY = 0 is the best combination. Now test DOWNLOAD_DELAY itself: concurrency 16, DOWNLOAD_TIMEOUT = 5, DOWNLOAD_DELAY = 0.1:
{'downloader/exception_count': 50,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 50,
 'downloader/request_bytes': 281990,
 'downloader/request_count': 802,
 'downloader/request_method_count/GET': 802,
 'downloader/response_bytes': 1189507,
 'downloader/response_count': 752,
 'downloader/response_status_count/200': 752,
 'elapsed_time_seconds': 138.592611, ######################
 'finish_reason': 'closespider_itemcount',
 'finish_time': datetime.datetime(2020, 12, 18, 6, 20, 56, 171019),
 'httpcache/firsthand': 752,
 'httpcache/miss': 802,
 'httpcache/store': 752,
 'item_scraped_count': 102,
 'log_count/DEBUG': 907,
 'log_count/INFO': 298,
 'memdebug/gc_garbage_count': 0,
 'memdebug/live_refs/Product1Spider': 1,
 'memdebug/live_refs/ProductItem': 3,
 'memdebug/live_refs/Request': 2,
 'memusage/max': 72540160,
 'memusage/startup': 55529472,
 'request_depth_max': 5,
 'response_received_count': 752,
 'retry/count': 50,
 'retry/reason_count/twisted.internet.error.TimeoutError': 50,
 'scheduler/dequeued/redis': 802,
 'scheduler/enqueued/redis': 912,
 'start_time': datetime.datetime(2020, 12, 18, 6, 18, 37, 578408)}
As expected, adding DOWNLOAD_DELAY here brings nothing: the concurrency 16 / DOWNLOAD_TIMEOUT 5 / DOWNLOAD_DELAY 0 combination already had very few retries, and the higher retry count here is most likely just jitter from the proxies. The VPSes redial every 20 seconds, and with a delay of 0.1 the spider makes at most about 10 requests per second, somewhat fewer than the previous 16-concurrency optimum could sustain, so throughput takes a noticeable hit.

One more tweak: concurrency 20, DOWNLOAD_TIMEOUT = 5, DOWNLOAD_DELAY = 0:
{'downloader/exception_count': 12,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 12,
 'downloader/request_bytes': 323044,
 'downloader/request_count': 908,
 'downloader/request_method_count/GET': 908,
 'downloader/response_bytes': 1302214,
 'downloader/response_count': 896,
 'downloader/response_status_count/200': 896,
 'elapsed_time_seconds': 59.553894,  ######################
 'finish_reason': 'closespider_itemcount',
 'finish_time': datetime.datetime(2020, 12, 18, 6, 45, 3, 462824),
 'httpcache/firsthand': 896,
 'httpcache/miss': 908,
 'httpcache/store': 896,
 'item_scraped_count': 105,
 'log_count/DEBUG': 1016,
 'log_count/INFO': 188,
 'log_count/WARNING': 6,
 'memdebug/gc_garbage_count': 0,
 'memdebug/live_refs/Product1Spider': 1,
 'memdebug/live_refs/ProductItem': 56,
 'memdebug/live_refs/Request': 1,
 'memusage/max': 55513088,
 'memusage/startup': 55513088,
 'request_depth_max': 5,
 'response_received_count': 896,
 'retry/count': 12,
 'retry/reason_count/twisted.internet.error.TimeoutError': 12,
 'scheduler/dequeued/redis': 908,
 'scheduler/enqueued/redis': 941,
 'start_time': datetime.datetime(2020, 12, 18, 6, 44, 3, 908930)}
Concurrency 20 with DOWNLOAD_TIMEOUT = 5 and DOWNLOAD_DELAY = 0 shaved about 10 seconds off the 16-concurrency combination! The key point is that retries stayed low at 12, barely increasing. Now try concurrency 24 with the same timeout and delay:
{'downloader/exception_count': 30,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 30,
 'downloader/request_bytes': 310945,
 'downloader/request_count': 876,
 'downloader/request_method_count/GET': 876,
 'downloader/response_bytes': 1312479,
 'downloader/response_count': 846,
 'downloader/response_status_count/200': 846,
 'elapsed_time_seconds': 80.251456,
 'finish_reason': 'closespider_itemcount',
 'finish_time': datetime.datetime(2020, 12, 18, 6, 53, 0, 947113),
 'httpcache/firsthand': 846,
 'httpcache/miss': 876,
 'httpcache/store': 846,
 'item_scraped_count': 104,
 'log_count/DEBUG': 983,
 'log_count/INFO': 199,
 'log_count/WARNING': 7,
 'memdebug/gc_garbage_count': 0,
 'memdebug/live_refs/Product1Spider': 1,
 'memdebug/live_refs/ProductItem': 65,
 'memdebug/live_refs/Request': 1,
 'memusage/max': 86376448,
 'memusage/startup': 55500800,
 'request_depth_max': 5,
 'response_received_count': 846,
 'retry/count': 30,
 'retry/reason_count/twisted.internet.error.TimeoutError': 30,
 'scheduler/dequeued/redis': 876,
 'scheduler/enqueued/redis': 989,
 'start_time': datetime.datetime(2020, 12, 18, 6, 51, 40, 695657)}
The first run at this setting actually took 100 seconds, this one 80, and a later one 70; the variance is large because the proxy IPs fluctuate and the pages being requested differ. Even so, it is clear that concurrency 24 no longer offers any real advantage over 20. Since I did not repeat each configuration many times, the conclusions above carry some error, but the method is sound: Scrapy needs exactly this kind of iterative tuning.
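To make the iteration less manual, a small driver script can run the same spider once per settings combination and let Scrapy print the stats for each run. A minimal sketch, assuming the spider is named product1 and the script is run from the project directory:

```python
# sweep.py -- minimal sketch: run the spider once per settings combination via the scrapy CLI.
import subprocess

COMBOS = [
    {"CONCURRENT_REQUESTS": 16, "DOWNLOAD_TIMEOUT": 5, "DOWNLOAD_DELAY": 0},
    {"CONCURRENT_REQUESTS": 20, "DOWNLOAD_TIMEOUT": 5, "DOWNLOAD_DELAY": 0},
    {"CONCURRENT_REQUESTS": 24, "DOWNLOAD_TIMEOUT": 5, "DOWNLOAD_DELAY": 0},
]

for combo in COMBOS:
    args = ["scrapy", "crawl", "product1", "-s", "CLOSESPIDER_ITEMCOUNT=100"]  # assumed spider name
    for key, value in combo.items():
        args += ["-s", f"{key}={value}"]
    # Each run is a separate process, so the Twisted reactor starts cleanly every time.
    subprocess.run(args, check=True)
```

The elapsed_time_seconds line in the stats dump at the end of each run is the number to compare across combinations.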

These tests were all run with only two dial-up VPSes, so on every retry there was at least a 50% chance of getting the same IP again, which is why retries failed so often. With a large IP pool the strategy would need to change. The target site here is JD.com, and a single IP seems to top out at around 16 concurrent requests; push higher and the failure rate from hammering the site in a short window climbs steeply, while retrying is pointless because there are so few proxy IPs to rotate through. That is why I keep saying the bottleneck in crawling is usually the proxy IPs: with enough IPs, one or two reasonably powerful servers could pull down JD's 5-million-plus SKUs within a day, since the JSON endpoints being called need very little bandwidth; a JSON response is only a few to a dozen KB.

I have read a lot of articles on this, and most are vague copy-paste jobs. For many sites, a large proxy IP pool is the best solution; unless every single request triggers a CAPTCHA, there is no need to obsess over cracking CAPTCHAs. Fully defeating them is very hard and the bar is high: the JavaScript involved often runs to thousands of lines and is obfuscated. That effort is better spent buying a few more dial-up servers. For large-scale crawling, a few dozen rotating dial-up boxes cost only about 45 CNY per box per month, and at that volume the provider will usually build you a machine image, making deployment a one-time effort.

Crawling is a tuning process. High concurrency looks great on paper, but anyone who has tried it knows better: push concurrency up and your IPs get banned just as fast. A site with decent defenses may ban a high-concurrency client within 10-20 seconds, unless you are on high-quality IPs. For example, with a Tencent Cloud IP I found that crawling JD at 16-32 concurrency could hold out for about 2 minutes, and after a ban, sleeping for a while let it keep working for another stretch (I did not test this for long, since being banned makes testing awkward). Dial-up IPs have mostly been worn out by everyone else; the only reason they still work is that their ban period has expired, so they never stay usable for long.

Refreshing the Scrapy httpcache on retry

If Scrapy retries but still comes back with no data, the likely culprit is that you have the HTTP cache enabled.

This Stack Overflow post explains it clearly: "I crawl a webpage page by page, by following urls in pagination. In some pages, website detects that I use a bot and gives me an error in html. Since it is a successful request, it caches the page and when I run it again, I get the same error."

https://stackoverflow.com/questions/41743071/scrapy-how-to-remove-a-url-from-httpcache-or-prevent-adding-to-cache

Getting blocked by anti-bot measures is the most common situation in crawling. With the httpcache enabled, Scrapy caches requests and responses locally (in the .scrapy folder under the project directory) to speed up crawling. The problem is that when you get blocked and retry, the cache is not refreshed, so the page you parse is still the blocked one and you never get the data. So it is worth overriding the cache behaviour to refresh, or simply skip, those entries.
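One way to do this, sketched below, is a custom cache policy that refuses to store responses that look like a ban page, so the next retry hits the network instead of the stale cache. HTTPCACHE_POLICY and DummyPolicy are real Scrapy APIs; the ban markers and the module path are assumptions you will need to adapt to the target site:

```python
# cachepolicy.py -- minimal sketch: never cache responses that look like a ban page,
# so retries fetch a fresh copy instead of replaying the blocked one from .scrapy/httpcache.
from scrapy.extensions.httpcache import DummyPolicy

BAN_MARKERS = (b"captcha", b"verify")  # hypothetical markers; adapt to how the site signals a ban


class SkipBannedPagesPolicy(DummyPolicy):
    def should_cache_response(self, response, request):
        # Only cache clean 200 responses whose body does not look like a ban page.
        if response.status != 200:
            return False
        body = response.body.lower()
        if any(marker in body for marker in BAN_MARKERS):
            return False
        return super().should_cache_response(response, request)
```

Enable it with HTTPCACHE_POLICY = 'myproject.cachepolicy.SkipBannedPagesPolicy' in settings.py (myproject is a placeholder). Another option is setting request.meta['dont_cache'] = True on a retried request so that particular fetch bypasses the cache entirely.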