Scrapy process killed out of nowhere (the kdevtmpfsi virus exploited scrapyd to plant a mining program)

A summary up front: when Redis could not allocate memory, the system's OOM killer selectively killed the scrapy process. At first I thought I had found the fix by setting a maximum memory for redis and allowing memory overcommit, but in the end it turned out that the notorious kdevtmpfsi virus had exploited a scrapyd hole to plant itself on all three of my servers. I have to admit I let my guard down when installing scrapyd: I exposed its port straight to the internet without noticing that the service has no authentication, so anyone could run crawler jobs on my machines. Follow this post to clean it out: https://blog.csdn.net/weixin_39334709/article/details/109135642 , and be sure to delete the cron task it installs.
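
For the record, the cleanup boils down to killing the miner, removing its files and the cron entry that re-downloads it, and locking scrapyd away from the internet. A rough sketch (the file paths and the kinsing loader are what kdevtmpfsi infections typically involve, and the firewalld line assumes the port was opened that way; see the post above for the full steps):

# find the miner and its loader (kdevtmpfsi is typically dropped by kinsing)
ps aux | grep -E 'kdevtmpfsi|kinsing' | grep -v grep
pkill -9 -f kinsing; pkill -9 -f kdevtmpfsi
rm -f /tmp/kdevtmpfsi /var/tmp/kinsing

# the critical step: remove the cron entry the malware added to re-download itself
crontab -l
crontab -e

# and stop exposing scrapyd (default port 6800) to the public internet, e.g. with firewalld on CentOS
firewall-cmd --permanent --remove-port=6800/tcp && firewall-cmd --reload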

It looks simple, but it actually took me two days to track down, with strace, py-spy, and of course plenty of google and stackoverflow. At first I suspected the redis-based request dedup: I had never crawled at such a high rate before, so the pressure may simply never have been high enough to expose a problem, which seemed to explain why scrapy got killed on Linux (below) while the same spider ran fine on Windows.

After some change to a setting or to the code, scrapy started dying inexplicably. The strange part was that there were no logs at all (nothing relevant under /var/log); the terminal just printed "Killed". That looks exactly like the output of kill -9, except I never ran that command. py-spy shows resource usage, but it exits the moment the traced process dies, so it was no help here. And the script only gets killed on Linux; on Windows it runs without any issue, which made it all the more maddening.
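
With nothing under /var/log, the usual next stop is the kernel ring buffer, since an OOM kill normally leaves a trace there. Something along these lines (generic commands, not specific to this setup; the journalctl variant only applies on systemd boxes such as CentOS 7):

dmesg -T | grep -i -E 'out of memory|killed process'
journalctl -k | grep -i oom
grep -i 'killed process' /var/log/messages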

strace occasionally catches shutdown() being called, but that tells us nothing useful and gives no hint of where the kill comes from:

18:04:59.880446 connect(267, {sa_family=AF_INET, sin_port=htons(3389), sin_addr=inet_addr("114.104.128.151")}, 16) = -1 EINPROGRESS (Operation now in progress) <0.000043>
18:04:59.880533 epoll_ctl(5, EPOLL_CTL_ADD, 267, {EPOLLIN, {u32=267, u64=140496970187019}}) = 0 <0.000010>
18:04:59.880574 epoll_ctl(5, EPOLL_CTL_MOD, 267, {EPOLLIN|EPOLLOUT, {u32=267, u64=140496970187019}}) = 0 <0.000006>
18:04:59.880618 getsockopt(435, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 <0.000007>
18:04:59.880650 connect(435, {sa_family=AF_INET, sin_port=htons(3389), sin_addr=inet_addr("114.104.128.151")}, 16) = -1 EINPROGRESS (Operation now in progress) <0.000021>
18:04:59.880701 epoll_ctl(5, EPOLL_CTL_ADD, 435, {EPOLLIN, {u32=435, u64=140496970187187}}) = 0 <0.000007>
18:04:59.880735 epoll_ctl(5, EPOLL_CTL_MOD, 435, {EPOLLIN|EPOLLOUT, {u32=435, u64=140496970187187}}) = 0 <0.000007>
18:04:59.880834 epoll_wait(5, [{EPOLLOUT, {u32=493, u64=140496970187245}}, {EPOLLOUT, {u32=495, u64=140496970187247}}, {EPOLLOUT, {u32=483, u64=140496970187235}}, {EPOLLOUT, {u32=484, u64=140496970187236}}, {EPOLLIN, {u32=496, u64=140496970187248}}], 629, 0) = 5 <0.000013>
18:04:59.880931 sendto(493, "\27\3\3\0\23\223\316,\360\247K\355\302v{\312\r\327\271\377!\212\222q", 24, 0, NULL, 0) = 24 <0.000019>
18:04:59.881084 epoll_ctl(5, EPOLL_CTL_DEL, 493, 0x7ffd97c330cc) = 0 <0.000009>
18:04:59.881173 shutdown(493, SHUT_RDWR) = 0 <0.000016>
18:04:59.881213 close(493) = 0 <0.000012>
18:04:59.881372 sendto(495, "\27\3\3\0\23\22\302H\334\320\f\f\3348\352$\376,\261\254\277\317P\361", 24, 0, NULL, 0) = 24 <0.000018>
18:04:59.881431 epoll_ctl(5, EPOLL_CTL_DEL, 495, 0x7ffd97c330cc) = 0 <0.000008>
18:04:59.881504 shutdown(495, SHUT_RDWR) = 0 <0.000014>
18:04:59.881541 close(495) = 0 <0.000008>
18:04:59.881642 sendto(483, "\27\3\3\1+n\34\346\24\271/W\274q\312$\227\224}\302\25\22)\310\350M0j\375p\10Z"…, 304, 0, NULL, 0) = 304 <0.000016>
18:04:59.881699 epoll_ctl(5, EPOLL_CTL_MOD, 483, {EPOLLIN, {u32=483, u64=140496970187235}}) = 0 <0.000009>
18:04:59.881755 sendto(484, "\27\3\3\1-\335\330:\223E\277\31\244w\202\220\t\310\200\363o6?\346V\267\273&\35]\242z"…, 306, 0, NULL, 0) = 306 <0.000032>
18:04:59.881825 epoll_ctl(5, EPOLL_CTL_MOD, 484, {EPOLLIN, {u32=484, u64=140496970187236}}) = 0 <0.000007>
18:04:59.881881 recvfrom(496, "\26\3\3\0z\2\0\0v\3\3}\345\315\324\272\261\t\245O;\375\226\223\4+\313\332\v\\250+"…, 65536, 0, NULL, NULL) = 2776 <0.000011>
18:04:59.883097 write(1, "{'_id': '12331351',\n 'product_ad"…, 1985) = 1985 <0.000032>
18:04:59.883166 write(1, "\n", 1) = 1 <0.000007>
18:04:59.886004 epoll_wait(5, [], 627, 0) = 0 <0.000008>
18:04:59.886065 epoll_wait(5, [], 627, 18) = 0 <0.018731>
18:04:59.905986 epoll_wait(5, [], 627, 0) = 0 <0.000007>
18:04:59.906621 ioctl(14, FIONBIO, [1]) = 0 <0.000007>
18:04:59.906667 recvfrom(14, 0x55ed7f72e4d0, 65536, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.000007>
18:04:59.906719 ioctl(14, FIONBIO, [1]) = 0 <0.000006>
18:04:59.906780 poll([{fd=14, events=POLLOUT}], 1, 120000) = 1 ([{fd=14, revents=POLLOUT}]) <0.000008>
18:04:59.906829 sendto(14, "3\r\n$4\r\nSADD\r\n$20\r\nproduct_1:dup"…, 88, 0, NULL, 0) = 88 <0.000019>
18:04:59.906895 poll([{fd=14, events=POLLIN}], 1, 120000) = 1 ([{fd=14, revents=POLLIN}]) <0.003565>
18:04:59.910493 recvfrom(14, ":1\r\n", 65536, 0, NULL, NULL) = 4 <0.000008>
18:04:59.910807 ioctl(14, FIONBIO, [1]) = 0 <0.000007>
18:04:59.910845 recvfrom(14, 0x55ed7f72e4d0, 65536, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.000006>
18:04:59.910886 ioctl(14, FIONBIO, [1]) = 0 <0.000006>
18:04:59.910945 poll([{fd=14, events=POLLOUT}], 1, 120000) = 1 ([{fd=14, revents=POLLOUT}]) <0.000007>
18:04:59.910977 sendto(14, "4\r\n$4\r\nZADD\r\n$18\r\nproduct_1:req"…, 1671, 0, NULL, 0) = 1671 <0.000024>
18:04:59.911041 poll([{fd=14, events=POLLIN}], 1, 120000) = 1 ([{fd=14, revents=POLLIN}]) <0.002541>
18:04:59.913617 recvfrom(14, ":1\r\n", 65536, 0, NULL, NULL) = 4 <0.000009>
18:04:59.915223 epoll_wait(5, [], 627, 0) = 0 <0.000007>
18:04:59.916142 epoll_wait(5, [], 627, 0) = 0 <0.000007>
18:04:59.916677 ioctl(14, FIONBIO, [1]) = 0 <0.000007>
18:04:59.916719 recvfrom(14, 0x55ed7f72e4d0, 65536, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.000007>
18:04:59.916763 ioctl(14, FIONBIO, [1]) = 0 <0.000006>
18:04:59.916831 poll([{fd=14, events=POLLOUT}], 1, 120000) = 1 ([{fd=14, revents=POLLOUT}]) <0.000007>
18:04:59.916865 sendto(14, "3\r\n$4\r\nSADD\r\n$20\r\nproduct_1:dup"…, 88, 0, NULL, 0) = 88 <0.000018>
18:04:59.916922 poll([{fd=14, events=POLLIN}], 1, 120000) = 1 ([{fd=14, revents=POLLIN}]) <0.002686>
18:04:59.919641 recvfrom(14, ":1\r\n", 65536, 0, NULL, NULL) = 4 <0.000008>
18:04:59.919924 ioctl(14, FIONBIO, [1]) = 0 <0.000007>
18:04:59.919960 recvfrom(14, 0x55ed7f72e4d0, 65536, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.000006>
18:04:59.920001 ioctl(14, FIONBIO, [1]) = 0 <0.000006>
18:04:59.920058 poll([{fd=14, events=POLLOUT}], 1, 120000) = 1 ([{fd=14, revents=POLLOUT}]) <0.000007>
18:04:59.920089 sendto(14, "4\r\n$4\r\nZADD\r\n$18\r\nproduct_1:req"…, 1678, 0, NULL, 0) = 1678 <0.000023>
18:04:59.920151 poll([{fd=14, events=POLLIN}], 1, 120000) = 1 ([{fd=14, revents=POLLIN}]) <0.002500>
18:04:59.922685 recvfrom(14, ":1\r\n", 65536, 0, NULL, NULL) = 4 <0.000007>
18:04:59.924295 epoll_wait(5, [], 627, 0) = 0 <0.000007>
18:04:59.924357 epoll_wait(5, [{EPOLLIN, {u32=496, u64=140496970187248}}], 627, 803) = 1 <0.251454>
18:05:00.175939 recvfrom(496, "\350\241\"6V\377o5@\30\20\34\351z\277i\347\374\213\255:\236)\352\205\343\336\25\35@x\310"…, 65536, 0, NULL, NULL) = 1216 <0.000010>
18:05:00.180772 epoll_ctl(5, EPOLL_CTL_MOD, 496, {EPOLLIN|EPOLLOUT, {u32=496, u64=140496970187248}}) = 0 <0.000015>
18:05:00.180903 epoll_wait(5, [{EPOLLOUT, {u32=496, u64=140496970187248}}, {EPOLLIN, {u32=369, u64=140496970187121}}], 627, 546) = 2 <0.000008>
18:05:00.180978 sendto(496, "\24\3\3\0\1\1\27\3\3\0E\375\310Y\265\342\223\6\226\301\252\373\356\317\271\26\250otI\316\370"…, 429, 0, NULL, 0) = 429 <0.000020>
18:05:00.181046 epoll_ctl(5, EPOLL_CTL_MOD, 496, {EPOLLIN, {u32=496, u64=140496970187248}}) = 0 <0.000006>
18:05:00.181097 recvfrom(369, "\27\3\3\1\n\2701\2575\335\30\246\350\\370^k&2OB\351Y\27\3679\27\357\23\243\244\322"…, 65536, 0, NULL, NULL) = 542 <0.000008>
18:05:00.181258 epoll_wait(5, [], 627, 546) = 0 <0.546960>
18:05:00.728448 epoll_wait(5, [], 627, 0) = 0 <0.000007>
18:05:00.728541 epoll_wait(5, ) = ?
18:05:01.812104 +++ killed by SIGKILL +++

To confirm which signal actually killed the crawler, I wrapped it in a small script that checks the exit status and dumps dmesg when it sees a SIGKILL:

#!/bin/bash
scrapy crawl product_1
ret=$?
#
#  returns > 127 are a SIGNAL
#
if [ $ret -gt 127 ]; then
        sig=$((ret - 128))
        echo "Got SIGNAL $sig"
        if [ $sig -eq $(kill -l SIGKILL) ]; then
                echo "process was killed with SIGKILL"
                dmesg > $HOME/dmesg-kill.log
        fi
fi

Result:
test_k.sh: line 3: 11181 Killed scrapy crawl product_1
Got SIGNAL 9
process was killed with SIGKILL

Let's first try turning off the OOM killer, as suggested here: https://stackoverflow.com/questions/25000496/python-script-terminated-by-sigkill-rather-than-throwing-memoryerror

You can add these lines to your /etc/sysctl.conf file to effectively disable the OOM killer:

vm.overcommit_memory = 2
vm.overcommit_ratio = 100

Later I noticed that the scrapy process had 5 threads:

(boss) [chen@VM_0_2_centos jingdong]$ strace -f -v -e trace=signal -tt -p 7585
strace: Process 7585 attached with 5 threads

(boss) [chen@VM_0_2_centos ~]$ ps auxw -T | fgrep 7585
chen      7585  7585 16.4  1.9 634056 158620 pts/0   Sl+  23:13   4:35 /home/chen/anaconda3/envs/boss/bin/python /home/chen/anaconda3/envs/boss/bin/scrapy crawl product_1
chen      7585  7588  0.0  1.9 634056 158620 pts/0   Sl+  23:13   0:00 /home/chen/anaconda3/envs/boss/bin/python /home/chen/anaconda3/envs/boss/bin/scrapy crawl product_1
chen      7585  7589  0.0  1.9 634056 158620 pts/0   Sl+  23:13   0:00 /home/chen/anaconda3/envs/boss/bin/python /home/chen/anaconda3/envs/boss/bin/scrapy crawl product_1
chen      7585  7592  0.0  1.9 634056 158620 pts/0   Sl+  23:13   0:00 /home/chen/anaconda3/envs/boss/bin/python /home/chen/anaconda3/envs/boss/bin/scrapy crawl product_1
chen      7585  7593  0.0  1.9 634056 158620 pts/0   Sl+  23:13   0:00 /home/chen/anaconda3/envs/boss/bin/python /home/chen/anaconda3/envs/boss/bin/scrapy crawl product_1
chen      8584  8584  0.4  0.0 126348  1036 pts/2    S+   23:36   0:01 strace -f -v -e trace=signal -tt -p 7585

That puzzled me: scrapy runs in a single thread and gets its concurrency from Twisted's asynchronous I/O, so where do 5 threads come from?
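
A quick way to see what those threads actually are is to dump their stacks with py-spy, which I was already using anyway (the PID is the main one from the ps output above):

py-spy dump --pid 7585

Twisted does keep a small worker thread pool of its own (name resolution, for instance, runs in threads), so a handful of threads in a "single-threaded" scrapy process is normal.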

https://stackoverflow.com/questions/14694582/stracing-to-attach-to-a-multi-threaded-process

strace -fp <pid> will show the trace of all threads owned by that process – even ones that process already has spawned before strace begins

So let's see what strace -fp 7585 prints. The result matched the earlier guess: it really is related to redis.

[pid  7585] sendto(17, "*2\r\n$5\r\nHVALS\r\n$4\r\nadsl\r\n", 25, 0, NULL, 0) = 25
[pid  7585] recvfrom(17, "*1\r\n$45\r\nczhen:a8h54edrwg@36.57."..., 65536, 0, NULL, NULL) = 56
[pid  7585] ioctl(17, FIONBIO, [1])     = 0
[pid  7585] recvfrom(17, 0x564aace95f30, 65536, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid  7585] ioctl(17, FIONBIO, [0])     = 0
[pid  7585] sendto(17, "*2\r\n$5\r\nHVALS\r\n$4\r\nadsl\r\n", 25, 0, NULL, 0) = 25
[pid  7585] recvfrom(17, "*1\r\n$45\r\nczhen:a8h54edrwg@36.57."..., 65536, 0, NULL, NULL) = 56
[pid  7585] ioctl(17, FIONBIO, [1])     = 0
[pid  7585] recvfrom(17, 0x564aace95f30, 65536, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid  7585] ioctl(17, FIONBIO, [0])     = 0
[pid  7585] sendto(17, "*2\r\n$5\r\nHVALS\r\n$4\r\nadsl\r\n", 25, 0, NULL, 0) = 25
[pid  7585] recvfrom(17,  <unfinished ...>
[pid  7593] <... select resumed> <unfinished ...>) = ?
[pid  7593] +++ killed by SIGKILL +++
[pid  7592] <... select resumed> <unfinished ...>) = ?
[pid  7589] <... select resumed> <unfinished ...>) = ?
[pid  7588] <... select resumed> <unfinished ...>) = ?
[pid  7585] <... recvfrom resumed> <unfinished ...>) = ?
[pid  7592] +++ killed by SIGKILL +++
[pid  7588] +++ killed by SIGKILL +++
[pid  7589] +++ killed by SIGKILL +++
+++ killed by SIGKILL +++
2020-12-18 23:52:39 VM_0_2_centos twisted[7585] CRITICAL Unhandled Error
Traceback (most recent call last):
  File "/home/chen/anaconda3/envs/boss/lib/python3.7/site-packages/scrapy/commands/crawl.py", line 27, in run
    self.crawler_process.start()
  File "/home/chen/anaconda3/envs/boss/lib/python3.7/site-packages/scrapy/crawler.py", line 327, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/home/chen/anaconda3/envs/boss/lib/python3.7/site-packages/twisted/internet/base.py", line 1283, in run
    self.mainLoop()
  File "/home/chen/anaconda3/envs/boss/lib/python3.7/site-packages/twisted/internet/base.py", line 1292, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "/home/chen/anaconda3/envs/boss/lib/python3.7/site-packages/twisted/internet/base.py", line 913, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/home/chen/anaconda3/envs/boss/lib/python3.7/site-packages/scrapy/utils/reactor.py", line 50, in __call__
    return self._func(*self._a, **self._kw)
  File "/home/chen/anaconda3/envs/boss/lib/python3.7/site-packages/scrapy/core/engine.py", line 124, in _next_request
    if not self._next_request_from_scheduler(spider):
  File "/home/chen/anaconda3/envs/boss/lib/python3.7/site-packages/scrapy/core/engine.py", line 151, in _next_request_from_scheduler
    request = slot.scheduler.next_request()
  File "/home/chen/anaconda3/envs/boss/lib/python3.7/site-packages/scrapy_redis/scheduler.py", line 172, in next_request
    request = self.queue.pop(block_pop_timeout)
  File "/home/chen/anaconda3/envs/boss/lib/python3.7/site-packages/scrapy_redis/queue.py", line 115, in pop
    results, count = pipe.execute()
  File "/home/chen/anaconda3/envs/boss/lib/python3.7/site-packages/redis/client.py", line 3967, in execute
    return execute(conn, stack, raise_on_error)
  File "/home/chen/anaconda3/envs/boss/lib/python3.7/site-packages/redis/client.py", line 3863, in _execute_transaction
    self.immediate_execute_command('DISCARD')
  File "/home/chen/anaconda3/envs/boss/lib/python3.7/site-packages/redis/client.py", line 3789, in immediate_execute_command
    return self.parse_response(conn, command_name, **options)
  File "/home/chen/anaconda3/envs/boss/lib/python3.7/site-packages/redis/client.py", line 3926, in parse_response
    self, connection, command_name, **options)
  File "/home/chen/anaconda3/envs/boss/lib/python3.7/site-packages/redis/client.py", line 892, in parse_response
    response = connection.read_response()
  File "/home/chen/anaconda3/envs/boss/lib/python3.7/site-packages/redis/connection.py", line 752, in read_response
    raise response
redis.exceptions.ResponseError: DISCARD without MULTI

Killed

So let's look at the redis log:

7949:C 18 Dec 23:22:12.386 * RDB: 70 MB of memory used by copy-on-write
1175:M 18 Dec 23:22:12.445 * Background saving terminated with success
1175:M 18 Dec 23:27:13.069 * 10 changes in 300 seconds. Saving...
1175:M 18 Dec 23:27:13.069 * Background saving started by pid 8147
8147:C 18 Dec 23:27:15.698 * DB saved on disk
8147:C 18 Dec 23:27:15.698 * RDB: 26 MB of memory used by copy-on-write
1175:M 18 Dec 23:27:15.787 * Background saving terminated with success
1175:M 18 Dec 23:32:16.092 * 10 changes in 300 seconds. Saving...
1175:M 18 Dec 23:32:16.092 * Background saving started by pid 8426
8426:C 18 Dec 23:32:18.643 * DB saved on disk
8426:C 18 Dec 23:32:18.644 * RDB: 28 MB of memory used by copy-on-write
1175:M 18 Dec 23:32:18.712 * Background saving terminated with success
1175:M 18 Dec 23:37:19.072 * 10 changes in 300 seconds. Saving...
1175:M 18 Dec 23:37:19.073 * Background saving started by pid 8633
8633:C 18 Dec 23:37:21.643 * DB saved on disk
8633:C 18 Dec 23:37:21.644 * RDB: 28 MB of memory used by copy-on-write
1175:M 18 Dec 23:37:21.704 * Background saving terminated with success
1175:M 18 Dec 23:42:22.031 * 10 changes in 300 seconds. Saving...
1175:M 18 Dec 23:42:22.031 * Background saving started by pid 8976
8976:C 18 Dec 23:42:24.615 * DB saved on disk
8976:C 18 Dec 23:42:24.615 * RDB: 26 MB of memory used by copy-on-write
1175:M 18 Dec 23:42:24.658 * Background saving terminated with success
1175:M 18 Dec 23:47:25.029 * 10 changes in 300 seconds. Saving...
1175:M 18 Dec 23:47:25.029 * Background saving started by pid 9177
9177:C 18 Dec 23:47:27.533 * DB saved on disk
9177:C 18 Dec 23:47:27.534 * RDB: 20 MB of memory used by copy-on-write
1175:M 18 Dec 23:47:27.572 * Background saving terminated with success
1175:M 18 Dec 23:52:28.074 * 10 changes in 300 seconds. Saving...
1175:M 18 Dec 23:52:28.074 # Can't save in background: fork: Cannot allocate memory
1175:M 18 Dec 23:52:34.025 * 10 changes in 300 seconds. Saving...
1175:M 18 Dec 23:52:34.025 # Can't save in background: fork: Cannot allocate memory
1175:M 18 Dec 23:52:40.072 * 10 changes in 300 seconds. Saving...
1175:M 18 Dec 23:52:40.073 * Background saving started by pid 9392
9392:C 18 Dec 23:52:42.520 * DB saved on disk
9392:C 18 Dec 23:52:42.520 * RDB: 24 MB of memory used by copy-on-write
1175:M 18 Dec 23:52:42.602 * Background saving terminated with success
1175:M 19 Dec 00:07:43.043 * 1 changes in 900 seconds. Saving...
1175:M 19 Dec 00:07:43.044 * Background saving started by pid 9996
9996:C 19 Dec 00:07:45.624 * DB saved on disk
9996:C 19 Dec 00:07:45.624 * RDB: 16 MB of memory used by copy-on-write
1175:M 19 Dec 00:07:45.661 * Background saving terminated with success

The fix: set maxmemory to 2gb in redis.conf. And remember to switch the sysctl settings to the following:

vm.overcommit_memory = 1
vm.overcommit_ratio = 100

The kernel parameter vm.overcommit_memory accepts three values:

  • 0 – Heuristic overcommit handling. This is the default: overcommit is allowed, but blatantly excessive overcommit gets rejected, for example a single malloc asking for more than the system's total memory. "Heuristic" means the kernel uses an algorithm to guess whether the allocation request is reasonable and refuses the overcommit if it decides it is not.
  • 1 – Always overcommit. Overcommit is allowed; every allocation request is accepted.
  • 2 – Don't overcommit. Overcommit is forbidden.
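
Applying the final combination only takes a couple of commands; roughly (CONFIG SET is the runtime alternative to editing redis.conf by hand):

# allow overcommit so the redis RDB fork succeeds even under memory pressure
echo 'vm.overcommit_memory = 1' >> /etc/sysctl.conf
sysctl -p                          # or apply immediately: sysctl -w vm.overcommit_memory=1

# cap redis at 2gb: either put "maxmemory 2gb" in redis.conf, or set it live
redis-cli CONFIG SET maxmemory 2gb
redis-cli CONFIG REWRITE           # persist the runtime change back into redis.conf

Note that vm.overcommit_ratio only takes effect in mode 2, so with overcommit_memory = 1 it can be left alone; redis itself warns at startup when overcommit_memory is 0, for exactly this background-save fork reason.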

Performance tuning for distributed scrapy

Crawling is I/O-bound, and a crawler system has to be designed around that. A few points worth keeping in mind:

  1. For scrapy, higher concurrency is not automatically better; it depends entirely on server bandwidth and on the quality and quantity of your proxy IPs. This article explains it well: https://my.oschina.net/u/4402731/blog/3568083 . Just set CLOSESPIDER_ITEMCOUNT = 100 and measure how long different parameter combinations take (an example run command is shown right before the first set of stats below).
  2. You have to account for the network latency between nodes. I've run scrapy from both my local laptop and remote servers while the master node (redis scheduling and dedup) was a Tencent Cloud server in Beijing; the local runs were visibly slower, because most of the time went into network round trips.
  3. A small crawl doesn't need a big cluster; the bottleneck is usually the proxy pool. If you need volume, building your own pool is a must. Mine is very stable: two ADSL VPSes redialing in rotation, one dial every 20 seconds, each dial taking about 6 seconds. The IPs go into redis with dedup, so repeats can be dropped, and each IP is stored together with a timestamp so callers know how much lifetime it has left. You can also keep a per-site IP blacklist, which is simple to implement: every time a site bans an IP, increment that IP's value by 1 (every site bans differently, so this takes some probing). The larger the value, the lower the IP's score and the lower its priority when handed out (this only pays off for a large pool; with just a few IPs it isn't worth it). Reset that hash periodically; sites don't ban an IP forever, bans are usually lifted after a while, otherwise they would hit too many real users. A redis-cli sketch of this bookkeeping follows right after this list.
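
The strace output earlier already showed this pool living in a redis hash named adsl (the HVALS adsl calls that return user:pass@ip entries). Below is a minimal redis-cli sketch of the bookkeeping described in point 3; the field names and the banned:jd key are made up for illustration:

# each redial refreshes the pool entry; keep a dial timestamp so callers can compute remaining lifetime
redis-cli HSET adsl vps1 "user:password@1.2.3.4:8888"
redis-cli HSET adsl vps1_ts "$(date +%s)"

# per-site blacklist: bump the counter every time this site bans the IP
redis-cli HINCRBY banned:jd "1.2.3.4:8888" 1

# when handing out proxies, read the counters and prefer IPs with the fewest bans
redis-cli HGETALL banned:jd

# reset the blacklist every so often, since bans are usually lifted after a while
redis-cli DEL banned:jd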

The tests below all start from download_timeout=2 and download_delay=0.
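
Each combination can be overridden per run without touching settings.py, using scrapy's -s flag; for example the first run below (CONCURRENT_REQUESTS stands in here for whatever concurrency setting is actually being tuned):

scrapy crawl product_1 \
    -s CLOSESPIDER_ITEMCOUNT=100 \
    -s CONCURRENT_REQUESTS=32 \
    -s DOWNLOAD_TIMEOUT=2 \
    -s DOWNLOAD_DELAY=0 \
    -s RETRY_TIMES=3
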
{'downloader/exception_count': 290,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 290,
 'downloader/request_bytes': 939629,
 'downloader/request_count': 2612,
 'downloader/request_method_count/GET': 2612,
 'downloader/response_bytes': 4586610,
 'downloader/response_count': 2322,
 'downloader/response_status_count/200': 2322,
 'elapsed_time_seconds': 228.855777, ######################
 'finish_reason': 'closespider_itemcount',
 'finish_time': datetime.datetime(2020, 12, 18, 5, 34, 52, 841964),
 'httpcache/firsthand': 2322,
 'httpcache/miss': 2612,
 'httpcache/store': 2322,
 'item_scraped_count': 104,

Concurrency 32 actually took twice as long as concurrency 16!

{'downloader/exception_count': 47,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 47,
 'downloader/request_bytes': 326088,
 'downloader/request_count': 917,
 'downloader/request_method_count/GET': 917,
 'downloader/response_bytes': 1413362,
 'downloader/response_count': 870,
 'downloader/response_status_count/200': 870,
 'elapsed_time_seconds': 94.203294,  ######################
 'finish_reason': 'closespider_itemcount',
 'finish_time': datetime.datetime(2020, 12, 18, 5, 49, 1, 169299),
 'httpcache/firsthand': 870,
 'httpcache/miss': 917,
 'httpcache/store': 870,
 'item_scraped_count': 100,

Concurrency 8 took a bit less time than concurrency 16 (120 seconds), but note that request_count dropped noticeably: normally it takes about 5 requests to produce one item (several JSON endpoints are called per product), and lowering the concurrency clearly cuts down the retries. Slow is faster!

Next, concurrency 16 with download_timeout=5, download_delay=0:
{'downloader/exception_count': 7,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 7,
 'downloader/request_bytes': 314651,
 'downloader/request_count': 896,
 'downloader/request_method_count/GET': 896,
 'downloader/response_bytes': 1474882,
 'downloader/response_count': 889,
 'downloader/response_status_count/200': 889,
 'elapsed_time_seconds': 69.643116,  ######################
 'finish_reason': 'closespider_itemcount',
 'finish_time': datetime.datetime(2020, 12, 18, 5, 56, 11, 42312),
 'httpcache/firsthand': 889,
 'httpcache/miss': 896,
 'httpcache/store': 889,
 'item_scraped_count': 106,
 'log_count/DEBUG': 1005,
 'log_count/INFO': 184,
 'memdebug/gc_garbage_count': 0,
 'memdebug/live_refs/Product1Spider': 1,
 'memdebug/live_refs/ProductItem': 7,
 'memdebug/live_refs/Request': 1,
 'memusage/max': 82305024,
 'memusage/startup': 55525376,
 'request_depth_max': 5,
 'response_received_count': 889,
 'retry/count': 7,
 'retry/reason_count/twisted.internet.error.TimeoutError': 7,
 'scheduler/dequeued/redis': 896,
 'scheduler/enqueued/redis': 1147,
 'start_time': datetime.datetime(2020, 12, 18, 5, 55, 1, 399196)}

Interesting: raising download_timeout clearly helps. Retries dropped to 7 and the whole run finished in about 70 seconds. Next, bump the concurrency to 32 while keeping download_timeout at 5 seconds:
{'downloader/exception_count': 61,
 'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 13,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 48,
 'downloader/request_bytes': 454285,
 'downloader/request_count': 1329,
 'downloader/request_method_count/GET': 1329,
 'downloader/response_bytes': 2199392,
 'downloader/response_count': 1268,
 'downloader/response_status_count/200': 1268,
 'elapsed_time_seconds': 132.157984, ######################
 'finish_reason': 'closespider_itemcount',
 'finish_time': datetime.datetime(2020, 12, 18, 6, 4, 29, 718108),
 'httpcache/firsthand': 1268,
 'httpcache/miss': 1329,
 'httpcache/store': 1268,
 'item_scraped_count': 101,
 'log_count/DEBUG': 1433,
 'log_count/INFO': 229,
 'log_count/WARNING': 6,
 'memdebug/gc_garbage_count': 0,
 'memdebug/live_refs/Product1Spider': 1,
 'memdebug/live_refs/ProductItem': 52,
 'memdebug/live_refs/Request': 2,
 'memusage/max': 100614144,
 'memusage/startup': 55521280,
 'request_depth_max': 5,
 'response_received_count': 1268,
 'retry/count': 61,
 'retry/reason_count/twisted.internet.error.ConnectionRefusedError': 13,
 'retry/reason_count/twisted.internet.error.TimeoutError': 48,
 'scheduler/dequeued/redis': 1329,
 'scheduler/enqueued/redis': 1548,
 'start_time': datetime.datetime(2020, 12, 18, 6, 2, 17, 560124)}

So raising the concurrency to 32 actually reduced efficiency: 61 retries and 132 seconds, slightly worse even than the concurrency-16 / 2-second-timeout combination. Next, try concurrency 32 with a 1-second timeout and RETRY_TIMES raised from the previous 3 to 6:
{'downloader/exception_count': 669,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 669,
 'downloader/request_bytes': 729406,
 'downloader/request_count': 2021,
 'downloader/request_method_count/GET': 2021,
 'downloader/response_bytes': 2514467,
 'downloader/response_count': 1352,
 'downloader/response_status_count/200': 1352,
 'elapsed_time_seconds': 181.738274, ######################
 'finish_reason': 'closespider_itemcount',
 'finish_time': datetime.datetime(2020, 12, 18, 6, 13, 34, 245444),
 'httpcache/firsthand': 1352,
 'httpcache/miss': 2021,
 'httpcache/store': 1352,
 'item_scraped_count': 101,
 'log_count/DEBUG': 2125,
 'log_count/ERROR': 2,
 'log_count/INFO': 296,
 'memdebug/gc_garbage_count': 0,
 'memdebug/live_refs/Product1Spider': 1,
 'memdebug/live_refs/ProductItem': 2,
 'memdebug/live_refs/Request': 1,
 'memusage/max': 112705536,
 'memusage/startup': 55529472,
 'request_depth_max': 5,
 'response_received_count': 1352,
 'retry/count': 669,
 'retry/reason_count/twisted.internet.error.TimeoutError': 669,
 'scheduler/dequeued/redis': 2021,
 'scheduler/enqueued/redis': 2576,
 'spider_exceptions/KeyError': 2,
 'start_time': datetime.datetime(2020, 12, 18, 6, 10, 32, 507170)}

This run was a real failure, mainly because I only have two dial-up VPSes; with a bigger pool it would look different. So far, concurrency 16 with download_timeout=5 and download_delay=0 is the best combination. Let's also test download_delay: concurrency 16, download_timeout=5, download_delay=0.1:
{'downloader/exception_count': 50,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 50,
 'downloader/request_bytes': 281990,
 'downloader/request_count': 802,
 'downloader/request_method_count/GET': 802,
 'downloader/response_bytes': 1189507,
 'downloader/response_count': 752,
 'downloader/response_status_count/200': 752,
 'elapsed_time_seconds': 138.592611, ######################
 'finish_reason': 'closespider_itemcount',
 'finish_time': datetime.datetime(2020, 12, 18, 6, 20, 56, 171019),
 'httpcache/firsthand': 752,
 'httpcache/miss': 802,
 'httpcache/store': 752,
 'item_scraped_count': 102,
 'log_count/DEBUG': 907,
 'log_count/INFO': 298,
 'memdebug/gc_garbage_count': 0,
 'memdebug/live_refs/Product1Spider': 1,
 'memdebug/live_refs/ProductItem': 3,
 'memdebug/live_refs/Request': 2,
 'memusage/max': 72540160,
 'memusage/startup': 55529472,
 'request_depth_max': 5,
 'response_received_count': 752,
 'retry/count': 50,
 'retry/reason_count/twisted.internet.error.TimeoutError': 50,
 'scheduler/dequeued/redis': 802,
 'scheduler/enqueued/redis': 912,
 'start_time': datetime.datetime(2020, 12, 18, 6, 18, 37, 578408)}

As expected, adding a download_delay is pointless here: with concurrency 16, download_timeout=5 and download_delay=0 the retry count was already very low, and the fact that retries went up in this run is probably just noise from the proxies. A redial happens every 20 seconds, and a delay of 0.1 caps the spider at roughly 10 requests per second, somewhat less than the previous best 16-concurrency setup, which hurts throughput noticeably.

Tuning a bit further: concurrency 20, download_timeout=5, download_delay=0:
{'downloader/exception_count': 12,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 12,
 'downloader/request_bytes': 323044,
 'downloader/request_count': 908,
 'downloader/request_method_count/GET': 908,
 'downloader/response_bytes': 1302214,
 'downloader/response_count': 896,
 'downloader/response_status_count/200': 896,
 'elapsed_time_seconds': 59.553894,  ######################
 'finish_reason': 'closespider_itemcount',
 'finish_time': datetime.datetime(2020, 12, 18, 6, 45, 3, 462824),
 'httpcache/firsthand': 896,
 'httpcache/miss': 908,
 'httpcache/store': 896,
 'item_scraped_count': 105,
 'log_count/DEBUG': 1016,
 'log_count/INFO': 188,
 'log_count/WARNING': 6,
 'memdebug/gc_garbage_count': 0,
 'memdebug/live_refs/Product1Spider': 1,
 'memdebug/live_refs/ProductItem': 56,
 'memdebug/live_refs/Request': 1,
 'memusage/max': 55513088,
 'memusage/startup': 55513088,
 'request_depth_max': 5,
 'response_received_count': 896,
 'retry/count': 12,
 'retry/reason_count/twisted.internet.error.TimeoutError': 12,
 'scheduler/dequeued/redis': 908,
 'scheduler/enqueued/redis': 941,
 'start_time': datetime.datetime(2020, 12, 18, 6, 44, 3, 908930)}

Concurrency 20 with download_timeout=5, download_delay=0 shaved about 10 seconds off the concurrency-16 combination! More importantly, retries barely went up (12), and that is the key metric. Now try concurrency 24, download_timeout=5, download_delay=0:
{'downloader/exception_count': 30,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 30,
 'downloader/request_bytes': 310945,
 'downloader/request_count': 876,
 'downloader/request_method_count/GET': 876,
 'downloader/response_bytes': 1312479,
 'downloader/response_count': 846,
 'downloader/response_status_count/200': 846,
 'elapsed_time_seconds': 80.251456,
 'finish_reason': 'closespider_itemcount',
 'finish_time': datetime.datetime(2020, 12, 18, 6, 53, 0, 947113),
 'httpcache/firsthand': 846,
 'httpcache/miss': 876,
 'httpcache/store': 846,
 'item_scraped_count': 104,
 'log_count/DEBUG': 983,
 'log_count/INFO': 199,
 'log_count/WARNING': 7,
 'memdebug/gc_garbage_count': 0,
 'memdebug/live_refs/Product1Spider': 1,
 'memdebug/live_refs/ProductItem': 65,
 'memdebug/live_refs/Request': 1,
 'memusage/max': 86376448,
 'memusage/startup': 55500800,
 'request_depth_max': 5,
 'response_received_count': 846,
 'retry/count': 30,
 'retry/reason_count/twisted.internet.error.TimeoutError': 30,
 'scheduler/dequeued/redis': 876,
 'scheduler/enqueued/redis': 989,
 'start_time': datetime.datetime(2020, 12, 18, 6, 51, 40, 695657)}

In fact the first run at this setting took 100 seconds, this one 80, and a later one 70; between proxy fluctuations and differing pages the numbers swing quite a bit. Still, it's clear that concurrency 24 no longer has any real advantage over 20. Since each configuration wasn't run many times, the conclusions above carry some error, but the method is sound: this is exactly the kind of iterative tuning scrapy needs.

These tests were all run with only two dial-up VPSes, so on any retry there is at least a 50% chance of landing on the same IP again, which makes retries fail very often; with a large IP pool the strategy would have to change. The target here is JD.com, and a single IP seems to top out around concurrency 16: push higher and the failure rate from hammering the site in a short window climbs steeply, while retrying is pointless with so few proxy IPs. That's why I say the bottleneck of a crawler is very often the proxy IPs. With enough IPs, one or two reasonably powerful servers could pull down JD's 5-million-plus SKUs within a day, because the JSON endpoints used here need very little bandwidth; a JSON response is only a few KB to a dozen KB.

I've read plenty of articles on this topic, most of them bland copy-paste. For many sites a large proxy pool is simply the best answer; unless every single request triggers a captcha, there is no need to obsess over cracking captchas. Fully breaking a captcha is very hard and the bar is anything but low: much of the JavaScript involved runs to thousands of lines and is obfuscated. That effort is better spent buying a few more dial-up servers. For large-scale crawling, a few dozen rotating dial-up boxes cost only about 45 RMB per box per month, and at that volume the provider will usually build you a machine image, so deployment becomes a one-time effort.

Crawling is a process of continuous tuning. Very high concurrency looks great on paper, but anyone who has tried it knows better: the higher your concurrency, the faster your IPs get banned. A site with decent defenses may ban your high-concurrency traffic within 10-20 seconds, unless you are on high-quality IPs. Tencent Cloud IPs, for example, could sustain concurrency 16-32 against JD for about two minutes in my tests, and after a ban, sleeping for a while let them work again for a stretch (I didn't test this for long, since being banned makes testing awkward). Dial-up IPs have mostly been worn out by everyone else; they only still work because their previous bans have expired, and they don't stay usable for long.