Deploying a Splash Service Cluster with Docker Swarm + HAProxy Load Balancing

Building a multi-service cluster with docker stack deploy and load balancing it with HAProxy

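There is a very good project on GitHub for running multiple Splash Docker containers on one machine, but it only works on a single host and cannot handle a cluster. This post describes a way to deploy Splash on a cluster. A long Google search turned up no ready-made solution, so I worked one out myself.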

https://github.com/TeamHG-Memex/aquarium

Platforms: Windows 10, CentOS 7.6

Approach: use a yaml file to deploy multiple services (Visualizer, Splash 3.5.0, HAProxy) on a Docker Swarm cluster

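Why yml? Look at docker service create below: it builds only one service at a time, so setting up several services is inconvenient, and customizing parameters is harder.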

Building the cluster

(boss) [chen@VM_0_3_centos aquarium]$ docker service create -p 8050:8050 --replicas 6 --name splash scrapinghub/splash /bin/bash
j3qtsdci0qum515nzoxpxu7u4
overall progress: 6 out of 6 tasks
1/6: running [==================================================>]
2/6: running [==================================================>]
3/6: running [==================================================>]
4/6: running [==================================================>]
5/6: running [==================================================>]
6/6: running [==================================================>]
verify: Service converged

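Prerequisite: docker-ce (on Linux it must be installed separately; installing the plain docker package is not enough) or Docker for Windows, which ships as an integrated installer on Windows.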

version: "3"

services:
    visualizer:
        image: dockersamples/visualizer
        volumes:
            - "/var/run/docker.sock:/var/run/docker.sock"
        ports:
            - 8080:8080
        deploy:
            placement:
                constraints: [node.role == manager]
            # labels:
            #     - com.df.notify=true
            #     - com.df.serviceDomain=visualizer.youclk.com
            #     - com.df.port=8080
            #     - com.df.usersSecret=admin

The splash service definition (nested under services: in the same yml file) is as follows:

splash:
    image: scrapinghub/splash
    ports:
        - 8050:8050

    deploy:
        mode: replicated
        replicas: 2
        labels: [APP=SPLASH]
        # service resource management
        resources:
            # Hard limit - Docker does not allow to allocate more
            limits:
                cpus: '0.25'
                memory: 2048M
            # Soft limit - Docker makes best effort to return to it
            reservations:
                cpus: '0.25'
                memory: 2560M
        # service restart policy
        restart_policy:
            condition: any
            delay: 5s
            max_attempts: 3
            window: 120s

        # placement constraint - in this case on 'worker' nodes only
        placement:
            constraints: [node.role == worker]

(boss) [chen@VM_0_3_centos aquarium]$ docker stack deploy -c docker-compose.yml splash
Updating service splash_visualizer (id: lcfw3l45xkvewmly5yz7ukywh)
Updating service splash_splash (id: v1s5gbl9nfzmmp1hct6c8okce)

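If the stack was deployed before, rerunning the command updates the existing services, which makes scaling easy. Because replicas is 2, each worker server got only one Splash container.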

I then changed replicas to 4, and the Visualizer monitoring page showed 4 containers:

Looking at one of the servers, you can see two containers created at different times:

(boss) [chen@VM_0_2_centos ~]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0ae3f202164d scrapinghub/splash:latest "python3 /app/bin/sp…" 3 minutes ago Up 3 minutes 8050/tcp splash_splash.1.kpj3rv7piojx4yq3h5w3vamtv
9d353fe1bd6a scrapinghub/splash:latest "python3 /app/bin/sp…" 27 minutes ago Up 26 minutes 8050/tcp splash_splash.3.bl3amsvyticoje9vv3yethp4c

So why call it highly available? Stop or restart one container on VM_0_2_centos, and the cluster creates a replacement container:

(boss) [chen@VM_0_2_centos ~]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0ae3f202164d scrapinghub/splash:latest "python3 /app/bin/sp…" 3 minutes ago Up 3 minutes 8050/tcp splash_splash.1.kpj3rv7piojx4yq3h5w3vamtv
9d353fe1bd6a scrapinghub/splash:latest "python3 /app/bin/sp…" 27 minutes ago Up 26 minutes 8050/tcp splash_splash.3.bl3amsvyticoje9vv3yethp4c
(boss) [chen@VM_0_2_centos ~]$ docker stop 0ae3f202164d
0ae3f202164d
(boss) [chen@VM_0_2_centos ~]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d9253824289f scrapinghub/splash:latest "python3 /app/bin/sp…" 39 seconds ago Up 33 seconds 8050/tcp splash_splash.1.fo051grhynfybzbh407a1hith
9d353fe1bd6a scrapinghub/splash:latest "python3 /app/bin/sp…" 30 minutes ago Up 29 minutes 8050/tcp splash_splash.3.bl3amsvyticoje9vv3yethp4c

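The cluster works to keep 4 containers running at all times. If one dies or is restarted, a replacement is created immediately; if a whole server goes down, the containers it was running are recreated on the other servers so that the total stays at 4. The only precondition, of course, is that not all of your servers are down.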
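
The behaviour described above is a desired-state loop: compare what is running with what was requested and fill the gap on healthy nodes. The following is only a toy model of that idea, not Swarm's actual scheduler; the function name reconcile, the node names and the "least-loaded node first" rule are all assumptions made for illustration.

```python
def reconcile(placement, healthy_nodes, desired):
    """Return a task -> node mapping with `desired` tasks, all on healthy nodes.

    Toy model only: it adds replacements but never removes surplus tasks.
    """
    if not healthy_nodes:
        raise RuntimeError("no healthy nodes left to schedule on")

    # Keep only the tasks whose node is still healthy.
    alive = {task: node for task, node in placement.items() if node in healthy_nodes}

    # Count how many tasks each healthy node currently runs.
    load = {node: 0 for node in healthy_nodes}
    for node in alive.values():
        load[node] += 1

    # Create replacement tasks on the least-loaded node until the target is met.
    next_id = max(placement, default=0) + 1
    while len(alive) < desired:
        node = min(sorted(load), key=load.get)
        alive[next_id] = node
        load[node] += 1
        next_id += 1
    return alive


# Hypothetical cluster: VM_0_2 goes down, VM_0_4 and VM_0_5 stay healthy.
before = {1: "VM_0_2", 2: "VM_0_4", 3: "VM_0_2", 4: "VM_0_4"}
after = reconcile(before, healthy_nodes={"VM_0_4", "VM_0_5"}, desired=4)
print(after)
```

The two tasks that lived on the failed node are replaced on the remaining nodes, so the total returns to 4.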

This post does not go into detail on creating the swarm itself, since guides for that are everywhere online. Briefly:

Create the manager node:

docker swarm init --advertise-addr=MANAGER_IP

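Running this command prints a token. Run the following command with that token on each worker node to join the cluster. It is very convenient, and you do not need to worry about networking: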

docker swarm join --token SWMTKN-1-0zhhp71dfw77axxiuoixrechl8kcd9tapi9tbqskgtvt9yxxxxxxxxxxxxxxxxxxxxxxx --advertise-addr=WORKER_IP MANAGER_IP:2377

HAProxy load balancing

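Swarm reportedly has its own load balancing, but in my tests it was limited to a single node, so another load-balancing layer is still needed in front of it to serve external traffic. For background, see this video by a Microsoft engineer on YouTube: https://www.youtube.com/watch?v=ZfMV5JmkWCY&t=170s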

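There are three videos in the series; the third covers load balancing for external access. Here I use a low-powered Tencent Cloud server as the load balancer; it is also the manager node of the Docker cluster.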

HAProxy stats page, and the configuration behind it:
# HAProxy 1.7 config for Splash. It assumes Splash instances are executed
# on the same machine and connected to HAProxy using Docker links.
global
    # raise it if necessary
    maxconn 512
    # required for stats page
    stats socket /tmp/haproxy

userlist users
    user USER insecure-password PASSWD

defaults
    log global
    mode http

    # remove requests from a queue when clients disconnect;
    # see https://cbonte.github.io/haproxy-dconv/1.7/configuration.html#4.2-option%20abortonclose
    option abortonclose

    # gzip can save quite a lot of traffic with json, html or base64 data
    # compression algo gzip
    compression type text/html text/plain application/json

    # increase these values if you want to
    # allow longer request queues in HAProxy
    timeout connect 3600s
    timeout client 3600s
    timeout server 3600s


# visit 0.0.0.0:8036 to see HAProxy stats page
listen stats
    bind *:8036
    mode http
    stats enable
    stats hide-version
    stats show-legends
    stats show-desc Splash Cluster
    stats uri /
    stats refresh 10s
    stats realm Haproxy\ Statistics
    stats auth    admin:adminpass


# Splash Cluster configuration
# HAProxy listens on port 8050 on all interfaces
frontend http-in
    bind *:8050
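    # Splash access authentication is enabled here: requests must pass
    # HTTP basic auth against the "users" userlist (USER / PASSWD above).
    # To disable it, comment out the acl/http-request/use_backend lines
    # below and keep only default_backend splash-cluster.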
    acl auth_ok http_auth(users)
    http-request auth realm Splash if !auth_ok
    http-request allow if auth_ok
    http-request deny

    acl staticfiles path_beg /_harviewer/
    acl misc path / /info /_debug /debug

    use_backend splash-cluster if auth_ok !staticfiles !misc
    use_backend splash-misc if auth_ok staticfiles
    use_backend splash-misc if auth_ok misc
    default_backend splash-cluster


backend splash-cluster
    option httpchk GET /
    balance leastconn

    # try another instance when connection is dropped
    retries 2
    option redispatch
    # Replace the addresses below with your own Splash service IPs and ports
    # Add the remaining Splash servers one by one in the same format
    server splash-0 SPLASH0_IP:8050 check maxconn 50 inter 2s fall 10 observe layer4
    server splash-1 SPLASH1_IP:8050 check maxconn 50 inter 2s fall 10 observe layer4

backend splash-misc
    balance roundrobin
    # Replace the addresses below with your own Splash service IPs and ports
    # Add the remaining Splash servers one by one in the same format
    server splash-0 SPLASH0_IP:8050 check fall 15
    server splash-1 SPLASH1_IP:8050 check fall 15
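
Since the server lines in both backends follow a fixed pattern, they can be generated rather than typed by hand when the number of Splash nodes grows. This is a minimal sketch; the function name server_lines and the example IP addresses are made up for illustration, and the default options simply repeat those used in the splash-cluster backend above.

```python
def server_lines(hosts, port=8050,
                 options="check maxconn 50 inter 2s fall 10 observe layer4"):
    """Build one HAProxy `server` line per Splash host, named splash-0, splash-1, ..."""
    return [
        f"    server splash-{index} {host}:{port} {options}"
        for index, host in enumerate(hosts)
    ]


# Hypothetical worker IPs; replace them with your own Splash nodes.
workers = ["10.0.0.2", "10.0.0.3", "10.0.0.4"]

# Lines for the splash-cluster backend.
print("\n".join(server_lines(workers)))

# Lines for the splash-misc backend, which uses lighter options.
print("\n".join(server_lines(workers, options="check fall 15")))
```

The output can be pasted under the matching backend section of the config.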

Key point: scrapy crawl SPIDER -a http_user='USER' -a http_pass='PASSWD'

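Once Splash sits behind authentication, how to use it from a Scrapy project becomes a real question, and this command line solves it. Most posts online just copy one another, and hardly any of them mention this key point. If, like me, you want multiple processes, with several Scrapy processes on one server, the following script solves that: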

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from multiprocessing import Pool
import os, time, random, multiprocessing

def long_time_task(name):
    # Catching the exception here lets the worker return normally, so the main process is not blocked forever
    pid = os.getpid()
    try:
        print('pid:%d' % pid)
        print('Run task %s (%s)...' % (name, os.getpid()))
        start = time.time()
        process = CrawlerProcess(get_project_settings())
        process.crawl('lagouspd', domain='lagou.com', http_user='USER', http_pass='PASSWD')
        # Blocks here while the reactor runs; with stop_after_crawl=False the
        # reactor is not stopped automatically when the crawl finishes
        process.start(stop_after_crawl=False)
        end = time.time()
        print('Task %s runs %0.2f seconds.' % (name, (end - start)))
        time.sleep(2)
    except KeyboardInterrupt:
        print('Process %d was interrupted...' % pid)



if __name__=='__main__':
    print('Parent process %s.' % os.getpid())
    try:
        p = Pool(multiprocessing.cpu_count()*4)
        for i in range(multiprocessing.cpu_count()*4):
            p.apply_async(long_time_task, args=(i,))
            time.sleep(random.random() * 3)
        print('Waiting for all subprocesses done...')
        p.close()
        p.join()
        print('All subprocesses done.')
    except KeyboardInterrupt:
        print('caught KeyboardInterrupt')
        pid = os.getpid()
        # Not needed on Unix; on Windows this force-kills the current process
        if os.name == 'nt':
            os.popen('taskkill.exe /f /pid:%d' % pid)
    except Exception as e:
        print(e)
    else:
        print('quit normally')        

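The username and password in the script are the USER and PASSWD configured above; without them Splash cannot be used. The script also solves the problem that a crawler running in a multiprocessing pool cannot be stopped with Ctrl+C.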
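
For context on why http_user and http_pass work: as far as I understand, Scrapy's built-in HttpAuthMiddleware reads these two spider attributes and attaches an HTTP Basic Authorization header, which is what the HAProxy userlist checks. The sketch below only reproduces that header with the standard library so you can compare it with what HAProxy receives; the helper name basic_auth_header is my own, and USER / PASSWD are the placeholders from the config.

```python
import base64


def basic_auth_header(user, password):
    """Return the value of an HTTP Basic `Authorization` header."""
    # Basic auth is simply base64("user:password") prefixed with "Basic ".
    raw = f"{user}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


header = basic_auth_header("USER", "PASSWD")
print("Authorization:", header)
```

A request carrying this header passes the auth_ok ACL; a request without it gets the 401 challenge from the frontend.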

Docker, Scrapy, Redis, Splash: crawler pitfalls

First, the conclusion: many of these problems are caused by the Great Firewall.

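1. Without a properly configured proxy, Docker is very painful to use. I use Proxychains with a local socks5 proxy; otherwise everything is agonizingly slow. For temporary use you can also set the proxy ad hoc, in the style of an export; on Ubuntu, for example: sudo apt-get -o Acquire::socks::proxy="socks://127.0.0.1:8399/" update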

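2. After installing Splash in Docker you will find that its web UI cannot render some sites. The cause is that the GitHub source references some JS files hosted abroad, and those are blocked. Yes, even those get blocked. See the link below; the fix is mainly to replace two files referenced in resources.py with http://libs.baidu.com/jquery/1.11.1/jquery.min.js, https://ajax.aspnetcdn.com/ajax/jquery.migrate/jquery-migrate-1.2.1.js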

docker ps

docker exec -u 0 -it ed39a8b02925 /bin/bash

https://blog.csdn.net/qq_42078965/article/details/108812636

Speaking of proxies, the following commands route GitHub traffic through a socks5 proxy; otherwise GitHub is too slow:

git config --global http.proxy 'socks5://127.0.0.1:1080'
git config --global https.proxy 'socks5://127.0.0.1:1080'

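3. WSL has memory management problems: after Splash runs for a while, the machine runs out of memory. My current workaround is a script that monitors system memory and restarts the container: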

import psutil
import os 
import time
import subprocess

def id_find():
    tmp = str(subprocess.check_output('sudo docker ps', shell=True)).split('\\n')
    print(tmp)
    if tmp[1] ==  "'":
        print('no container_id exists')
        return 0
    else:        
        print('one container_id found')
        container_id = tmp[1].split(' ')[0]
        print('container_id 1:', container_id)
        return container_id    
    
while True:
    mem = psutil.virtual_memory()
    total = str(round(mem.total / 1024 / 1024))
    # bytes / 1024 gives KB, / 1024 again gives MB; round(), then convert to a string
    used = str(round(mem.used / 1024 / 1024))
    use_per = str(round(mem.percent))
    free = str(round(mem.free / 1024 / 1024))
    print("Total memory: " + total + "M")
    print("Used: " + used + "M(" + use_per + "%)")
    print("Free memory: " + free + "M")
    container_id = id_find()
    print('container_id return:', container_id)
    if container_id == 0:        
        time.sleep(3)
        os.system('sudo docker run -itd -v ~/default.ini -p 8050:8050 scrapinghub/splash /bin/bash --max-timeout=3600')
        time.sleep(5)
        # os.system('sudo docker container prune') # remove all stopped containers. CAUTION!!!!
    else:
        print('one container_id found: ', container_id)
        if int(use_per) > 90:
            os.system(f'sudo docker restart {container_id}')
            # os.system(f'sudo docker run -itd -v lagoucrawl\default.ini -p 8050:8050 scrapinghub/splash /bin/bash --max-timeout=600')
            time.sleep(3)
            # os.system('sudo docker container prune') # remove all stopped containers. CAUTION!!!!
        else:
            time.sleep(3)
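
The id_find function above parses the human-readable docker ps table, which is fragile. A possibly sturdier variant, sketched here under the assumption that the docker CLI is on the PATH, is to ask for IDs only with docker ps -q and split the output by line. The helper names parse_container_ids and running_container_ids are my own.

```python
import subprocess


def parse_container_ids(output):
    """Turn the text printed by `docker ps -q` into a list of container IDs."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def running_container_ids():
    """Ask the local Docker daemon for the IDs of running containers."""
    # -q prints one container ID per line and nothing else.
    result = subprocess.run(
        ["docker", "ps", "-q"],
        capture_output=True, text=True, check=True,
    )
    return parse_container_ids(result.stdout)


# Parsing can be checked without Docker by feeding it sample output.
print(parse_container_ids("0ae3f202164d\n9d353fe1bd6a\n"))
```

An empty list then means no container is running, which replaces the comparison against the stray quote character in the original.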

The default.ini mounted by the monitoring script is my Dobel Cloud proxy profile; its contents are:

[proxy]
   
host=http-proxy-t1.dobel.cn
port=9180

username=BOSSZxxxxxxx
password=yyyyyyy

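Because I use the scrapy-redis-splash stack, Scrapy's usual proxy mechanism cannot be applied directly. In practice, this proxy profile has proven to be an effective, and also the simplest, way to proxy Splash.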
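
A typo in the profile is easy to miss, so it can help to sanity-check the file before mounting it into the container. The sketch below parses the same [proxy] section with Python's configparser; the function name load_proxy_profile is hypothetical, and the credentials are the masked placeholders from the file above.

```python
import configparser

# Same content as default.ini above, with the masked credentials kept as-is.
PROFILE_TEXT = """
[proxy]
host=http-proxy-t1.dobel.cn
port=9180
username=BOSSZxxxxxxx
password=yyyyyyy
"""


def load_proxy_profile(text):
    """Parse a proxy profile and return its [proxy] settings as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["proxy"]  # raises KeyError if the section is missing
    return {
        "host": section["host"],
        "port": section.getint("port"),  # raises ValueError if not a number
        "username": section.get("username"),
        "password": section.get("password"),
    }


profile = load_proxy_profile(PROFILE_TEXT)
print(profile["host"], profile["port"])
```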

While on the subject, here is how to speed up Docker image pulls (using Alibaba Cloud as an example):

{
  "registry-mirrors": [
    "https://qbsssss.mirror.aliyuncs.com"
  ],
  "insecure-registries": [],
  "debug": false,
  "experimental": false,
  "features": {
    "buildkit": true
  }
}

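The mirror URL above can be found after logging in to Alibaba Cloud; a quick Google search shows where. One pitfall: if you log in with an international Alibaba Cloud account, you will never find this URL!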
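
If a daemon.json already exists, the mirror should be merged into it rather than overwriting the file. The following is a minimal sketch of that merge; the function name add_registry_mirror is my own, and the mirror URL is the masked example from above. On Linux the file usually lives at /etc/docker/daemon.json, while Docker Desktop exposes the same JSON in its settings.

```python
import json


def add_registry_mirror(config, mirror_url):
    """Return a copy of a daemon.json dict with `mirror_url` in registry-mirrors."""
    updated = dict(config)
    mirrors = list(updated.get("registry-mirrors", []))
    if mirror_url not in mirrors:  # avoid duplicate entries on repeated runs
        mirrors.append(mirror_url)
    updated["registry-mirrors"] = mirrors
    return updated


# Hypothetical existing configuration without any mirror.
existing = {"debug": False, "features": {"buildkit": True}}
merged = add_registry_mirror(existing, "https://qbsssss.mirror.aliyuncs.com")
print(json.dumps(merged, indent=2))
```

The Docker daemon has to be restarted before a changed daemon.json takes effect.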

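Because of the GFW, it is hard for many engineers in China to get the latest technical material. Taking proxies alone, every tool is configured differently: github, wget, docker, and so on. Even proxychains is not a one-stop solution.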

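4. Redis looks simple, but it has plenty of pitfalls too. For example, after force-stopping the crawler you may need to delete the requests stored in Redis before restarting; otherwise the crawler may not be able to continue.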
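
A sketch of that cleanup step follows. It assumes the scrapy-redis default queue key pattern "<spider>:requests"; if your settings override the scheduler queue key, adjust requests_key accordingly. The helper names are hypothetical, and InMemoryClient is only a stand-in so the logic can be tried without a server; a real redis-py client offers a delete(*names) method with the same shape.

```python
class InMemoryClient:
    """Tiny stand-in for a Redis client, used only to dry-run the helper."""

    def __init__(self, keys):
        self.keys = set(keys)

    def delete(self, *names):
        # Mirror the Redis behaviour of returning how many keys were removed.
        removed = [name for name in names if name in self.keys]
        self.keys.difference_update(removed)
        return len(removed)


def requests_key(spider_name):
    # Assumption: scrapy-redis default queue key, "<spider>:requests".
    return f"{spider_name}:requests"


def clear_pending_requests(client, spider_name):
    """Delete the pending-request queue of one spider; return keys removed."""
    return client.delete(requests_key(spider_name))


# Dry run with the spider name used earlier in this post.
client = InMemoryClient({"lagouspd:requests", "lagouspd:dupefilter"})
print(clear_pending_requests(client, "lagouspd"))
```

Only the request queue is touched here; other keys kept by the spider are left alone.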