Load-balancing Splash with HAProxy

I originally set up load balancing with Nginx, but only the paid version of Nginx supports an internal queue, so I redid the load balancing with HAProxy.

Goal: lighten the load on each Splash node and make better use of resources (a Splash restart is no longer a problem!).

HA-Proxy version 1.5.18 2016/05/10. Why mention the version? Because on Linux every piece of software has its version quirks: for example, Cui Qingcai's guide would not start for me on the latest 2.3 release, since newer versions no longer use the init.d script:

https://cuiqingcai.com/4826.html

Compiling from source is also a hassle, so in the end I took the lazy route of yum install haproxy; after that, all that is left is writing the configuration file:

# HAProxy 1.7 config for Splash. It assumes Splash instances are executed
# on the same machine and connected to HAProxy using Docker links.
global
    # raise it if necessary
    maxconn 512
    # required for stats page
    stats socket /tmp/haproxy

userlist users
    user user insecure-password userpass

defaults
    log global
    mode http

    # remove requests from a queue when clients disconnect;
    # see https://cbonte.github.io/haproxy-dconv/1.7/configuration.html#4.2-option%20abortonclose
    option abortonclose

    # gzip can save quite a lot of traffic with json, html or base64 data
    # compression algo gzip
    compression type text/html text/plain application/json

    # increase these values if you want to
    # allow longer request queues in HAProxy
    timeout connect 3600s
    timeout client 3600s
    timeout server 3600s


# visit 0.0.0.0:8036 to see HAProxy stats page
listen stats
    bind *:8036
    mode http
    stats enable
    stats hide-version
    stats show-legends
    stats show-desc Splash Cluster
    stats uri /
    stats refresh 10s
    stats realm Haproxy\ Statistics
    stats auth    username:password


# Splash Cluster configuration
# HAProxy listens on port 8050 on all interfaces
frontend http-in
    bind *:8050
    # If you want to enable access authentication for Splash,
    # comment out the default_backend splash-cluster line below
    # and uncomment the other commented lines above it.
    # The credentials are user / userpass (see the userlist above).
    # acl auth_ok http_auth(users)
    # http-request auth realm Splash if !auth_ok
    # http-request allow if auth_ok
    # http-request deny

    # acl staticfiles path_beg /_harviewer/
    # acl misc path / /info /_debug /debug

    # use_backend splash-cluster if auth_ok !staticfiles !misc
    # use_backend splash-misc if auth_ok staticfiles
    # use_backend splash-misc if auth_ok misc
    default_backend splash-cluster


backend splash-cluster
    option httpchk GET /
    balance leastconn

    # try another instance when connection is dropped
    retries 2
    option redispatch
    # Replace the IP addresses below with your own Splash hosts and ports,
    # and add the rest of your Splash servers in the same format.
    server splash-0 x.x.x.x:8050 check maxconn 5 inter 2s fall 10 observe layer4
    server splash-1 x.x.x.x:8050 check maxconn 5 inter 2s fall 10 observe layer4

backend splash-misc
    balance roundrobin
    # Replace the IP addresses below with your own Splash hosts and ports,
    # and add the rest of your Splash servers in the same format.
    server splash-0 x.x.x.x:8050 check fall 15
    server splash-1 x.x.x.x:8050 check fall 15
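
With the cluster behind HAProxy, the crawler should point at the HAProxy frontend rather than at any single Splash node. A minimal sketch of the relevant scrapy-splash settings, assuming the frontend address is x.x.x.x:8050 (replace with your HAProxy host; the scheduler/dupefilter settings from scrapy-redis are omitted here):

# settings.py (sketch)
SPLASH_URL = 'http://x.x.x.x:8050'   # the HAProxy frontend, not an individual Splash node

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

HAProxy then spreads requests across the splash-cluster backend with leastconn balancing, and a Splash node that is restarting is simply taken out of rotation by the health checks.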

Pitfalls of a Docker + Scrapy + Redis + Splash crawler

First, a spoiler: most of these problems are caused by the Great Firewall (GFW).

1. Docker is very painful to use without a properly configured proxy. I used Proxychains plus a local SOCKS5 proxy; otherwise the download speed feels like torture. For one-off use you can also pass the proxy on the command line, e.g. on Ubuntu: sudo apt-get -o Acquire::socks::proxy="socks://127.0.0.1:8399/" update

2. After installing the Splash Docker image you may find that some sites fail to render in the web UI. The GitHub source pulls in a few JavaScript files hosted abroad, and yes, even those are blocked. The fix (see the link below) is mainly to replace the two files referenced in resources.py with accessible mirrors: http://libs.baidu.com/jquery/1.11.1/jquery.min.js and https://ajax.aspnetcdn.com/ajax/jquery.migrate/jquery-migrate-1.2.1.js

To edit resources.py, find the container ID and open a root shell inside the running container:

docker ps

docker exec -u 0 -it ed39a8b02925 /bin/bash

https://blog.csdn.net/qq_42078965/article/details/108812636
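
As a rough illustration of that replacement step, here is a small Python sketch you could run inside the container; the file path and the original URLs are assumptions, so check resources.py in your own image for the exact strings before using it:

# patch_resources.py (sketch) -- swap blocked CDN URLs in Splash's resources.py
# for mirrors reachable from China. The path and the "original" URLs below are
# assumptions; verify them against your own container first.
RESOURCES = '/app/splash/resources.py'   # assumed location inside the scrapinghub/splash image

REPLACEMENTS = {
    # assumed blocked originals -> accessible mirrors mentioned above
    'https://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js':
        'http://libs.baidu.com/jquery/1.11.1/jquery.min.js',
    'https://code.jquery.com/jquery-migrate-1.2.1.js':
        'https://ajax.aspnetcdn.com/ajax/jquery.migrate/jquery-migrate-1.2.1.js',
}

with open(RESOURCES, encoding='utf-8') as f:
    text = f.read()
for old, new in REPLACEMENTS.items():
    text = text.replace(old, new)
with open(RESOURCES, 'w', encoding='utf-8') as f:
    f.write(text)

Restart the container afterwards so Splash picks up the change.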

Speaking of proxies, the commands below route Git traffic through a SOCKS5 proxy, because plain access to GitHub is far too slow:

git config --global http.proxy 'socks5://127.0.0.1:1080'
git config --global https.proxy 'socks5://127.0.0.1:1080'

3. WSL has memory-management problems: after Splash runs for a while, the machine's memory blows up. My current workaround is a script that monitors system memory and restarts the container when usage gets too high:

import os
import subprocess
import time

import psutil


def id_find():
    # Parse `docker ps` output: the first line is the header, the second
    # line (if any) is the first running container.
    lines = subprocess.check_output('sudo docker ps', shell=True).decode().split('\n')
    if len(lines) < 2 or not lines[1].strip():
        print('no container_id exists')
        return 0
    container_id = lines[1].split(' ')[0]
    print('one container_id found:', container_id)
    return container_id


while True:
    mem = psutil.virtual_memory()
    # psutil reports bytes; dividing by 1024 twice gives MB
    total = str(round(mem.total / 1024 / 1024))
    used = str(round(mem.used / 1024 / 1024))
    use_per = str(round(mem.percent))
    free = str(round(mem.free / 1024 / 1024))
    print("Total memory: " + total + "M")
    print("Used: " + used + "M (" + use_per + "%)")
    print("Free: " + free + "M")
    container_id = id_find()
    print('container_id return:', container_id)
    if container_id == 0:
        # No Splash container is running: start one
        time.sleep(3)
        os.system('sudo docker run -itd -v ~/default.ini -p 8050:8050 scrapinghub/splash /bin/bash --max-timeout=3600')
        time.sleep(5)
        # os.system('sudo docker container prune')  # remove all stopped containers. CAUTION!!!!
    elif int(use_per) > 90:
        # Memory usage above 90%: restart the Splash container to release memory
        os.system(f'sudo docker restart {container_id}')
        # os.system(f'sudo docker run -itd -v lagoucrawl\default.ini -p 8050:8050 scrapinghub/splash /bin/bash --max-timeout=600')
        time.sleep(3)
        # os.system('sudo docker container prune')  # remove all stopped containers. CAUTION!!!!
    else:
        time.sleep(3)

The default.ini here is my Dobel (多贝云) proxy profile, with the following contents:

[proxy]
   
host=http-proxy-t1.dobel.cn
port=9180

username=BOSSZxxxxxxx
password=yyyyyyy

Because I use the scrapy-redis-splash framework, the usual ways of setting a proxy in Scrapy cannot be applied directly; in practice this proxy-profile approach turned out to be both effective and the simplest way to give Splash a proxy.
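
For reference, here is a sketch of what the spider side looks like with this setup. According to Splash's proxy-profiles behaviour, a profile named default is applied automatically, so the Scrapy code needs no proxy settings at all; the spider name and URL below are placeholders:

import scrapy
from scrapy_splash import SplashRequest

class DemoSpider(scrapy.Spider):
    # Sketch only: with scrapy-redis you would normally inherit from RedisSpider
    # and feed start URLs through Redis instead of start_requests().
    name = 'demo'

    def start_requests(self):
        # The mounted default.ini proxy profile is picked up by Splash itself;
        # pass args={'proxy': '<profile-name>'} only to select a non-default profile.
        yield SplashRequest('https://example.com', callback=self.parse,
                            args={'wait': 2})

    def parse(self, response):
        self.logger.info('rendered page length: %d', len(response.text))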

While we are at it, a quick note on speeding up Docker image pulls with a registry mirror (Alibaba Cloud as an example); this goes into Docker's daemon.json:

{
  "registry-mirrors": [
    "https://qbsssss.mirror.aliyuncs.com"
  ],
  "insecure-registries": [],
  "debug": false,
  "experimental": false,
  "features": {
    "buildkit": true
  }
}

The accelerator address above can be found after logging in to the Alibaba Cloud console (just Google for it). One gotcha: if you log in with an overseas Alibaba Cloud account, you will never find that link!

Because of the GFW, it is genuinely hard for engineers in China to get at up-to-date technical material. Take proxies alone: every tool is configured differently (GitHub, wget, Docker, ...), and even proxychains is not a one-stop solution.

4. Redis looks simple, but it has plenty of pitfalls of its own. For example, after force-stopping the crawler you may need to delete the requests key in Redis before restarting, otherwise the crawler may not be able to continue.
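
A minimal sketch with redis-py that clears the scrapy-redis queues before a restart; the key names assume the default scrapy-redis patterns and a hypothetical spider called myspider:

import redis

# Assumed defaults: local Redis on port 6379, db 0, spider named "myspider".
r = redis.Redis(host='127.0.0.1', port=6379, db=0)

# By default scrapy-redis keeps pending requests and the dupefilter under
# these keys; delete them so a force-stopped crawl can start over cleanly.
r.delete('myspider:requests', 'myspider:dupefilter')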