Shrinking image dimensions with Pillow while keeping image quality (plus a crontab scheduled-task deep dive)

A while back I accidentally uploaded a batch of very large image files. A WordPress plugin compressed and optimized them, but the original files were still sitting on disk. So I wrote a short script that shrinks the images with Pillow across multiple threads, and it is fast - most of the time goes into opening and saving each image, so multithreading does help. A bit over 2,000 images took about 4 seconds; even with a hundred thousand or so images I'd guess the time would be in the same ballpark.

The folders are traversed with os.walk, and the Pillow compression runs in multiple threads. This beats any plugin - the various optimization plugins only slow the site down, so you might as well write a script and compress the images directly on the server.

thumbnail differs from resize: it can only shrink an image, never enlarge it, and it preserves the original aspect ratio, which is very convenient. And don't forget anti-aliasing when downsampling - after all these years of seismic data processing I know how important it is to avoid aliasing. (Image.ANTIALIAS is just an alias for the LANCZOS filter and has been removed from recent Pillow releases, so the code below uses Image.LANCZOS.)
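As a quick illustration of the difference (a throwaway sketch, not part of the script below; the file name is made up):

from PIL import Image

img = Image.open("example.jpg")            # hypothetical file
copy_ = img.copy()

# thumbnail(): in place, keeps the aspect ratio, only ever shrinks
copy_.thumbnail((1024, 768), Image.LANCZOS)
print(copy_.size)                          # e.g. (1024, 683) for a 3:2 photo

# resize(): returns a new image forced to exactly this size,
# distorting the aspect ratio and upscaling if asked to
stretched = img.resize((1024, 768), Image.LANCZOS)
print(stretched.size)                      # always (1024, 768)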

# the script was written to loop through a folder and resize all large pics to smaller sizes
# multi-threading is turned on to speed up process

import os
import threading
import time
from sys import argv

from PIL import Image

file_size = (1024, 768)  # maximum width/height passed to thumbnail(); aspect ratio is preserved

def resize(files):
    for file in files:
        if file.endswith('jpg') or file.endswith('png'):
            # entries in the file list are already full paths, so no join with root is needed
            file_path = file
            filesize = os.stat(file_path).st_size/1024  # size in KB
            if filesize > 200:
                print(file, file_path, filesize)
                img = Image.open(file_path)
                print(img.size)
                # thumbnail() shrinks in place and keeps the aspect ratio;
                # Image.LANCZOS replaces Image.ANTIALIAS, which newer Pillow versions removed
                img.thumbnail(file_size, Image.LANCZOS)
                img = img.convert("RGB")  # PNG must be converted to RGB before saving as JPEG
                print(img.size)
                img.save(file_path, "JPEG")  # overwrites the original file, keeping its extension
                
                
def filewrite(filepath):
    with open('filesize_list.txt', 'a') as f:
        f.write(os.path.abspath(filepath))
        f.write('\n')

def sizeCheck(files):
    # logs any file larger than 2MB; note that split_files below runs resize, not this helper
    thr_name = threading.current_thread().name
    print(f'{thr_name} is processing {len(files)} files')
    for file in files:
        file_path = file  # entries are already full paths
        # file_path = os.path.abspath(file)
        filesize = os.stat(file_path).st_size/1024/1024  # size in MB
        if filesize > 2:
            print(f'threading: {file}, {file_path}, {int(filesize)}MB')
            # lock.acquire()
            filewrite(file_path)
            # lock.release()
            
def split_files(file_list, split_num):
    thread_list = []
    # list size each thread has to process
    list_size = (len(file_list) // split_num) if (len(file_list) % split_num == 0) else ((len(file_list) // split_num) + 1)
    print(f'num of files per thread: {list_size}')
    # start thread
    for i in range(split_num):
        # get url that current thread need to process
        file_list_split = file_list[
                         i * list_size:(i + 1) * list_size if len(file_list) > (i + 1) * list_size else len(file_list)]
        thread = threading.Thread(target=resize, args=(file_list_split,))
        thread.name = "Thread" + str(i)  # Thread.setName() is deprecated in favour of the name attribute
        thread_list.append(thread)
        # start in thread
        thread.start()
        # print(thread.getName() + "started")
    # combine at the end of the job
    for _item in thread_list:
        _item.join()
        
if __name__ == "__main__":
    t1 = time.time()
    thread_num = 6
    lock = threading.Lock()
    print("add the directory where you want to check filesizes or leave it to default:") 
    if len(argv) > 1:
        dirs_to_check = argv[1:]  # directories passed on the command line
        print(f'folders to check: {dirs_to_check}, count: {len(dirs_to_check)}')
    else:
        dirs_to_check = ['/www/wwwroot/geoseis.cn/wp-content/uploads/2019', '/www/wwwroot/geoseis.cn/wp-content/uploads/2020']
    file_list_ = []
    for dir_to_check in dirs_to_check:      
        print(f'dir_to_check {dir_to_check}')
        for root, dirs, files in os.walk(dir_to_check):
            # print(root, dirs, files, len(files))
            for i in files:
                file_list_.append(os.path.join(root, i))
    print(f'num of files to scan: {len(file_list_)}')
    split_files(file_list_, thread_num) # thread_num
    t2 = time.time()
    print(f'time lapsed: {t2-t1}, num of threads used: {thread_num}')
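Incidentally, the manual chunking and thread bookkeeping in split_files could also be handed to concurrent.futures from the standard library; a rough sketch, reusing the resize function above on one file at a time:

from concurrent.futures import ThreadPoolExecutor

def resize_one(file_path):
    # thin wrapper so the pool can hand out one path at a time
    resize([file_path])

with ThreadPoolExecutor(max_workers=6) as pool:
    pool.map(resize_one, file_list_)  # the pool handles splitting and joining for you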
            
                        
0 */5 * * *  sudo python /www/wwwroot/geoseis.cn/wp-content/uploads/resize_threading.py

A crontab entry handles the scheduling, running the script every 5 hours (the minute field only accepts 0-59, so "*/300" there would be invalid - every 5 hours is 0 */5 * * *). There's a site that checks your crontab expression interactively in real time, highly recommended: https://crontab.guru/#0_/1_

Run at 00:05 every day
Run every 300 minutes
Run at 05:10 every Friday
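For reference, the matching crontab expressions would be roughly the following (standard cron cannot say "every 300 minutes" directly, so it becomes every 5 hours):

5 0 * * *      # at 00:05 every day
0 */5 * * *    # every 5 hours, i.e. every 300 minutes on the hour
10 5 * * 5     # at 05:10 every Friday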

Configuring the Crawlab spider management platform (data persistence and password-protected access)

Installing and configuring Crawlab with Docker

Crawlab has been around for just over a year and has already collected 7,000+ stars on GitHub, which shows how badly the market needs a multi-purpose spider management platform like this. First, a quick look:

Crawlab demo (presumably the Pro edition); details: https://docs.crawlab.cn/zh/Monitor/
My freshly deployed Crawlab (you can see the Community edition is stripped down)

The monitoring feature isn't actually hard to build yourself, and Portainer offers more complete functionality anyway.

Compared with the Pro edition, the most important thing stripped from the Community edition is support for external data sources - most notably MySQL, the most popular one - which you can only get by paying for Pro, which is annoying.

The official docs cover configuration in reasonable detail, so I'll focus on a few customizations, especially password-protected access. With the default install there is no authentication at all: anyone who knows your IP and port can reach your Redis and MongoDB, so once the demo works the next thing to sort out is passwords. The official docs don't cover this, which I think is unfriendly to people who aren't very familiar with Docker, so here's a quick rundown.

version: '3.3'
services:
  master:
    image: tikazyq/crawlab:latest
    container_name: master
    environment:
      # CRAWLAB_API_ADDRESS: "https://<your_api_ip>:<your_api_port>"  # backend API address, for https or source deployments
      CRAWLAB_SERVER_MASTER: "Y"  # whether this is the master node: Y for master, N for worker
      CRAWLAB_MONGO_HOST: "mongo"  # MongoDB host; inside the docker compose network the service name can be referenced directly
      CRAWLAB_MONGO_PORT: "27017"  # MongoDB port
      CRAWLAB_MONGO_DB: "crawlab_test"  # MongoDB database
      CRAWLAB_MONGO_USERNAME: "user"  # MongoDB username
      CRAWLAB_MONGO_PASSWORD: "password"  # MongoDB password
      # CRAWLAB_MONGO_AUTHSOURCE: "admin"  # MongoDB auth source
      CRAWLAB_REDIS_ADDRESS: "redis"  # Redis host; inside the docker compose network the service name can be referenced directly
      CRAWLAB_REDIS_PORT: "6379"  # Redis port
      CRAWLAB_REDIS_DATABASE: "1"  # Redis database
      CRAWLAB_REDIS_PASSWORD: "redispassword"  # Redis password
      # CRAWLAB_LOG_LEVEL: "info"  # log level, defaults to info
      # CRAWLAB_LOG_ISDELETEPERIODICALLY: "N"  # whether to periodically delete log files, off by default
      # CRAWLAB_LOG_DELETEFREQUENCY: "@hourly"  # frequency of deleting log files, defaults to hourly
      CRAWLAB_TASK_WORKERS: 8  # number of task executors (tasks run in parallel)
      CRAWLAB_SERVER_REGISTER_TYPE: "mac"  # node register type, defaults to mac address; set to ip to avoid mac address conflicts
      # CRAWLAB_SERVER_REGISTER_IP: "127.0.0.1"  # node register IP, the node's unique ID; only takes effect when CRAWLAB_SERVER_REGISTER_TYPE is "ip"
      # CRAWLAB_SERVER_LANG_NODE: "Y"  # whether to pre-install Node.js
      # CRAWLAB_SERVER_LANG_JAVA: "Y"  # whether to pre-install Java
      # CRAWLAB_SERVER_LANG_DOTNET: "Y"  # whether to pre-install .NET Core
      # CRAWLAB_SERVER_LANG_PHP: "Y"  # whether to pre-install PHP
      # CRAWLAB_SERVER_LANG_GO: "Y"  # whether to pre-install Golang
      # CRAWLAB_SETTING_ALLOWREGISTER: "N"  # whether to allow user registration
      # CRAWLAB_SETTING_ENABLETUTORIAL: "N"  # whether to enable the tutorial
      CRAWLAB_SETTING_RUNONMASTER: "Y"  # whether to run tasks on the master node
      CRAWLAB_SETTING_DEMOSPIDERS: "Y"  # whether to initialize the demo spiders
      CRAWLAB_SETTING_CHECKSCRAPY: "Y"  # whether to automatically detect whether a spider is a scrapy project
      # CRAWLAB_NOTIFICATION_MAIL_SERVER: smtp.exmaple.com  # SMTP server address
      # CRAWLAB_NOTIFICATION_MAIL_PORT: 465  # SMTP server port
      # CRAWLAB_NOTIFICATION_MAIL_SENDEREMAIL: admin@exmaple.com  # sender email
      # CRAWLAB_NOTIFICATION_MAIL_SENDERIDENTITY: admin@exmaple.com  # sender ID
      # CRAWLAB_NOTIFICATION_MAIL_SMTP_USER: username  # SMTP username
      # CRAWLAB_NOTIFICATION_MAIL_SMTP_PASSWORD: password  # SMTP password

One thing to watch in the master node config above: the password environment variables are there for the components to talk to each other, so they must match the worker node settings below and the actual Redis/MongoDB passwords:

  worker:
    image: tikazyq/crawlab:latest
    container_name: worker
    environment:
      CRAWLAB_SERVER_MASTER: "N"
      CRAWLAB_MONGO_HOST: "mongo"
      CRAWLAB_MONGO_USERNAME: "user"  # MongoDB username
      CRAWLAB_MONGO_PASSWORD: "password"  # MongoDB password
      CRAWLAB_REDIS_ADDRESS: "redis"
      CRAWLAB_REDIS_PASSWORD: "redispassword"  # this must be set, otherwise Crawlab cannot fetch worker node info - Crawlab relies on Redis to communicate with and manage nodes; the upstream sample config omits it (though with decent English you could puzzle out the cause from the logs)
    depends_on:
      - mongo
      - redis
    # volumes:
    #   - "/var/crawlab/log:/var/logs/crawlab"  # log persistence
  mongo:
    image: mongo:latest
    restart: always
    environment:
      MONGO_INITDB_ROOT_USERNAME: "user"
      MONGO_INITDB_ROOT_PASSWORD: "password"
    volumes:
      - "/home/chen/crawlab/mongo/data/db:/data/db"  # make data persistent
    ports:
      - "27017:27017"  # expose port to the host machine
  redis:
    image: redis:latest
    restart: always
    # command: redis-server ./redis.conf/redis.conf
    # command: redis-server /etc/redis/redis.conf --requirepass "redispassword"
    command: redis-server --requirepass "redispassword"  # set the Redis password
    volumes:
      # - "/opt/crawlab/redis/data:/data"  # make data persistent
      # - ./conf/redis/redis.conf:/etc/redis/redis.conf
      - "/home/chen/crawlab/data/dump.rdb:/data/dump.rdb"
    ports:
      - "6379:6379"  # expose port to the host machine

The default config has no password setting for the worker node, and that omission is what stops Crawlab from picking up node information. Also, although after some fiddling I sorted out mounting a local redis.conf and data persistence, at first the local Redis RDB backup wasn't being loaded, i.e. the Redis inside the container started out empty. It turned out I was mounting the local RDB the wrong way: it should be "/home/chen/crawlab/data/dump.rdb:/data/dump.rdb", i.e. map the absolute path of the local rdb file to /data/dump.rdb. There's a related issue on GitHub: https://github.com/docker-library/redis/issues/77
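To sanity-check that the password took effect and the mounted dump.rdb was actually loaded, a quick check with the redis-py client works (a rough sketch; host, db number and password are whatever you put in the compose file above):

import redis  # pip install redis

r = redis.Redis(host='127.0.0.1', port=6379, db=1, password='redispassword')
print(r.ping())     # True means the password is accepted
print(r.dbsize())   # non-zero if keys from the mounted dump.rdb were loaded into this db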

One more thing: logging into a MongoDB installed via Docker works a bit differently:

docker exec -it  crawlab_mongo_1 mongo -u USER -p PASSWORD --authenticationDatabase admin
Here docker exec -it opens an interactive session inside the container; crawlab_mongo_1 is the container's name, which you can get from docker ps (using the CONTAINER ID should work just as well), for example:
(boss) [chen@VM_0_2_centos crawlab]$ docker ps
CONTAINER ID        IMAGE                       COMMAND                  CREATED             STATUS              PORTS                              NAMES
0a391aea4fb7        mongo:latest                "docker-entrypoint.s…"   14 hours ago        Up 12 hours         0.0.0.0:27017->27017/tcp           crawlab_mongo_1

The trailing --authenticationDatabase admin is required, otherwise you'll be logged in without admin privileges. Don't underestimate this detail - if you're a beginner, small things like this can hold you up for a long time. Learn to use Google properly and stay away from Baidu. What, you can't reach Google? Then maybe this line of work isn't for you: the most basic skill for an IT professional is making good use of Google, Stack Overflow and GitHub, and if you can't even get over that wall, you're really not cut out for the job.
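The same applies when connecting from code rather than the mongo shell: the auth database has to be specified. A minimal pymongo sketch (user, password and host are placeholders for whatever you configured above):

from pymongo import MongoClient  # pip install pymongo

client = MongoClient('mongodb://user:password@127.0.0.1:27017/?authSource=admin')
db = client['crawlab_test']           # the database set in CRAWLAB_MONGO_DB
print(db.list_collection_names())     # fails with an auth error if authSource is wrong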