Scrapy 302 redirect workaround

A workaround for Scrapy failing to get data because of 302 redirects

Today, while writing a JD.com crawler, I ran into a redirect problem; adding headers (including a User-Agent) still didn't fix it. Every site behaves differently: Lagou and BOSS Zhipin may require a Referer, while for JD.com even adding a Referer doesn't help. The temporary workaround is:

  • In the Scrapy Request, add dont_filter=True. Scrapy filters out duplicate request URLs by default; with this parameter set, the request can still fetch normal data even after being redirected.
  • In the Scrapy project's settings.py, add the following (see the sketch after this list):
HTTPERROR_ALLOWED_CODES = [301]
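
For reference, here is a minimal sketch of how the two pieces fit together in a spider. The spider name, search URL, and header values are placeholders of mine, not from the original post; only dont_filter=True and HTTPERROR_ALLOWED_CODES come from the workaround above.

# settings.py
# HTTPERROR_ALLOWED_CODES = [301]   # let responses with this status code reach the spider callback

import scrapy

class JdSpider(scrapy.Spider):
    name = 'jd_search'   # placeholder spider name

    def start_requests(self):
        url = 'https://search.jd.com/Search?keyword=python'   # placeholder URL for illustration
        yield scrapy.Request(
            url,
            callback=self.parse,
            headers={
                'User-Agent': 'Mozilla/5.0',        # placeholder UA
                'Referer': 'https://www.jd.com/',   # Referer alone did not help in this case
            },
            dont_filter=True,   # skip the duplicate-request filter so the redirected retry is not dropped
        )

    def parse(self, response):
        self.logger.info('got status %s for %s', response.status, response.url)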

Reference: https://www.pythonf.cn/read/154169

This doesn't really solve the problem: the 302 redirect still happens; the retry just ends up getting the data, at the cost of every item taking two requests. I suspect adding cookies might fix it, but cookies could bring more problems of their own.
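
If you want to experiment with the cookie idea, scrapy.Request also accepts a cookies dict. The sketch below is untested against JD.com, and the cookie names/values are placeholders:

        # untested sketch: reuse cookies copied from a logged-in browser session
        yield scrapy.Request(
            url,
            callback=self.parse,
            cookies={'example_cookie': 'value-from-browser'},   # placeholder names/values
            dont_filter=True,
        )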

Find large files on your computer in 5 seconds – traversing folders with Python multithreading

Traversing folders with Python multithreading, getting file sizes, and thread-safe lists

After a computer has been in use for a while, big files go unmanaged and the disk fills up before you know it. Here is the big gun: traverse 130,000 files in 5 seconds and dig out the large files you are after! It uses multithreading; in testing, more than 4 threads made little difference, since there are only a hundred-odd thousand files.

Although the script is only a few dozen lines, the way multithreading is used here is instructive: it sidesteps the Python list thread-safety question by giving each thread its own slice of the file list, so no list is ever written to from more than one thread.

# the script loops through folders and checks the sizes of all files
# multithreading is turned on to speed up the process

import os
import threading
from sys import argv
import time

SIZE_THRESHOLD_MB = 3072   # report files larger than 3 GB
lock = threading.Lock()    # guards the shared output file

def filewrite(filepath):
    # serialize writes so lines from different threads don't interleave
    with lock:
        with open('filesize_list.txt', 'a') as f:
            f.write(os.path.abspath(filepath))
            f.write('\n')

def sizeCheck(files):
    thr_name = threading.current_thread().name
    print(f'{thr_name} is processing {len(files)} files')
    for file_path in files:   # each entry is already a full path
        try:
            filesize = os.stat(file_path).st_size / 1024 / 1024
        except OSError:
            continue   # skip files we cannot stat (permissions, broken links)
        if filesize > SIZE_THRESHOLD_MB:
            print(f'{thr_name}: {file_path}, {int(filesize)}MB')
            filewrite(file_path)

def split_files(file_list, split_num):
    thread_list = []
    # number of files each thread has to process (round up)
    list_size = (len(file_list) + split_num - 1) // split_num
    print(f'files per thread: {list_size}')
    for i in range(split_num):
        # slice of the file list for the current thread; slicing clamps at the end
        file_list_split = file_list[i * list_size:(i + 1) * list_size]
        thread = threading.Thread(target=sizeCheck, args=(file_list_split,),
                                  name=f'Thread{i}')
        thread_list.append(thread)
        thread.start()
    # wait for all threads to finish
    for _item in thread_list:
        _item.join()

if __name__ == "__main__":
    t1 = time.time()
    thread_num = 6
    print("pass the directories to check as command-line arguments, or fall back to the defaults:")
    if len(argv) > 1:
        dirs_to_check = argv[1:]
        print(f'folders to check: {dirs_to_check} ({len(dirs_to_check)})')
    else:
        dirs_to_check = ['D:\\', 'E:\\', 'F:\\', 'G:\\']
    file_list_ = []
    for dir_to_check in dirs_to_check:
        print(f'dir_to_check {dir_to_check}')
        for root, dirs, files in os.walk(dir_to_check):
            for i in files:
                file_list_.append(os.path.join(root, i))
    print(f'num of files to scan: {len(file_list_)}')
    split_files(file_list_, thread_num)
    t2 = time.time()
    print(f'time lapsed: {t2 - t1}, num of threads used: {thread_num}')
            
       
Traversing 130,000 files in 5 seconds
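
For comparison, the same idea can be written with concurrent.futures.ThreadPoolExecutor, which handles the splitting and joining of worker threads itself. This is just an alternative sketch under the same 3 GB threshold, not the version that produced the 5-second figure above.

# alternative sketch: a thread pool instead of manual list splitting
import os
from concurrent.futures import ThreadPoolExecutor

def check_one(file_path, threshold_mb=3072):
    try:
        size_mb = os.stat(file_path).st_size / 1024 / 1024
    except OSError:
        return None   # skip files we cannot stat
    return (file_path, int(size_mb)) if size_mb > threshold_mb else None

def find_large_files(file_list, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for result in pool.map(check_one, file_list):
            if result:
                print(f'{result[0]}, {result[1]}MB')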