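In the previous post I briefly described the 302 redirect problem and a workaround. That workaround was a compromise, not the best solution. The best solution is of course to find out what causes the 302 redirect and fix it. Let's analyze this piece of JD (京东) JavaScript: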
function jump_mobile() {
    if (is_sort_black_list()) {
        return;
    }
    var userAgent = navigator.userAgent || "";
    userAgent = userAgent.toUpperCase();
    if (userAgent == "" || userAgent.indexOf("PAD") > -1) {
        return;
    }
    if (window.location.hash == '#m') {
        var exp = new Date();
        exp.setTime(exp.getTime() + 30 * 24 * 60 * 60 * 1000);
        document.cookie = "pcm=1;expires=" + exp.toGMTString() + ";path=/;domain=jd.com";
        window.showtouchurl = true;
        return;
    }
    if (/MOBILE/.test(userAgent) && /(MICROMESSENGER|QQ\/)/.test(userAgent)) {
        var paramIndex = location.href.indexOf("?");
        window.location.href = "//item.m.jd.com/product/11494732.html" + (paramIndex > 0 ? location.href.substring(paramIndex, location.href.length) : '');
        return;
    }
    // ...
}
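It is actually quite simple: the script uses the User-Agent to work out which platform you are on, and a mobile platform is forcibly 302-redirected to the mobile page. In this excerpt, a User-Agent that contains MOBILE together with MICROMESSENGER or QQ/ is sent to item.m.jd.com. The trouble is that the mobile page usually carries incomplete information and cannot satisfy our crawler. Many people will think of forging the request headers and sending a User-Agent of their own. It sounds simple, but a forged User-Agent often does not work, especially against this JD code. The hard part is forging a User-Agent that belongs to a PC (desktop) platform.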
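To make the branching easier to test from a crawler, the checks in the excerpt can be mirrored in Python. This is only a sketch of the lines shown above: the function name `would_redirect_to_mobile` and the `url_hash` parameter are my own, the `is_sort_black_list()` check is left out because its body is not shown, and the real script may contain further branches after the excerpt ends.

```python
import re


def would_redirect_to_mobile(user_agent, url_hash=""):
    """Mirror the checks of the jump_mobile() excerpt for one User-Agent.

    Hypothetical helper: it covers only the branches visible in the
    excerpt and skips is_sort_black_list(), whose body is unknown.
    """
    ua = (user_agent or "").upper()
    # An empty User-Agent, or one containing "PAD" (e.g. iPad), is left alone.
    if ua == "" or "PAD" in ua:
        return False
    # With '#m' in the URL the script sets the pcm cookie and returns
    # without redirecting.
    if url_hash == "#m":
        return False
    # The redirect fires only for MOBILE plus WeChat (MICROMESSENGER) or QQ/.
    return bool(re.search(r"MOBILE", ua) and re.search(r"MICROMESSENGER|QQ/", ua))
```

A desktop Chrome string contains neither MOBILE nor MICROMESSENGER, so the function returns False for it, which is the property the forged User-Agent needs.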
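I tried several approaches and several packages, including fake_ua and user_agent. In the end only user_agent met the requirement. Its algorithm is easy to follow: it picks an OS platform first and builds the User-Agent around it. The core code is here: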
https://github.com/lorien/user_agent/blob/master/user_agent/base.py
Using the package looks like this:
from user_agent import generate_user_agent

# Ask for a desktop Chrome-on-Windows User-Agent.
print(generate_user_agent(os='win', navigator='chrome', device_type='desktop'))
# e.g. 'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.120 Safari/537.36'
If you would rather not install a package, you can use my code below, which imitates the user_agent package:
import random

# Chrome build numbers to pick from at random.
CHROME_BUILD = '''
80.0.3987.132
80.0.3987.149
80.0.3987.99
81.0.4044.117
81.0.4044.138
83.0.4103.101
83.0.4103.106
83.0.4103.96
84.0.4147.105
84.0.4147.111
84.0.4147.125
84.0.4147.135
84.0.4147.89
85.0.4183.101
85.0.4183.102
85.0.4183.120
85.0.4183.121
85.0.4183.127
85.0.4183.81
85.0.4183.83
86.0.4240.110
86.0.4240.111
86.0.4240.114
86.0.4240.183
86.0.4240.185
86.0.4240.75
86.0.4240.78
86.0.4240.80
86.0.4240.96
86.0.4240.99
'''.strip().splitlines()
OS_PLATFORM = {
    'win': (
        'Windows NT 5.1',  # Windows XP
        'Windows NT 6.1',  # Windows 7
        'Windows NT 6.2',  # Windows 8
        'Windows NT 6.3',  # Windows 8.1
        'Windows NT 10.0',  # Windows 10
    ),
    'mac': (
        'Macintosh; Intel Mac OS X 10.8',
        'Macintosh; Intel Mac OS X 10.9',
        'Macintosh; Intel Mac OS X 10.10',
        'Macintosh; Intel Mac OS X 10.11',
        'Macintosh; Intel Mac OS X 10.12',
    ),
}
OS_CPU = {
    'win': (
        '',  # 32bit
        'Win64; x64',  # 64bit
        'WOW64',  # 32bit process on 64bit system
    ),
}
# Pick a random Windows platform, CPU token and Chrome build.
platform = random.choice(OS_PLATFORM["win"])
cpu = random.choice(OS_CPU["win"])
chrome_build = random.choice(CHROME_BUILD)
# The CPU token is optional: 32-bit Windows has none.
if len(cpu) > 0:
    tmp = f'Mozilla/5.0 ({platform}; {cpu}) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/{chrome_build} Safari/537.36'
else:
    tmp = f'Mozilla/5.0 ({platform}) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/{chrome_build} Safari/537.36'
print(tmp)
In my tests, JD no longer answers with a 302 redirect.
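One way to check this yourself is to send the request with redirects disabled and look at the status code. The sketch below uses only the standard library. The names `NoRedirect`, `build_request` and `check_redirect` are my own, and `PRODUCT_URL` is an assumed desktop address for the product id that appears in the JD script, not a URL taken from the original test.

```python
import urllib.error
import urllib.request

# Assumed desktop URL for the product id seen in the JD script.
PRODUCT_URL = "https://item.jd.com/11494732.html"
# Example desktop User-Agent of the kind generated above.
DESKTOP_UA = ("Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/85.0.4183.120 Safari/537.36")


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Refuse to follow redirects so that a 302 surfaces as an HTTPError."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def build_request(url, user_agent):
    """Build a GET request that carries the forged User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def check_redirect(url, user_agent, timeout=10):
    """Return (status, location); location is None when no redirect occurs."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(build_request(url, user_agent), timeout=timeout) as resp:
            return resp.status, None
    except urllib.error.HTTPError as err:
        # With redirects disabled, 3xx responses arrive here.
        return err.code, err.headers.get("Location")
```

Calling `check_redirect(PRODUCT_URL, DESKTOP_UA)` returns a `(status, location)` pair. A 302 with a `location` on item.m.jd.com would mean the User-Agent was still treated as mobile.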