This Stack Overflow post describes the problem clearly: "I crawl a webpage page by page, by following urls in pagination. In some pages, website detects that I use a bot and gives me an error in html. Since it is a successful request, it caches the page and when I run it again, I get the same error."
Getting blocked by anti-bot measures is one of the most common situations in crawling. With HTTPCACHE enabled, Scrapy caches requests and responses locally (in the .scrapy folder under the project directory) to speed up the crawl. The problem is that when you retry after being blocked, the cached entry is not refreshed, so the page you parse is still the anti-bot error page and you never get the data. It is therefore necessary to customize the cache middleware's behavior so that this stale entry gets refreshed.
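One common way to do this is to plug a custom cache policy into Scrapy's HttpCacheMiddleware via the HTTPCACHE_POLICY setting. Below is a minimal sketch, not the post's exact code: it assumes the site's block page contains an identifiable marker string, and the names `SkipBannedPagesPolicy` and `BAN_MARKER` are illustrative.

```python
# myproject/cachepolicy.py  (module path is an assumption; adjust to your project)
#
# settings.py would then contain something like:
#   HTTPCACHE_ENABLED = True
#   HTTPCACHE_POLICY = "myproject.cachepolicy.SkipBannedPagesPolicy"

from scrapy.extensions.httpcache import DummyPolicy


class SkipBannedPagesPolicy(DummyPolicy):
    """Cache policy that refuses to cache anti-bot error pages and treats
    already-cached ones as stale so they get re-downloaded."""

    # Assumption: the ban page contains this text; replace it with whatever
    # the target site actually puts in its block/error HTML.
    BAN_MARKER = b"Access Denied"

    def _looks_banned(self, response):
        # Check the raw body for the ban marker.
        return self.BAN_MARKER in (response.body or b"")

    def should_cache_response(self, response, request):
        # Never write a ban page into the cache in the first place.
        if self._looks_banned(response):
            return False
        return super().should_cache_response(response, request)

    def is_cached_response_fresh(self, cachedresponse, request):
        # If a ban page already slipped into the cache, report it as stale
        # so Scrapy fetches the URL again instead of replaying the error.
        if self._looks_banned(cachedresponse):
            return False
        return super().is_cached_response_fresh(cachedresponse, request)
```

With this in place, a blocked response is neither stored nor served from the cache on the next run; only genuinely successful pages keep benefiting from the local cache. Subclassing HttpCacheMiddleware directly and deleting the offending entry from storage is another option, but the policy hook shown here is the smaller change.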