fix: improve crawling of blocked accounts and add proxy support #694

Open
ydotdog wants to merge 416 commits into dataabc:master from ydotdog:fix/handle-blocked-accounts-and-proxy-support


@ydotdog ydotdog commented Mar 24, 2026

Summary

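When crawling accounts that Weibo has blocked, the original code fails in several places, either with parsing errors or with rejected requests. This PR makes the following improvements: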

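  • Proxy support: adds a global proxy setting (a proxy field in config.json); all HTTP requests (pages, file downloads, video URL lookups) can go through the proxy
  • Differentiated HTTP error handling: handle_html retries a 403 (IP restriction) after a longer wait; on a 432 (UA rejected) it returns immediately and prompts the user to update the UA
  • User-info parsing compatibility: info_parser handles both HTML structures, the profile page seen when viewing another user and the one seen when viewing your own profile
  • Education/work history extraction: now located with a following-sibling XPath instead of hard-coded div indexes, which is more robust
  • Exception-handling fix: page_parser.get_one_page returns an empty list on an exception instead of an implicit None, so the caller in spider.py no longer fails when unpacking the result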
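
The differentiated status handling described above could look roughly like the sketch below. It is not the PR's handle_html: the function name `fetch_with_status_policy`, the wait times, and the retry count are assumptions made for illustration, and the HTTP call is injected as a callable so the policy can be exercised without a network.

```python
import time
from typing import Any, Callable, Optional

# Hypothetical wait times; the PR does not state the actual values.
DEFAULT_WAIT = 1.0
FORBIDDEN_WAIT = 30.0


def fetch_with_status_policy(
    get: Callable[[], Any],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[bytes]:
    """Call get() up to max_retries times, applying a per-status policy.

    get() must return an object with .status_code and .content,
    for example a requests.Response.
    """
    for _ in range(max_retries):
        resp = get()
        if resp.status_code == 200:
            return resp.content
        if resp.status_code == 432:
            # UA rejected: retrying with the same UA will not help.
            print("432: User-Agent rejected, please update the UA")
            return None
        # A 403 is treated as an IP restriction and waits longer
        # than any other error status before the next attempt.
        sleep(FORBIDDEN_WAIT if resp.status_code == 403 else DEFAULT_WAIT)
    return None
```

Returning None on failure matches the behavior noted elsewhere in this PR's history, where handle_html yields either an Element or None, so callers still need to check for None.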

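Changed files

| File | Description |
| --- | --- |
| weibo_spider/parser/util.py | Adds the proxy configuration function, the DEFAULT_UA constant, and 403/432 handling |
| weibo_spider/parser/info_parser.py | Refactors user-info parsing to support both page structures |
| weibo_spider/parser/page_parser.py | Adds a return statement to the exception handler |
| weibo_spider/downloader/downloader.py | File downloads support the proxy |
| weibo_spider/spider.py | Reads the proxy from config and initializes it |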

Usage

Add the proxy setting to config.json (optional):

{
    "proxy": "http://127.0.0.1:7890"
}

When the proxy field is not configured, behavior is identical to the original version.
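
As a rough illustration of how such an optional field can be turned into something an HTTP client accepts, the sketch below maps the config to the `proxies` mapping that the requests library takes. The helper name `build_proxies` is hypothetical and is not the function added in util.py.

```python
import json
from typing import Dict, Optional


def build_proxies(config: dict) -> Optional[Dict[str, str]]:
    """Return a requests-style proxies mapping, or None when no proxy is set."""
    proxy = config.get("proxy")
    if not proxy:
        # Missing or empty field: keep the client's default behavior.
        return None
    # Route both plain HTTP and HTTPS traffic through the same proxy.
    return {"http": proxy, "https": proxy}


config = json.loads('{"proxy": "http://127.0.0.1:7890"}')
proxies = build_proxies(config)
# With the requests library this could then be passed as:
#   requests.get(url, proxies=proxies)
```

Returning None when the field is absent keeps the no-proxy path unchanged, since requests treats `proxies=None` the same as omitting the argument.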

Test plan

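  • Without a proxy configured, behavior matches the original version
  • With a proxy configured, all requests are sent through the proxy
  • When crawling a blocked account, info_parser parses the user information correctly
  • Page requests that hit a 403 are retried automatically
  • An exception in page_parser does not crash the program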

🤖 Generated with Claude Code

dataabc and others added 30 commits May 31, 2020 18:19
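If the user_id_list parameter in config.json is a file path, that path can be either the file's absolute path or its path relative to the **current working directory of the command line**. For example, if the program is run from /home/test, user_id_list.txt can be placed in that directory, and its relative path, i.e. the value of the user_id_list parameter in config.json, is "user_id_list.txt".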

Issue dataabc#160
1. csv, txt, json, mongo, mysql writer verified.
2. img, video downloader verified.
jerrylaikr and others added 28 commits July 29, 2022 17:29
correct user's profile url when fetching weibo
perf: update absl-py lock version
Fix the tool to run correctly when there are 2 pinned weibo.
Bumps [requests](https://github.com/psf/requests) from 2.23.0 to 2.31.0.
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](psf/requests@v2.23.0...v2.31.0)

---
updated-dependencies:
- dependency-name: requests
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
….31.0

build(deps): bump requests from 2.23.0 to 2.31.0
Update the stale action rule to not mark issues with assignees as stale.
Fix the crawling of toutiao article urls.
issues_bug_574 Improve fetching of long-form Weibo posts
issues_feature_post_api_576 Push data to a custom endpoint via POST
Bumps [tqdm](https://github.com/tqdm/tqdm) from 4.46.1 to 4.66.3.
- [Release notes](https://github.com/tqdm/tqdm/releases)
- [Commits](tqdm/tqdm@v4.46.1...v4.66.3)

---
updated-dependencies:
- dependency-name: tqdm
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
build(deps): bump tqdm from 4.46.1 to 4.66.3
The handle_html function returns either an Element or None, but `page_parser.py` did not handle the None case
```py
'NoneType' object has no attribute 'xpath'
Traceback (most recent call last):
  File "C:\Users\zw\miniconda3\lib\site-packages\weibo_spider\spider.py", line 178, in get_weibo_info
    weibos, self.weibo_id_list, to_continue = PageParser(
  File "C:\Users\zw\miniconda3\lib\site-packages\weibo_spider\parser\page_parser.py", line 47, in __init__
    info = self.selector.xpath("//div[@class='c']")
AttributeError: 'NoneType' object has no attribute 'xpath'
```
Fix 'NoneType' object has no attribute 'xpath'
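
The kind of guard this fix calls for can be sketched as follows. The helper name `select_weibo_nodes` is hypothetical, and the selector is only assumed to expose an lxml-style `.xpath()` method; this is not the code from the commit.

```python
from typing import Any, List, Optional


def select_weibo_nodes(selector: Optional[Any]) -> List[Any]:
    """Return the div.c nodes of a page, or [] when the page failed to load."""
    if selector is None:
        # handle_html returned None, so there is nothing to parse.
        return []
    return selector.xpath("//div[@class='c']")
```

Checking for None once, at the point where the selector is first used, avoids the AttributeError shown in the traceback above.
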
---
updated-dependencies:
- dependency-name: requests
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
….32.0

build(deps): bump requests from 2.31.0 to 2.32.0
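1. Add global proxy configuration support (a proxy field in config.json)
2. handle_html adds differentiated handling and retry strategies for status codes such as 403 and 432
3. info_parser supports both HTML structures: viewing your own profile page and viewing another user's profile page
4. Education/work history extraction now uses following-sibling positioning, which is more robust
5. page_parser returns an empty list instead of None on an exception, avoiding an unpacking error in the caller
6. File downloads and video URL lookups both support the proxy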
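
Item 5 can be illustrated with a small sketch. It assumes the caller unpacks three values, as in the `weibos, self.weibo_id_list, to_continue = PageParser(` line of the traceback earlier in this history; the wrapper name `get_one_page_safe`, the injected `parse` callable, and the choice of False for the third value on failure are all assumptions, not the PR's actual code.

```python
import logging
from typing import Callable, List, Tuple

logger = logging.getLogger(__name__)


def get_one_page_safe(
    parse: Callable[[], List[dict]],
    weibo_id_list: List[str],
) -> Tuple[List[dict], List[str], bool]:
    """Run parse() and always return a 3-tuple the caller can unpack."""
    try:
        weibos = parse()
        return weibos, weibo_id_list, True
    except Exception:
        logger.exception("failed to parse page")
        # Explicit return: without it the function would fall through and
        # return None, and the caller's tuple unpacking would raise TypeError.
        return [], weibo_id_list, False
```

The point is only that the except branch ends in an explicit return of the same shape as the success path.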
Owner

dataabc commented Mar 24, 2026

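Thank you for contributing code. This program is intended to help people collect small amounts of data for academic research, not to crawl Weibo at scale, so I have no plans to add proxy support and will not merge this code. I hope you understand. Even though the code was not merged, thank you for your enthusiastic contribution.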

@ydotdog ydotdog force-pushed the fix/handle-blocked-accounts-and-proxy-support branch from 501028e to 8c49c81 Compare April 23, 2026 21:54