fix: Improve crawling of blocked accounts and add proxy support by ydotdog · Pull Request #694 · dataabc/weiboSpider

ydotdog · 2026-03-24T07:22:38Z

Summary

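When crawling accounts that Weibo has blocked, the existing code fails in several places, either because parsing breaks or because requests are rejected. This PR makes the following improvements: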

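Proxy support: adds a global proxy setting (a proxy field in config.json); all HTTP requests (pages, file downloads, video URL lookups) can go through the proxy
Differentiated HTTP error handling: handle_html waits longer before retrying on 403 (IP restricted), and on 432 (UA rejected) returns immediately with a prompt to update the UA
User info parsing compatibility: info_parser handles both HTML structures, the profile page as seen when viewing another user and as seen when viewing your own
Education/work history extraction: located with a following-sibling XPath instead of hard-coded div indexes, which is more robust
Exception handling fix: page_parser.get_one_page returns an empty list on exception rather than an implicit None, avoiding an unpacking error in the caller, spider.py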
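
The differentiated handling described for handle_html could look roughly like the sketch below. It is not the PR's actual code: the function name `decide_retry`, the `RetryDecision` type, and all wait times are assumptions made for illustration. Only the status codes and their meanings (403 for an IP restriction, 432 for a rejected UA) come from the description above.

```python
from typing import NamedTuple, Optional


class RetryDecision(NamedTuple):
    """Hypothetical result type: whether to retry and how long to wait first."""
    retry: bool
    wait_seconds: float
    message: Optional[str] = None


def decide_retry(status_code: int, attempt: int,
                 base_wait: float = 5.0,
                 long_wait: float = 60.0) -> RetryDecision:
    """Map an HTTP status code to a retry decision (illustrative values only)."""
    if status_code == 200:
        return RetryDecision(False, 0.0)
    if status_code == 403:
        # IP restricted: back off for longer, growing with each attempt.
        return RetryDecision(True, long_wait * (attempt + 1),
                             "403: IP may be restricted, waiting longer")
    if status_code == 432:
        # UA rejected: retrying with the same UA is pointless, so stop here.
        return RetryDecision(False, 0.0,
                             "432: User-Agent rejected, please update the UA")
    # Any other error: ordinary short wait before retrying.
    return RetryDecision(True, base_wait, f"HTTP {status_code}")
```

Keeping the decision separate from the request loop makes the policy easy to test without touching the network.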

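Changed files

File	Description
`weibo_spider/parser/util.py`	Adds proxy configuration functions, the DEFAULT_UA constant, and 403/432 handling
`weibo_spider/parser/info_parser.py`	Refactors user info parsing to handle both page structures
`weibo_spider/parser/page_parser.py`	Adds a return statement to the exception handler
`weibo_spider/downloader/downloader.py`	File downloads support the proxy
`weibo_spider/spider.py`	Reads the proxy from config and initializes it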

Usage

Add the proxy setting to config.json (optional):

{
    "proxy": "http://127.0.0.1:7890"
}

When the proxy field is not set, behavior is identical to the original version.
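
As a rough illustration of how an optional proxy field can be turned into something an HTTP client accepts, the sketch below builds the scheme-to-URL mapping that the `requests` library takes in its `proxies` argument. The helper name `build_proxies` is hypothetical and is not taken from the PR.

```python
import json
from typing import Dict, Optional


def build_proxies(config: dict) -> Optional[Dict[str, str]]:
    """Return a requests-style proxies mapping, or None when no proxy is set.

    Hypothetical helper: the PR's real function names are not shown here.
    """
    proxy = config.get("proxy")
    if not proxy:
        # No proxy configured: callers pass None and requests behaves as usual.
        return None
    # Route both plain and TLS traffic through the same proxy.
    return {"http": proxy, "https": proxy}


config = json.loads('{"proxy": "http://127.0.0.1:7890"}')
proxies = build_proxies(config)
# With requests installed, this would be used as:
#   requests.get(url, headers=headers, proxies=proxies)
```

Returning None when the field is absent keeps the no-proxy path unchanged, which matches the stated goal of not altering default behavior.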

Test plan

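Without a proxy configured, functionality matches the original version
With a proxy configured, all requests are sent through the proxy
When crawling blocked accounts, info_parser parses user info correctly
Page requests retry automatically on 403
An exception in page_parser does not crash the program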
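
The last item can be illustrated with a small sketch of why an implicit None breaks the caller. The three-value unpacking mirrors the `weibos, self.weibo_id_list, to_continue = ...` line visible in the traceback quoted on this page; the function names and the fallback values are assumptions, and the exact return shape used by the PR is not shown here.

```python
def get_one_page_implicit_none(page):
    """Swallows the error and falls off the end, so it returns None."""
    try:
        raise ValueError("simulated parse failure on page %d" % page)
    except Exception:
        pass  # no return statement: the caller receives None


def get_one_page_with_fallback(page, weibo_id_list):
    """Returns a well-formed (assumed) fallback so the caller can still unpack."""
    try:
        raise ValueError("simulated parse failure on page %d" % page)
    except Exception:
        # Assumed shape: no weibos, unchanged id list, stop paging.
        return [], weibo_id_list, False


def caller_survives(result) -> bool:
    """Mimic the caller's tuple unpacking and report whether it succeeds."""
    try:
        weibos, weibo_id_list, to_continue = result
        return True
    except TypeError:
        # Raised when result is None: NoneType cannot be unpacked.
        return False
```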

🤖 Generated with Claude Code

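If the user_id_list parameter in config.json is a file path, the path may be either the file's absolute path or its path relative to the **current command-line directory**. For example, when the program is run from /home/test, user_id_list.txt can be placed in that directory, and its relative path, that is, the value of the user_id_list parameter in config.json, is "user_id_list.txt". Issue dataabc#160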


Bumps [requests](https://github.com/psf/requests) from 2.23.0 to 2.31.0.
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](psf/requests@v2.23.0...v2.31.0)

---
updated-dependencies:
- dependency-name: requests
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>


Bumps [tqdm](https://github.com/tqdm/tqdm) from 4.46.1 to 4.66.3.
- [Release notes](https://github.com/tqdm/tqdm/releases)
- [Commits](tqdm/tqdm@v4.46.1...v4.66.3)

---
updated-dependencies:
- dependency-name: tqdm
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>


The handle_html function returns an Element or None, and `page_parser.py` does not handle the None case:

```py
'NoneType' object has no attribute 'xpath'
Traceback (most recent call last):
  File "C:\Users\zw\miniconda3\lib\site-packages\weibo_spider\spider.py", line 178, in get_weibo_info
    weibos, self.weibo_id_list, to_continue = PageParser(
  File "C:\Users\zw\miniconda3\lib\site-packages\weibo_spider\parser\page_parser.py", line 47, in __init__
    info = self.selector.xpath("//div[@class='c']")
AttributeError: 'NoneType' object has no attribute 'xpath'
```
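
A None guard of the kind this report calls for might look like the sketch below. The function name `extract_blocks` is hypothetical; the XPath expression is the one from the traceback, and the selector is assumed to be whatever handle_html returned (an element with an `xpath` method, or None).

```python
def extract_blocks(selector):
    """Return the matched nodes, or an empty list when the page failed to load."""
    if selector is None:
        # handle_html gave up (for example after a rejected request):
        # report "nothing found" instead of raising AttributeError.
        return []
    return selector.xpath("//div[@class='c']")
```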


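1. Add global proxy configuration support (a proxy field in config.json)
2. handle_html adds differentiated handling and retry strategies for status codes such as 403/432
3. info_parser handles both HTML structures: viewing your own profile page and viewing another user's
4. Education/work history extraction now uses following-sibling positioning, which is more robust
5. page_parser returns an empty list rather than None on exception, avoiding an unpacking error in the caller
6. File downloads and video URL lookups both support the proxy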

dataabc · 2026-03-24T14:04:32Z

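Thank you for contributing code. This program is meant to help people collect small amounts of data for academic research, not to crawl Weibo at scale, so there are no plans to add proxy support and this code will not be merged. I hope you understand. Although the code was not merged, thank you for the enthusiastic contribution.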

dataabc and others added 30 commits May 31, 2020 18:19

Update README.md

48bbc0c

Update README.md

2f054e4

perf: optimize reading of the config.json file

48a10b0

Update README.md

11ddfaf

fix: fix bug where result files could only be written to the program's directory

9da8f68

Issue dataabc#160

fix: fix bug where log.txt could only be written to the program directory

4c276c3

Code refactor. related dataabc#160

93b48ee

1. csv, txt, json, mongo, mysql writer verified. 2. img, video downloader verified.

perf: improve user prompts

ab4559c

Merge branch 'master' of github.com:dataabc/weiboSpider into multi-file

b9bf4f0

several bug fixes

592496b

Update README.md

7f0a82c

fix: fix module path issue

837e616

git rollback

842ae75

Update README.md

33376d1

fix: fix issue where the PyPI version could not add a configuration file

03103a7

fix: fix issue where pymysql had to be installed to run the program

fdf32bf

perf: remove Python 2 related code

57a541d

fix: fix bug where the crawled item count was inaccurate

c6ffb68

Create CONTRIBUTING.md

d0d9123

Update CONTRIBUTING.md

14e3b38

Update CONTRIBUTING.md

7fc570c

Update CONTRIBUTING.md

6caab55

Update CONTRIBUTING.md

0688fcf

docs: improve documentation structure

9541088

Merge branch 'master' of https://github.com/dataabc/weiboSpider

6b54b93

Create settings.md

c443934

Create automation.md

d408d8a

Update README.md

3802319

Update example.md

661158b

jerrylaikr and others added 28 commits July 29, 2022 17:29

update mongo_config

1e879cf

Merge pull request dataabc#462 from jerrylaikr/fix-mongo-no-format

3c8b5e5

Update MongoDB settings and related documentation

correct user's profile url when fetching weibo

a51e4a8

Issue#482

Merge pull request dataabc#483 from huaiwens/profile_url

e144dcb

correct user's profile url when fetching weibo

perf: update absl-py lock version

2a1400b

Merge pull request dataabc#492 from linbuxiao/update-deps

6b68775

perf: update absl-py lock version

Fix dataabc#484: run correctly when there are 2 pinned weibo.

6d4f5f9

Fix: fix pytest by fixing the url map of testdata.

55ddb32

The new URL format was introduced by dataabc@a51e4a8

Merge pull request dataabc#525 from songzy12/master

5e1d06d

Fix the tool to run correctly when there are 2 pinned weibo.

Merge pull request dataabc#526 from dataabc/dependabot/pip/requests-2…

678154f

….31.0 build(deps): bump requests from 2.23.0 to 2.31.0

Update the stale action rule to not mark issues with assignees as stale.

6ec6882

Reference: https://github.com/probot/stale

Merge pull request dataabc#530 from songzy12/master

ae750bd

Update the stale action rule to not mark issues with assignees as stale.

Fix the crawling of toutiao article urls.

829a891

Merge pull request dataabc#536 from songzy12/ttarticle

22d8d03

Fix the crawling of toutiao article urls.

issues_bug_574 unable to match and fetch long-form weibo; attempted fix

ac550e0

issues_bug_574 unable to match and fetch long-form weibo; attempted fix

8c4eb7f

Merge pull request dataabc#575 from myshero/issues_bug_574

3098e61

issues_bug_574 improve fetching of long-form weibo

issues_feature_post_api_576 implement pushing data to a custom endpoint via POST

2c7f723

issues_feature_post_api_576 implement pushing data to a custom endpoint via POST

c9cf218

Merge pull request dataabc#577 from myshero/feature-post-to-custom-api

8ce5f32

issues_feature_post_api_576 implement pushing data to a custom endpoint via POST

Merge pull request dataabc#580 from dataabc/dependabot/pip/tqdm-4.66.3

4cbe86a

build(deps): bump tqdm from 4.46.1 to 4.66.3

Merge pull request dataabc#583 from zangwill/patch-1

bf90933

Fix 'NoneType' object has no attribute 'xpath'

496c09c

--- updated-dependencies: - dependency-name: requests dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com>

Merge pull request dataabc#585 from dataabc/dependabot/pip/requests-2…

c1d43e8

….32.0 build(deps): bump requests from 2.31.0 to 2.32.0

ydotdog force-pushed the fix/handle-blocked-accounts-and-proxy-support branch from 501028e to 8c49c81 Compare April 23, 2026 21:54

ydotdog wants to merge 416 commits into dataabc:master from ydotdog:fix/handle-blocked-accounts-and-proxy-support


18 participants
