fix: improve crawling of blocked accounts and add proxy support #694
Open
ydotdog wants to merge 416 commits into
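If the user_id_list parameter in config.json is a file path, it can be either an absolute path or a path relative to the **current command-line directory**. For example, when the program is run from /home/test, user_id_list.txt can be placed in that directory, and its relative path (the value of user_id_list in config.json) is "user_id_list.txt". Issue dataabc#160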
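The path rule in this commit message can be illustrated with a minimal sketch. It is not the project's actual code: `resolve_user_id_list` and its `cwd` parameter are hypothetical names used only to show how a relative value resolves against the directory the command is run from.

```python
import os


def resolve_user_id_list(value, cwd=None):
    """Return an absolute path for a user_id_list file path.

    Hypothetical helper: absolute paths are returned as they are,
    relative paths are joined to the current working directory.
    """
    cwd = os.getcwd() if cwd is None else cwd
    if os.path.isabs(value):
        return os.path.normpath(value)
    return os.path.normpath(os.path.join(cwd, value))
```

With `cwd="/home/test"`, the value "user_id_list.txt" resolves to a file inside /home/test, which matches the example in the message.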
1. csv, txt, json, mongo, mysql writer verified. 2. img, video downloader verified.
Update MongoDB settings and related documentation
correct user's profile url when fetching weibo
perf: update absl-py lock version
The new URL format was introduced by dataabc@a51e4a8
Fix the tool to run correctly when there are 2 pinned weibo.
Bumps [requests](https://github.com/psf/requests) from 2.23.0 to 2.31.0. - [Release notes](https://github.com/psf/requests/releases) - [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md) - [Commits](psf/requests@v2.23.0...v2.31.0) --- updated-dependencies: - dependency-name: requests dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com>
….31.0 build(deps): bump requests from 2.23.0 to 2.31.0
Update the stale action rule to not mark issues with assignees as stale.
Fix the crawling of toutiao article urls.
issues_bug_574 Optimize fetching of long Weibo posts
issues_feature_post_api_576 Push data to a custom endpoint via POST
Bumps [tqdm](https://github.com/tqdm/tqdm) from 4.46.1 to 4.66.3. - [Release notes](https://github.com/tqdm/tqdm/releases) - [Commits](tqdm/tqdm@v4.46.1...v4.66.3) --- updated-dependencies: - dependency-name: tqdm dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com>
build(deps): bump tqdm from 4.46.1 to 4.66.3
The handle_html function returns an Element or None, but `page_parser.py` does not handle the None case
```py
'NoneType' object has no attribute 'xpath'
Traceback (most recent call last):
  File "C:\Users\zw\miniconda3\lib\site-packages\weibo_spider\spider.py", line 178, in get_weibo_info
    weibos, self.weibo_id_list, to_continue = PageParser(
  File "C:\Users\zw\miniconda3\lib\site-packages\weibo_spider\parser\page_parser.py", line 47, in __init__
    info = self.selector.xpath("//div[@class='c']")
AttributeError: 'NoneType' object has no attribute 'xpath'
```
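The failure in the traceback comes from calling a method on a selector that may be None. A minimal sketch of the guard follows; it is an illustration, not the patch itself. The traceback shows the project calling `.xpath`; here the standard library's `xml.etree.ElementTree` and its `findall` stand in so the sketch runs without extra dependencies, and `find_weibo_divs` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def find_weibo_divs(selector):
    """Return the <div class="c"> nodes, or [] when the page failed to load.

    `selector` stands in for the value returned by handle_html, which
    the commit message says may be an Element or None.
    """
    if selector is None:
        # Nothing was fetched, so report "no posts" instead of raising.
        return []
    return selector.findall(".//div[@class='c']")


# A tiny stand-in page with one post container.
page = ET.fromstring("<html><body><div class='c'>post</div></body></html>")
```

Returning an empty list keeps the caller on the normal code path, so a failed fetch looks like a page without posts rather than a crash.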
Fix 'NoneType' object has no attribute 'xpath'
--- updated-dependencies: - dependency-name: requests dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com>
….32.0 build(deps): bump requests from 2.31.0 to 2.32.0
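1. Add global proxy configuration support (set the proxy field in config.json)
2. handle_html now handles status codes such as 403 and 432 differently, each with its own retry strategy
3. info_parser supports both HTML structures: viewing your own profile page and viewing someone else's
4. Education and work history extraction now locates elements with following-sibling, which is more robust
5. page_parser returns an empty list instead of None on exception, avoiding unpacking errors in the caller
6. File downloads and video URL retrieval both support the proxy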
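The per-status handling in item 2 might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this PR: `fetch_with_status_policy`, the injected `fetch` callable returning `(status_code, body)`, the retry count, and the wait times are all hypothetical.

```python
import time


def fetch_with_status_policy(fetch, url, max_retries=3, wait_403=60,
                             sleep=time.sleep):
    """Call fetch(url) -> (status_code, body) and apply a per-status policy.

    Assumed policy, following the description above:
      200   -> return the body
      432   -> UA rejected: return None at once, retrying will not help
      403   -> IP restricted: wait longer, then retry
      other -> short wait, then retry
    """
    for attempt in range(max_retries):
        status, body = fetch(url)
        if status == 200:
            return body
        if status == 432:
            print("432: User-Agent rejected, please update the UA")
            return None
        if attempt < max_retries - 1:
            # Assumed wait times: long for 403, one second otherwise.
            sleep(wait_403 if status == 403 else 1)
    return None
```

Passing `fetch` and `sleep` in as arguments keeps the policy testable without network access or real delays.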
Owner
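Thank you for contributing code. This program is meant to help people collect small amounts of data for academic research, not to crawl Weibo at scale, so there are no plans to add proxy support and this code will not be merged. I hope you understand. Although the code was not merged, thank you for your generous contribution.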
501028e to
8c49c81
Compare
Summary
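When crawling accounts that Weibo has blocked, the existing code fails to parse pages or has its requests rejected in several places. This PR makes the following improvements:

- Global proxy support (add a `proxy` field to config.json): all HTTP requests (pages, file downloads, video URL retrieval) can go through the proxy
- `handle_html` waits longer before retrying on 403 (IP restricted), and on 432 (UA rejected) returns immediately with a prompt to update the UA
- `info_parser` handles both HTML structures: viewing someone else's profile page and viewing your own
- Education and work history extraction now uses `following-sibling` XPath positioning instead of hard-coded div indexes, which is more robust
- `page_parser.get_one_page` returns an empty list on exception instead of an implicit `None`, avoiding unpacking errors in `spider.py`

Changed files

- weibo_spider/parser/util.py
- weibo_spider/parser/info_parser.py
- weibo_spider/parser/page_parser.py
- weibo_spider/downloader/downloader.py
- weibo_spider/spider.py

Usage

Add the proxy setting to config.json (optional): `{ "proxy": "http://127.0.0.1:7890" }`

When the `proxy` field is not set, behavior is identical to the original version.

Test plan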
🤖 Generated with Claude Code
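For the optional `proxy` setting described under Usage, a minimal sketch of how the field could be consumed is shown below. `build_proxies` is a hypothetical helper, not a function from this PR; the only assumption about the config is the single `proxy` key shown in the description.

```python
import json


def build_proxies(config_text):
    """Turn the optional "proxy" field into a requests-style proxies dict.

    Returns None when the field is absent or empty, so callers keep the
    default (no proxy) behavior.
    """
    proxy = json.loads(config_text).get("proxy")
    if not proxy:
        return None
    # Use the same proxy for both plain and TLS traffic.
    return {"http": proxy, "https": proxy}
```

The result could then be passed as the `proxies` argument of `requests.get`; with `None`, the call behaves as it would without the argument, which matches the "identical to the original version" note.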