8 changes: 4 additions & 4 deletions pr_agent/tools/pr_similar_issue.py
@@ -445,7 +445,7 @@ def _update_index_with_issues(self, issues_list, repo_name_for_index, upsert=Fal
if len(comment_body) < 8000 or \
self.token_handler.count_tokens(comment_body) < MAX_TOKENS[MODEL]:
comment_record = Record(
-                    id=issue_key + ".comment_" + str(j + 1),
+                    id=issue_key + ".comment_" + str(j),
text=comment_body,
metadata=Metadata(repo=repo_name_for_index,
username=username, # use issue username for all comments
@@ -541,7 +541,7 @@ def _update_table_with_issues(self, issues_list, repo_name_for_index, ingest=Fal
if len(comment_body) < 8000 or \
self.token_handler.count_tokens(comment_body) < MAX_TOKENS[MODEL]:
comment_record = Record(
-                    id=issue_key + ".comment_" + str(j + 1),
+                    id=issue_key + ".comment_" + str(j),
text=comment_body,
metadata=Metadata(repo=repo_name_for_index,
username=username, # use issue username for all comments
@@ -639,7 +639,7 @@ def _update_qdrant_with_issues(self, issues_list, repo_name_for_index, ingest=Fa
if len(comment_body) < 8000 or \
self.token_handler.count_tokens(comment_body) < MAX_TOKENS[MODEL]:
comment_record = Record(
-                    id=issue_key + ".comment_" + str(j + 1),
+                    id=issue_key + ".comment_" + str(j),
text=comment_body,
metadata=Metadata(repo=repo_name_for_index,
username=username,
@@ -673,7 +673,7 @@ def _update_qdrant_with_issues(self, issues_list, repo_name_for_index, ingest=Fa
points = []
for row in df.to_dict(orient="records"):
points.append(
-                PointStruct(id=uuid.uuid5(uuid.NAMESPACE_DNS, row["id"]).hex, vector=row["vector"], payload={"id": row["id"], "text": row["text"], "metadata": row["metadata"]})
+                PointStruct(id=uuid.uuid5(uuid.NAMESPACE_DNS, f"{repo_name_for_index}:{row['id']}").hex, vector=row["vector"], payload={"id": row["id"], "text": row["text"], "metadata": row["metadata"]})
Comment thread
qodo-free-for-open-source-projects[bot] marked this conversation as resolved.
3. Qdrant upsert creates duplicates 🐞 Bug ≡ Correctness

Changing the Qdrant PointStruct.id generation to include repo_name_for_index alters the point IDs
for repos that are already indexed, so re-ingesting into an existing collection adds a second copy
of each point instead of overwriting the old one. Because querying/parsing reads payload["id"]
rather than the Qdrant point ID, the duplicates are indistinguishable downstream and can reduce
result diversity (they consume slots in top_k).
Agent Prompt
### Issue description
Qdrant point IDs changed (UUID seed now includes `repo_name_for_index`), so re-ingesting an already-indexed repo into an existing collection will create duplicate points rather than overwrite the existing ones.

### Issue Context
- `_update_qdrant_with_issues()` always calls `qdrant.upsert(...)` with newly generated point IDs.
- When re-indexing (e.g., `force_update_dataset`), the code does not delete existing points for that repo first.
- Querying later reads `payload['id']`, so duplicates remain queryable and can consume the top_k.

### Fix Focus Areas
- pr_agent/tools/pr_similar_issue.py[212-257]
- pr_agent/tools/pr_similar_issue.py[587-679]

### What to change
1. Before calling `_update_qdrant_with_issues(..., ingest=True)` for an already-existing collection, delete existing points for that repo using a filter on `metadata.repo == repo_name_for_index` (and optionally `level in {issue,comment}` if needed).
2. Alternatively (or additionally), set `PointStruct.id` to a stable, human-readable, repo-scoped string like `f"{repo_name_for_index}:{row['id']}"` (Qdrant supports string point IDs) to avoid needing UUIDs and make overwrite semantics explicit.
3. Add a small log/metric indicating how many points were deleted before reindexing so operators can validate migration behavior.
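
The ID drift described above can be illustrated with a stdlib-only sketch (the record and repo names below are hypothetical, not taken from the PR): `uuid.uuid5` is deterministic per seed string, so changing the seed from `row["id"]` to `f"{repo_name_for_index}:{row['id']}"` produces a different point ID for the same logical record, and an upsert keyed on the new ID inserts a new point rather than overwriting the old one.

```python
import uuid

# Hypothetical record id and repo name, shaped like those built during ingestion
record_id = "myorg/myrepo/issues/42.comment_0"
repo = "myorg-myrepo"

# Old seeding scheme: point ID derived from the record id alone
old_point_id = uuid.uuid5(uuid.NAMESPACE_DNS, record_id).hex

# New seeding scheme: point ID derived from a repo-scoped seed
new_point_id = uuid.uuid5(uuid.NAMESPACE_DNS, f"{repo}:{record_id}").hex

# uuid5 is deterministic: the same seed always yields the same ID...
assert old_point_id == uuid.uuid5(uuid.NAMESPACE_DNS, record_id).hex

# ...but a different seed yields a different ID, so an upsert under
# new_point_id will not replace the point stored under old_point_id:
# the collection ends up holding both copies of the same record.
assert old_point_id != new_point_id
```

This is why the fix needs either a delete-by-repo-filter before re-ingesting, or point IDs that stay stable across schemes.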

)
self.qdrant.upsert(collection_name=self.index_name, points=points)
get_logger().info('Done')
1 change: 1 addition & 0 deletions requirements.txt
@@ -39,6 +39,7 @@ giteapy==1.0.8
# pinecone-datasets @ git+https://github.com/mrT23/pinecone-datasets.git@main
# lancedb==0.5.1
# qdrant-client==1.15.1
+# pandas # required by qdrant indexing path
1. requirements.txt adds commented pandas 📘 Rule violation ⚙ Maintainability

The PR introduces a new commented-out dependency line for pandas, which is inactive code and
violates the no-commented-out-code requirement. This can also cause runtime failures for users who
enable Qdrant without actually installing pandas.
Agent Prompt
## Issue description
A new commented-out dependency line (`# pandas ...`) was added to `requirements.txt`, violating the requirement to avoid commented-out code and leaving the dependency inactive.

## Issue Context
The code path `_update_qdrant_with_issues` imports/uses `pandas`, so leaving it commented relies on manual user action and can still lead to runtime `ImportError`.

## Fix Focus Areas
- requirements.txt[42-42]
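
One way to address the inactive dependency without forcing pandas on every install (a sketch under assumptions, not the project's actual code; `_require_module` is a hypothetical helper) is to keep pandas optional but fail fast with an actionable message when the Qdrant path is entered:

```python
import importlib


def _require_module(name: str, hint: str):
    """Import an optional dependency, raising a clear error if it is absent."""
    try:
        return importlib.import_module(name)
    except ImportError as e:
        raise ImportError(
            f"{name} is required for {hint}; install it with `pip install {name}`"
        ) from e


# Hypothetical call site at the top of the Qdrant indexing path:
# pd = _require_module("pandas", "the Qdrant indexing path")
```

This keeps requirements.txt free of commented-out lines while turning a late `ImportError` deep inside `_update_qdrant_with_issues` into an immediate, self-explanatory failure.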


# uncomment this to support language LangChainOpenAIHandler
# langchain==0.2.0
# langchain-core==0.2.28