This repository was archived by the owner on Oct 31, 2023. It is now read-only.

mismatch between encoded results and wiki passages #250

Description

@Hannibal046

Hi, thanks so much for the great work. I have a question about the number of wiki passages versus the number of encoded embeddings in the index. After downloading the data as instructed, I found that the two counts don't match:

import csv
import pickle

# Count the embeddings stored across the 50 index shards.
n_embedding = 0
for idx in range(50):
    index_path = f"DPR/dpr/downloads/data/retriever_results/nq/single/wikipedia_passages_{idx}.pkl"
    with open(index_path, "rb") as f:
        data = pickle.load(f)
    n_embedding += len(data)

# Count the passages in the TSV file, skipping the header row.
n_doc = 0
wikidata_path = "DPR/dpr/downloads/data/wikipedia_split/psgs_w100.tsv"
with open(wikidata_path) as f:
    reader = csv.reader(f, delimiter="\t")
    for row in reader:
        if row[0] == "id":
            continue
        n_doc += 1

print("n_embedding=", n_embedding)
print("n_doc=", n_doc)

The results differ by 24 passages:

n_embedding= 21015300
n_doc= 21015324
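
Comparing ids instead of counts might show which passages have no embedding. Below is a minimal sketch, not something taken from the DPR code: it assumes each shard unpickles to a list of `(passage_id, vector)` pairs and that the first TSV column holds the passage id. The helper names `find_missing_ids`, `iter_tsv_ids`, and `iter_shard_ids` are hypothetical.

```python
import csv
import pickle


def find_missing_ids(passage_ids, embedded_ids):
    """Return the passage ids that have no embedding, as sorted strings."""
    # Ids are normalized to str so that 5 and "5" compare equal.
    embedded = {str(i) for i in embedded_ids}
    return sorted({str(i) for i in passage_ids} - embedded)


def iter_tsv_ids(tsv_path):
    """Yield the id column of the passage TSV, skipping the header row."""
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if row and row[0] != "id":
                yield row[0]


def iter_shard_ids(shard_paths):
    """Yield ids from each shard.

    ASSUMPTION: every shard is a list of (passage_id, vector) pairs.
    """
    for path in shard_paths:
        with open(path, "rb") as f:
            for passage_id, _vector in pickle.load(f):
                yield passage_id
```

With the paths from the script above, `find_missing_ids(iter_tsv_ids(wikidata_path), iter_shard_ids(shard_paths))` would list the ids to inspect, where `shard_paths` is the list of the 50 `.pkl` paths. Both id sets are held in memory at once, so this needs a machine with enough RAM for roughly 21 million ids per set. If the shard ids carry a prefix that the TSV ids lack, they would need to be normalized before comparing.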
