Add an NTCIREVAL provider. It includes several measures that the software does not yet cover.
Write a Python interface to NTCIREVAL or use pyNTCIREVAL?
There's pyNTCIREVAL, which would certainly be easier to incorporate. The downside is that it's a port of the software to Python rather than an interface to it (the way pytrec_eval is an interface to trec_eval), so it can drift from the official implementation.
Writing a new Python interface to NTCIREVAL would be a lot of work (judging from a cursory look over the code), and would live in a separate repo if this route is pursued. An advantage of this route is that it should (theoretically) be easier to support new measures as they are incorporated into the official software. This direction should be given due consideration, as NTCIREVAL is updated periodically.
Supported Measures
From this page:
- Average Precision
- Q-measure
- nDCG
- Expected Reciprocal Rank (ERR)
- Graded Average Precision (GAP)
- Rank-Biased Precision (RBP)
- Expected Blended Ratio (EBR)
- intentwise Rank-Biased Utility (iRBU)
- Normalised Cumulative Utility (NCU)
- Condensed-List versions of the above metrics
- Bpref
- D#-measures and DIN#-measures for diversity evaluation
- Intent-Aware (IA) metrics and P+Q# for diversity evaluation
(Looks like there are others too)
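To give a feel for what a couple of the listed measures compute, here is a minimal sketch of RBP and ERR in their textbook formulations. The function names, the default persistence `p=0.8`, and the exponential gain mapping for ERR are assumptions made for illustration; NTCIREVAL's own gain values and options may differ, so this is not a substitute for the official implementation.

```python
def rbp(gains, p=0.8):
    """Rank-Biased Precision over a ranked list of gains in [0, 1].

    Assumes the standard definition: (1 - p) * sum_i gain_i * p^(i-1),
    where p is the user persistence parameter.
    """
    return (1 - p) * sum(g * p ** i for i, g in enumerate(gains))


def err(grades, max_grade):
    """Expected Reciprocal Rank over a ranked list of integer grades.

    Assumes the common mapping from grade to stopping probability:
    (2^grade - 1) / 2^max_grade.
    """
    score = 0.0
    p_continue = 1.0  # probability the user reaches the current rank
    for rank, grade in enumerate(grades, start=1):
        p_stop = (2 ** grade - 1) / 2 ** max_grade
        score += p_continue * p_stop / rank
        p_continue *= 1 - p_stop
    return score


if __name__ == "__main__":
    print(rbp([1, 0, 1], p=0.5))
    print(err([0, 1], max_grade=1))
```

Whichever route is taken, a provider could be sanity-checked against small hand-computable cases like these before comparing against the official tool's output.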