Hi,
I am trying to run the quantized model in real-quant mode rather than fake quant. Is `load_quantized_model` from `quantize.int_linear_real` the correct way to load the model? I have not been able to get this function to run successfully. Also, if the accuracy reported by `eval.py` is measured with fake quant (running in FP16), can we expect the same accuracy when running the real W8A8 quantized model?
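
To make the distinction concrete, here is a minimal pure-Python sketch of what I mean by fake versus real quant. It is not this repository's code: the names `quantize`, `fake_quant_dot`, and `real_quant_dot` are hypothetical, and it assumes symmetric per-tensor 8-bit quantization of both weights and activations.

```python
import random


def quantize(values, num_bits=8):
    """Symmetric per-tensor quantization (assumed scheme): returns (int codes, scale)."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for 8 bits
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / qmax
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def fake_quant_dot(w, x):
    """Fake quant: quantize, dequantize back to float, then do a float dot product."""
    wq, ws = quantize(w)
    xq, xs = quantize(x)
    w_dq = [q * ws for q in wq]
    x_dq = [q * xs for q in xq]
    return sum(a * b for a, b in zip(w_dq, x_dq))


def real_quant_dot(w, x):
    """Real quant: integer dot product with integer accumulation, rescaled once."""
    wq, ws = quantize(w)
    xq, xs = quantize(x)
    acc = sum(a * b for a, b in zip(wq, xq))  # exact integer arithmetic
    return acc * ws * xs


random.seed(0)
w = [random.gauss(0, 1) for _ in range(256)]
x = [random.gauss(0, 1) for _ in range(256)]

fake = fake_quant_dot(w, x)
real = real_quant_dot(w, x)
print("fake quant:", fake)
print("real quant:", real)
print("difference:", abs(fake - real))
```

In this sketch both paths use the same integer codes and scales, so they differ only by floating-point rounding. My understanding is that in the actual model the fake-quant path multiplies and accumulates in FP16 while a real W8A8 kernel accumulates in integers, so small differences could appear, which is why I am asking whether the accuracy carries over.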