I tried two approaches to utilize multiple channels.
(1) The first was simply setting controllers=4 in the cfg file.
mem = {
controllers = 4;
type = "DDR";
ranksPerChannel = 4;
banksPerRank = 8;
tech="DDR4-3200-CL22";
};
Compared to controllers=1 (below), there was no significant difference. The metric I focus on is the per-core IPC derived from the output file zsim.out.
mem = {
controllers = 1;
type = "DDR";
ranksPerChannel = 4;
banksPerRank = 8;
tech="DDR4-3200-CL22";
};
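For reference, a minimal sketch of the IPC calculation. It assumes the per-core counters in zsim.out are `instrs`, `cycles`, and `cCycles` (contention-stall cycles), and that IPC is instructions divided by the sum of both cycle counters. The helper name `computeIpc` is hypothetical and not part of zsim.

```cpp
#include <cstdint>

// Hypothetical helper, not part of zsim.
// Assumption: total core cycles = cycles + cCycles, both read from zsim.out.
double computeIpc(uint64_t instrs, uint64_t cycles, uint64_t cCycles) {
    const uint64_t total = cycles + cCycles;
    if (total == 0) {
        return 0.0;  // avoid division by zero for a core that never ran
    }
    return static_cast<double>(instrs) / static_cast<double>(total);
}
```

Computing the metric the same way for both configurations rules out a bookkeeping difference as the reason the numbers match.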
(2) The second approach is modeled on the implementation in banshee[https://github.com/yxymit/banshee]. It creates an array of four DDR channels to form a multi-channel DDR (mcdram). Memory requests are distributed across the channels: the channel that handles each request is selected by a modulo operation on the number of channels.
_mcdram = (MemObject **) gm_malloc(sizeof(MemObject *) * _mcdram_per_mc);
for (uint32_t i = 0; i < _mcdram_per_mc; i++) {
g_string mcdram_name = _name + g_string("-mc-") + g_string(to_string(i).c_str());
// ...
} else if (_mcdram_type == "DDR") {
// XXX HACK tBL for mcdram is 1, so for data access, should multiply by 2, for tad access, should multiply by 3.
_mcdram[i] = BuildDDRMemory(config, frequency, domain, mcdram_name, "sys.mem.mcdram.", 1, timing_scale);
}//....
Address address = req.lineAddr;
uint32_t mcdram_select = (address / 64) % _mcdram_per_mc;
Address mc_address = (address / 64 / _mcdram_per_mc * 64) | (address % 64);
//...
if (_scheme == CacheOnly) {
req.lineAddr = mc_address;
req.cycle = _mcdram[mcdram_select]->access(req, 0, 4);
req.lineAddr = address;
_numLoadHit.inc();
futex_unlock(&_lock);
return req.cycle;
}
//...
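To check how this mapping spreads requests, the arithmetic from the excerpt above can be isolated into a standalone sketch. The names `mapToChannel`, `ChannelMapping`, and `channelHistogram` are hypothetical; only the arithmetic is taken from the excerpt.

```cpp
#include <cstdint>
#include <vector>

// Interleaving granularity used by the excerpt (the literal 64).
constexpr uint64_t kGranule = 64;

struct ChannelMapping {
    uint32_t channel;    // index of the channel that serves the request
    uint64_t localAddr;  // address as seen inside that channel
};

// Same arithmetic as the excerpt: select by (addr / 64) % numChannels,
// then compact the address so each channel sees a dense range.
ChannelMapping mapToChannel(uint64_t addr, uint32_t numChannels) {
    const uint64_t granule = addr / kGranule;
    ChannelMapping m;
    m.channel = static_cast<uint32_t>(granule % numChannels);
    m.localAddr = (granule / numChannels * kGranule) | (addr % kGranule);
    return m;
}

// Counts how many of `count` consecutive addresses, starting at `start`,
// land on each channel.
std::vector<uint64_t> channelHistogram(uint64_t start, uint64_t count,
                                       uint32_t numChannels) {
    std::vector<uint64_t> hist(numChannels, 0);
    for (uint64_t i = 0; i < count; i++) {
        hist[mapToChannel(start + i, numChannels).channel]++;
    }
    return hist;
}
```

With four channels, `channelHistogram(0, 64, 4)` puts all 64 consecutive addresses on channel 0, and the next channel is only reached at address 64. If `req.lineAddr` is already in cache-line units (my understanding, not verified here), this means the channels are interleaved at a granularity of 64 lines rather than one line, so short sequential bursts would stay on a single channel.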
Unfortunately, I still observed almost identical performance (IPC) compared to the pure DDR setup with controllers=1.
To understand the problem better, I went through several past issues. For instance, I experimented with modifying tCK to increase bandwidth and with adjusting tBL. These changes had some effect, but the improvement was not significant. I also examined the zsim-ndp[https://github.com/CriusT/zsim-ndp] implementation of MemChannel[https://github.com/CriusT/zsim-ndp/blob/master/src/mem_channel.cpp], but ran into the same lack of improvement. Changing the memory interleaving scheme did not help either.
I added debugging information in the **trySchedule** function of **ddr_mem.cpp**. By comparing the debug output, I found that the two aforementioned methods for constructing multi-channel DDR systems exhibited almost identical r->arrivalCycle sequences. When timing parameters such as tBL were modified, only numerical differences appeared, but the pattern remained largely consistent.
uint64_t DDRMemory::trySchedule(uint64_t curCycle, uint64_t sysCycle) {
//...
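// Print the address in hex to match the "0x" prefix, then switch back to decimal.
std::cout << curCycle << " Found ready request 0x" << std::hex << r->addr << std::dec << " r->arrCycle= " << r->arrivalCycle << std::endl;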
//...
}
I encountered a similar issue when using the gem5 simulator. This raises the question: are these discrete-event-driven simulators inherently limited in how accurately they can model the parallelism of multi-channel memory systems, in particular the ability to exploit high bandwidth through concurrency?
Thank you for any useful suggestions!