LMB: Augmenting PCIe Devices with CXL-Linked Memory Buffer

Jiapin Wang, Xiangping Zhang, Chenlei Tang, Xiang Chen, and Tao Lu (DapuStor Corporation)
Abstract.

PCIe devices, such as SSDs and GPUs, are pivotal in modern data centers, and their value is set to grow amid the emergence of AI and large models. However, these devices face onboard DRAM shortages due to internal space limitations, which prevent them from accommodating sufficient DRAM modules alongside flash or GPU processing chips. Current solutions, which either curb device-internal memory usage or supplement it with slower non-DRAM media, prove inadequate or compromise performance. This paper introduces the Linked Memory Buffer (LMB), a scalable solution that utilizes a CXL memory expander to tackle device onboard memory deficiencies. The low latency of CXL enables LMB to use emerging DRAM memory expanders to efficiently supplement device onboard DRAM with minimal impact on performance.

CXL, Index, Memory Expander, Memory Buffer, SSD

1. Introduction

On-board DRAM shortage in PCIe devices. Key PCIe devices like SSDs, GPUs, and DPUs in big data retrieval (xdp_2021, ), artificial intelligence (G10_2023, ; bam2023, ), and near-data processing (gu2016biscuit, ; wilkening2021recssd, ; liang2019cognitive, ) face DRAM shortage challenges. High-density, large-capacity QLC SSDs (solidigm_d5p5336, ) are forced to use larger 16KB pages instead of 4KB due to the DRAM shortage, suffering increased write amplification. The low indexing efficiency of KV-SSDs (jin2017kaml, ; kang2019towards, ; im2020pink, ), caused by the lack of memory, hampers their adoption. Memory-semantic SSDs (mem_semantic_ssd, ; CXLssd_2022, ), which aim to blend DRAM accessibility and flash durability into a single-tier memory solution, are under development and will potentially exacerbate the memory shortage issue. Insufficient GPU DRAM for large AI models necessitates reliance on SSDs for additional storage (G10_2023, ; bam2023, ) at the expense of performance. Likewise, DPUs like XDP (xdp_2021, ) rely on host memory for substantial index storage, with the on-board memory playing only a minor buffer/cache role due to its limitations.

Figure 1. Internal media layout of a commercial SSD. Obviously, there is no room for more DDR in the “Inn”.

Root cause: there is no room for more DRAM in the “Inn”. An enterprise SSD’s standard DRAM allocation is 0.1% of its capacity for 4KB page mapping. Mainstream DRAM technology caps SSD internal memory at 32GB, while QLC technology affords more than 32TB of storage in a U.2 form factor. Nevertheless, integrating additional DRAM into SSDs is a challenge. The SSD onboard DRAM shortage stems from spatial constraints: DRAM must sit close to the SSD controller, much like server memory’s proximity to the CPU socket, which constrains DRAM expansion. Figure 1 illustrates the spatial restrictions on the DRAM’s placement in a dual-board SSD, stressing the limited peripheral space near the controller. The growth rate of DRAM density also lags behind that of flash memory, exacerbating the insufficiency of SSD onboard DRAM. Furthermore, the DRAM capacity of GPUs is likewise impeded by spatial restrictions.
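As a back-of-the-envelope check of the 0.1% rule, the sketch below computes the L2P index footprint for a given capacity and mapping granularity; the assumed 4-byte mapping entry is our illustration, not a vendor specification.

#include <stdint.h>
#include <stdio.h>

/* L2P index footprint: one mapping entry per page; entry size assumed 4 bytes. */
static uint64_t l2p_bytes(uint64_t capacity, uint64_t page, uint64_t entry)
{
    return capacity / page * entry;
}

int main(void)
{
    const uint64_t TB = 1ULL << 40;
    /* A 32TB QLC SSD needs ~32GB of DRAM with 4KB pages (0.1% of capacity);
       a 16KB indirection unit cuts this to ~8GB at the cost of write amplification. */
    printf("4KB  pages: %llu GB\n", (unsigned long long)(l2p_bytes(32 * TB, 4096, 4) >> 30));
    printf("16KB pages: %llu GB\n", (unsigned long long)(l2p_bytes(32 * TB, 16384, 4) >> 30));
    return 0;
}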

Existing DRAM conservation and extension solutions possess notable constraints. Two primary strategies manage internal DRAM shortages: demand suppression and resource supplementation, subdivided into either leveraging internal/external flash or host DRAM.

DFTL (gupta2009dftl, ) uses flash instead of DRAM for L2P indexing. However, it suffers performance limitations due to double flash reads, one for the index and another for the data, and is thus only suitable for mobile devices. SFTL (jiang2011sftl, ) and LeaFTL (sun2023leaftl, ) reduce the L2P footprint through index compression, which comes with efficiency side effects and application-specific limitations. DRAM-less consumer SSDs with a host memory buffer (HMB) underperform in data centers. SSDs adopting large page sizes shrink the index at the expense of write performance and media lifespan. KV-SSDs mitigate DRAM constraints by sacrificing part of the storage or adopting an LSM-tree flash index (im2020pink, ).

Unified virtual memory (uvm, ) is a common practice for addressing GPU memory shortage, but it incurs memory swap overhead. Recent work demonstrates high-performance SSDs for memory extension (G10_2023, ; bam2023, ), surpassing CUDA unified memory in capacity. However, the SSD’s slower access speed compared to DRAM creates performance bottlenecks (bam2023, ).

A universal and scalable solution: linked memory buffer (LMB). As an SSD supplier, we have faced significant hurdles with memory overhead in large-capacity SSDs. Having witnessed the limitations of existing DRAM demand suppression and resource supplementation mechanisms, we believe a system-level architectural solution is needed to address PCIe devices’ memory shortage. Driven by the advent of the CXL high-speed interconnect protocol, LMB trades time for space to combat device space scarcity. It dynamically expands PCIe device memory, permitting memory sharing between CXL and PCIe devices based on efficient P2P access or host forwarding.

LMB challenges. Significant issues include dynamic memory allocation, resource optimization, shared resource isolation, access control, cross-device migration, and data security to prevent single points of failure. The dilemma lies between pre-reserving memory for guaranteed availability and allocating it on demand for efficiency. A single failure in the memory expander can render all dependent devices unavailable. Performance interference from multiple devices accessing shared memory adds further complexity. Addressing these issues is critical.

Our vision. We propose LMB, a CXL-based memory extension framework and kernel module that uses a CXL memory expander as the physical DRAM source. It aims to provide a unified memory allocation interface for PCIe and CXL devices. This permits the NVMe and CUDA unified memory kernel drivers to access the CXL memory expander directly and efficiently, enabling SSD and GPU devices to utilize LMB’s memory resources as effortlessly as on-board memory.

2. Background and Related Work

2.1. Addressing SSD DRAM Shortage

Flash-backed secondary index. One way to overcome the memory wall is to swap indexes to NAND flash. Demand-based FTL (DFTL) (gupta2009dftl, ) caches frequently used L2P entries in DRAM while retaining the remaining entries in flash. Spatial-locality-aware FTL (SFTL) (jiang2011sftl, ) exploits the spatial locality of strictly sequential accesses to compress L2P entries. These methods are prone to performance degradation due to the latency gap between flash memory and DRAM.

Supplementing device memory with a host memory buffer (HMB). The NVMe 1.2 standard (nvme1_2, ) allows an SSD to augment its memory with the host’s DRAM through PCIe DMA. To offset the performance degradation caused by the limited SRAM available for L2P indexing in mobile flash modules (kim2023integrated, ), Host Performance Booster (HPB) places L2P entries in host memory. The HMB scheme strains host memory scalability and is thus only applicable when the DRAM requirement is modest (e.g., hundreds of MBs).

Index footprint reduction. As high-density QLC SSD products permeate the market, the need for expanded onboard DRAM to hold fine-grained L2P mapping tables becomes apparent. The Solidigm D5-P5336 QLC SSD (solidigm_d5p5336, ) addresses this with a 16KB Indirection Unit (IU), a coarse-grained mapping table that reduces DRAM consumption. Although cost-effective, coarse-grained mapping causes read-modify-write and write amplification issues that compromise SSD performance. As an alternative, LeaFTL (sun2023leaftl, ) uses piecewise linear regression for LPA-to-PPA mappings to minimize index memory overhead. However, it struggles with random writes, and the DRAM reduction cannot be guaranteed.

Key-value and memory-semantic SSDs. Initiatives like KV-SSDs improve system performance and software efficiency (jin2017kaml, ; kang2019towards, ; im2020pink, ; kwon2022vigil, ), but face challenges from extensive DRAM overheads. Memory-semantic SSDs (mem_semantic_ssd, ; CXLssd_2022, ) combine cost-effective, byte-addressable DRAM with SSDs for extensive flash space, using the memory as a CPU-accessible cache. However, they are constrained by DRAM size and cache hit ratios, with misses leading to latency spikes, and the spatial limitation persists because they share the SSD form factor.

2.2. Addressing GPU DRAM Shortage

Deep neural network (DNN) and large-scale model training on GPUs often encounter memory limitations (cano2018survey, ; li2016performance, ; wang2022merlin, ; kwon2018beyond, ; peng2018optimus, ). Mitigation strategies primarily involve unified GPU and host memory (allen2021depth, ) or SSD-enhanced GPU memory (G10_2023, ; bam2023, ).

CUDA unified memory. Larger host system memory can partially compensate for GPU memory constraints. This transparent cushioning is achieved via Unified Virtual Memory (UVM) (uvm, ; allen2021depth, ), which provides flexible data migration and accessibility. However, UVM proves inadequate for obviating GPU memory shortages when training on extensive datasets (e.g., LLMs) and incurs substantial host-GPU memory migration overhead.

SSD-GPU DRAM extension. BaM (bam2023, ) and G10 (G10_2023, ) overcome UVM’s data transfer concerns by integrating SSDs as external memory. BaM bypasses intermediary data copying by giving GPU threads direct SSD access. Beyond the software and driver changes needed to move NVMe queues and IO buffers to GPU memory, BaM demonstrates a marked decrease in SSD-to-GPU data load times in benchmark testing. However, its utility is less perceptible when the graph-processing time exceeds the data-loading time, as in complex image analysis tasks.

2.3. Promises of CXL

CXL was proposed to address the scalability constraints of host memory. It constructs disaggregated memory over PCIe to overcome the limitations of CPU memory channels and DIMM slots. Hence, we advocate using CXL’s scalable, disaggregated memory to extend both the NVMe HMB (nvme1_2, ) and unified memory infrastructures (uvm, ).

Figure 2. Estimated latency of PCIe Gen5 and CXL devices accessing host and CXL HDM memory (sharma2022compute, ; li2023pond, ).

CXL enables efficient Host-to-Device and Device-to-Device interconnectivity. By acquiring a Port Based Routing (PBR) ID when a host or device connects to an Edge Port of a CXL PBR switch, it can access Global Fabric-Attached Memory (GFAM) devices (CXL_spec, ). GFAM’s expansibility exceeds host memory and allows it to serve as device memory, mitigating interference with host applications and promoting device-driven enhancements. The direct P2P capability lets CXL devices bypass the host and access peer devices through a switch shortcut. CXL port latency is 25ns (sharma2022compute, ), switch latency including HDM access is 70ns (li2023pond, ), and PCIe 5.0 devices accessing host memory take 780ns, as presented in Figure 2. CXL could therefore prove instrumental in surpassing the memory hurdle.

3. Design of the LMB Framework

Table 1. CXL and LMB Terminology.
HDM: Host-managed Device Memory.
FAM: Fabric-Attached Memory; HDM within a Type 2/3 device that is accessible to multiple hosts.
GFD: Global FAM Device.
FM: Fabric Manager; controls aspects of the system related to binding and management of pooled ports and devices.
DPA: Device Physical Address.
DMP: Device Media Partition; DPA ranges with certain attributes.
PBR: Port Based Routing.
SPID: Source PBR ID.
SAT: SPID Access Table.

Building CXL memory pools to extend host memory has been widely studied (gouk2023memory, ; ha2023dynamic, ; li2023pond, ; kim2023smt, ), but extending device memory via CXL has received little attention. We propose a unified framework called the CXL Linked Memory Buffer (LMB) to extend device memory. The overall architecture of LMB is shown in Figure 3 and relevant terminology is listed in Table 1. The CXL memory expander is connected to a CXL switch, which exposes multiple pooled HDMs to hosts and devices while providing basic address mapping, access control, and other mechanisms. Based on the CXL protocol (CXL_spec, ), LMB mounts the Expander as a GFD that can provide memory expansion either for CXL devices attached to the CXL switch or for other PCIe devices of the host. More specifically, we plan to implement a unified framework in the kernel that supports both CXL and PCIe devices and provides APIs for device drivers, such as memory allocation and deallocation, access control, etc. A CXL device can access the Expander directly via P2P CXL.mem or via UIO access over CXL.io. For a PCIe device, its memory access requests are first converted into CXL.mem requests by the host and then redirected to the Expander.

Figure 3. Overall architecture of LMB.

3.1. LMB Component

CXL memory expander and fabric manager. The Expander serves as a GFD that provides global memory resources to hosts and CXL/PCIe devices across the entire fabric, supporting large-scale memory pooling and dynamic capacity management. It is directly managed by the Fabric Manager (FM). The FM, a key component of the CXL architecture, is responsible for managing and configuring devices and resources in the CXL fabric; it can be implemented as software on the host or as firmware on a switch or CXL device. Hosts on the fabric can query and configure the Expander’s state through FM APIs to realize dynamic memory allocation among multiple hosts. The Expander manages its internal HDMs, translating HPAs in host requests into DPAs through address mapping. The Expander’s DPA space is organized into Device Media Partitions (DMPs) and supports heterogeneous DRAM and PM media, as shown in Figure 4.

Figure 4. Expander address mapping.
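For illustration, the sketch below captures the kind of per-partition metadata and HPA-to-DPA lookup that Figure 4’s address mapping implies; the structure names and fields are our assumptions, not definitions from the CXL specification.

#include <stdint.h>
#include <stddef.h>

enum media_type { MEDIA_DRAM, MEDIA_PM };

struct dmp {                      /* Device Media Partition */
    uint64_t dpa_base;            /* start of the DPA range */
    uint64_t len;                 /* partition length in bytes */
    enum media_type media;        /* DRAM or persistent memory */
};

struct hdm_decoder {              /* one host-visible HDM range */
    uint64_t hpa_base;            /* base of the mapped HPA window */
    uint64_t len;
    const struct dmp *dmp;        /* backing partition */
    uint64_t dpa_offset;          /* offset into the partition */
};

/* Translate a host physical address into a device physical address. */
static int hpa_to_dpa(const struct hdm_decoder *dec, size_t n,
                      uint64_t hpa, uint64_t *dpa)
{
    for (size_t i = 0; i < n; i++) {
        if (hpa >= dec[i].hpa_base && hpa < dec[i].hpa_base + dec[i].len) {
            *dpa = dec[i].dmp->dpa_base + dec[i].dpa_offset
                 + (hpa - dec[i].hpa_base);
            return 0;
        }
    }
    return -1;                    /* address not mapped by this Expander */
}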

LMB kernel module. The memory access protocols of PCIe and CXL devices are completely different, and existing CXL memory pool designs are difficult to adapt to PCIe devices. To address this problem, we treat the host as a bridge and implement the LMB kernel module to provide a uniform memory allocation and sharing interface to both PCIe and CXL devices. The kernel module first requests a memory block from the FM and then interacts with the device driver to allocate memory for it. In addition, we raise the loading priority of the LMB module to avoid memory request failures during the initialization phase of PCIe device drivers.

3.2. Memory Management

Memory allocator. The kernel module requests memory from the Expander through the FM API (CXL_spec, ); the obtained memory is mapped into the host’s physical address space, waiting to be allocated to local devices. When the kernel module does not have enough free memory to complete an allocation, it requests a single 256MB block from the Expander. When all device memory in a block has been freed, the kernel module releases the block back to the FM. We keep the memory allocator metadata in the host to facilitate the alignment of large memory mappings and to avoid triggering extra CXL memory accesses.
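A minimal sketch of this block-granular policy follows; fm_request_block and fm_release_block are hypothetical stand-ins for the FM API calls, and the bump-style carving within a block is a simplification of the real allocator.

#include <stdint.h>

#define LMB_BLOCK_SIZE (256ULL << 20)   /* single 256MB block requested from the FM */

struct lmb_block {
    uint64_t hpa_base;                  /* host physical base of the mapped block */
    uint64_t used;                      /* bytes currently handed out to devices */
    struct lmb_block *next;
};

/* Hypothetical FM helpers: acquire or release one block of Expander memory. */
extern struct lmb_block *fm_request_block(uint64_t size);
extern void fm_release_block(struct lmb_block *blk);

static struct lmb_block *blocks;        /* allocator metadata kept in host memory */

/* Carve 'size' bytes out of an existing block, or fetch a new one from the FM. */
uint64_t lmb_alloc(uint64_t size)
{
    for (struct lmb_block *b = blocks; b; b = b->next) {
        if (b->used + size <= LMB_BLOCK_SIZE) {
            uint64_t hpa = b->hpa_base + b->used;
            b->used += size;
            return hpa;
        }
    }
    struct lmb_block *b = fm_request_block(LMB_BLOCK_SIZE);
    if (!b)
        return 0;
    b->used = size;
    b->next = blocks;
    blocks = b;
    return b->hpa_base;
}

/* Return memory; once a block's accounting drops to zero, release it to the FM. */
void lmb_free(uint64_t hpa, uint64_t size)
{
    for (struct lmb_block **pp = &blocks; *pp; pp = &(*pp)->next) {
        struct lmb_block *b = *pp;
        if (hpa >= b->hpa_base && hpa < b->hpa_base + LMB_BLOCK_SIZE) {
            b->used -= size;
            if (b->used == 0) {
                *pp = b->next;
                fm_release_block(b);
            }
            return;
        }
    }
}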

Data path. PCIe devices cannot access the Expander directly using the CXL protocol, but they can issue reads and writes to the physical addresses to which the HDM is mapped. The PCIe TLP (Transaction Layer Packet) is converted by the CPU into MemRd/MemWr commands of the CXL.mem protocol, so the Expander treats the access as a memory access by the host. Since the PCIe device does not support CXL cache coherency, LMB sets this memory to the uncached type. When a PCIe device and a CXL device share memory, the PCIe device cannot receive Back-Invalidate Snoops from the Expander, but this does not cause consistency problems.

Table 2. LMB kernel API
Operation: Interface
Allocate: lmb_PCIe_alloc(*dev, size, *hpa, *mmid); lmb_CXL_alloc(*CXLd, size, *hpa, *DPID, *mmid)
Free: lmb_PCIe_free(*dev, mmid); lmb_CXL_free(*CXLd, mmid)
Share: lmb_PCIe_share(*dev, mmid, *hpa); lmb_CXL_share(*CXLd, mmid, *hpa, *DPID)

3.3. LMB APIs and Access Control

APIs for driver usage. The kernel module exposes the APIs shown in Table 2 to device drivers, supporting the alloc, free, and share interfaces. Figure 5 shows the process of an SSD requesting LMB memory to store its L2P table. A PCIe device obtains a bus address accessible to the device and a memory ID unique within the local host. In addition to the HPA (Host Physical Address) of the GFAM, a CXL device also receives the global PID of the Expander to initiate P2P requests. Shared memory can be used as an input or output buffer to reduce memory copies between devices; for example, when sending data from SSDs to accelerators for computation, a zero-copy data path can be achieved with the help of shared memory. The kernel module maintains the mappings of HPAs and bus addresses to physical addresses, so memory sharing between PCIe devices and CXL devices reduces to address translation.
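A hedged usage sketch follows: the call signatures mirror Table 2, but the argument types, the lmb_mmid structure, and the nvme_setup_l2p_buffer helper are our assumptions rather than the actual kernel prototypes.

#include <stdint.h>
#include <stddef.h>

struct pci_dev;                          /* opaque handle, stands in for the real type */
struct lmb_mmid { uint64_t id; };        /* assumed shape of the LMB memory ID */

/* Declarations mirroring Table 2; the real prototypes may differ. */
int lmb_PCIe_alloc(struct pci_dev *dev, size_t size,
                   uint64_t *hpa, struct lmb_mmid *mmid);
int lmb_PCIe_free(struct pci_dev *dev, struct lmb_mmid mmid);

/* Reserve Expander memory for a 4-byte-per-4KB-page L2P table (as in Figure 5). */
int nvme_setup_l2p_buffer(struct pci_dev *dev, uint64_t capacity_bytes,
                          uint64_t *l2p_hpa, struct lmb_mmid *l2p_mmid)
{
    size_t l2p_size = capacity_bytes / 4096 * 4;
    int ret = lmb_PCIe_alloc(dev, l2p_size, l2p_hpa, l2p_mmid);
    if (ret)
        return ret;                      /* e.g., fall back to on-board DRAM */
    /* The driver can now DMA L2P entries to/from *l2p_hpa via the host bridge. */
    return 0;
}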

To protect memory safety between devices, a device must be restricted from accessing memory ranges that do not belong to it. For PCIe devices, the IOMMU is commonly used to isolate the range of memory a device can access. Access control from CXL devices to the GFD is managed by the SAT (SPID Access Table): the GFD identifies the CXL device or host that initiated a request by the SPID field in the memory request. LMB integrates these two access control modes. When memory is requested by a PCIe device, the kernel module creates IOMMU page table entries for the allocated memory. For CXL devices, the kernel module adds the SPID to the SAT by means of the GFD Component Management Command Set. When memory is released or shared, the associated entries are updated accordingly.
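The two access-control paths can be summarized by the dispatch below; this is a sketch only, and iommu_map_range and gfd_sat_add are hypothetical wrappers around the IOMMU mapping and GFD Component Management Command Set operations, not existing kernel functions.

#include <stdint.h>

enum lmb_dev_type { LMB_DEV_PCIE, LMB_DEV_CXL };

struct lmb_grant {
    enum lmb_dev_type type;
    uint64_t hpa;             /* start of the granted range (host physical) */
    uint64_t len;
    uint16_t spid;            /* source PBR ID, meaningful for CXL devices only */
};

extern int iommu_map_range(void *dev, uint64_t hpa, uint64_t len);   /* hypothetical */
extern int gfd_sat_add(uint16_t spid, uint64_t dpa, uint64_t len);   /* hypothetical */

int lmb_grant_access(void *dev, const struct lmb_grant *g, uint64_t dpa)
{
    if (g->type == LMB_DEV_PCIE)
        /* PCIe path: confine the device to the range via IOMMU page tables. */
        return iommu_map_range(dev, g->hpa, g->len);
    /* CXL path: register the requester's SPID in the GFD's SPID Access Table. */
    return gfd_sat_add(g->spid, dpa, g->len);
}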

Figure 5. SSD stores the L2P table through the LMB.

4. Evaluation

In this position paper, we present preliminary simulation results to demonstrate the effectiveness of LMB in extending SSD DRAM for L2P indexing. We compare LMB with the Ideal scheme (the whole mapping table in onboard DRAM) and DFTL (gupta2009dftl, ). The LMB scheme is further split into LMB-CXL and LMB-PCIe to evaluate scenarios in which different devices access external memory via LMB. We also use PCIe Gen4 and Gen5 SSDs (Table 3) to evaluate the impact of different PCIe generations. We evaluate these schemes with FIO (fio, ) under workloads of random/sequential writes and random/sequential reads, using the libaio IO engine with a queue depth of 64 and an IO size of 4KB.
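An illustrative fio job file matching these parameters is shown below; the target device path is a placeholder and the run length is our choice, not the paper's.

[global]
ioengine=libaio
direct=1
bs=4k
iodepth=64
runtime=60
time_based=1
; placeholder block device; replace with the SSD under test
filename=/dev/nvme0n1

[seq-write]
rw=write
stonewall

[rand-write]
rw=randwrite
stonewall

[seq-read]
rw=read
stonewall

[rand-read]
rw=randread
stonewall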

Prototype implementation. The LMB and DFTL schemes store mapping tables on CXL memory and on flash, respectively. Due to the scarcity of real CXL development boards, we simulate LMB-CXL and LMB-PCIe on a PCIe Gen4 SSD and a PCIe Gen5 SSD with modified firmware, particularly in the L2P indexing module. We use two devices mainly to observe the potential impact of the extra CXL latency on devices with different baseline performance. Baseline performance differences between the two SSDs lead to different simulation outputs under the same conditions.

We add latency to the Ideal scheme’s L2P indexing to simulate LMB and DFTL. An additional 25μs (one flash read operation) is added to Ideal to simulate a DFTL cache miss, while 880ns and 1190ns are added to simulate LMB-PCIe on the PCIe Gen4 and Gen5 SSDs, respectively. A 190ns latency is added to simulate LMB-CXL.

Table 3. SSD Specifications
Parameter PCIe Gen4 x4 PCIe Gen5 x4
Capacity (TB) 7.68 7.68
4KB Rand R/W IOPS (K) 1750/340 2800/700
128KB Seq R/W BW (GB/s) 7.2/6.8 14/10
4KB Rand R/W Lat. (μs) 67/9 56/8
Figure 6. Comparison of the LMB scheme with Ideal (DRAM) and DFTL in the context of PCIe SSD L2P index scenarios. (a) PCIe Gen4 SSD; (b) PCIe Gen5 SSD.

4.1. Evaluation on Commercial SSDs

4.1.1. PCIe Gen4 SSD Evaluation

We first evaluate the performance of LMB on a PCIe Gen4 SSD, the mainstream product in the current SSD market.

Figure 6(a) illustrates the performance of the different schemes. For write workloads, LMB-CXL and LMB-PCIe match the Ideal scheme’s throughput, outperforming DFTL by 7×. For read workloads, LMB-CXL maintains performance similar to the Ideal scheme, while LMB-PCIe experiences a 16.6% sequential and 13.3% random read performance drop; it still surpasses DFTL by 14×. Given the superior performance of the TLC SSD over QLC SSDs, we posit that the LMB mechanism can effectively handle L2P indexing on high-capacity QLC SSDs and resolve their onboard memory shortage.

4.1.2. PCIe Gen5 SSD Evaluation

Figure 6(b) compares the performance of the different schemes on a PCIe Gen5 SSD. For write workloads, both LMB-CXL and LMB-PCIe deliver performance similar to the Ideal scheme, achieving 20× higher throughput than DFTL. For read workloads, both LMB-CXL and LMB-PCIe show performance degradation. More specifically, LMB-CXL achieves 8% and 56% lower throughput than the Ideal scheme for sequential and random reads, respectively, while LMB-PCIe degrades more severely, with 62% and 70% lower throughput than Ideal for sequential and random reads. Despite that, LMB-PCIe still outperforms DFTL by 20×.

The results indicate that introducing hundreds of nanoseconds of additional CXL latency into the indexing path significantly impacts high-performance SSDs. However, these results assume that every index lookup is served by CXL-extended memory. By exploiting the locality of real workloads, where most index lookups hit on-board memory, the impact of the CXL secondary index on device performance would be considerably diminished.
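As a rough sketch of this locality argument (a simple model of ours, not a measured result), the expected added latency per index lookup with an on-board hit ratio $h$ is

\[
\Delta T_{\text{index}} = (1 - h)\, T_{\text{CXL}},
\]

so with an assumed 90% on-board hit ratio and the 190ns LMB-CXL penalty used above, the average added latency would be roughly 19ns per lookup.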

5. Conclusion

We present a unified LMB framework designed to address onboard DRAM shortages in PCIe devices. The framework exposes the DRAM resources of a CXL memory expander, providing a consolidated memory management mechanism for both CXL and PCIe devices. Our investigation on real SSDs shows that the LMB solution effectively supplements onboard DRAM while preserving high performance. A rigorous design and implementation of the LMB framework must tackle various remaining challenges, which we will pursue in future work.

References

  • [1] Flexible i/o tester. https://github.com/axboe/fio.
  • [2] Nvm express 1.2a. https://www.nvmexpress.org/wp-content/uploads/NVM-Express-1_2a.pdf.
  • [3] Samsung cxl solutions – cmm-h. https://semiconductor.samsung.com/news-events/tech-blog/samsung-cxl-solutions-cmm-h/. Accessed: April 6, 2024.
  • [4] Solidigm d5-p5336 ssd. https://www.solidigm.com/products/data-center/d5/p5336.html. Accessed: December 23, 2023.
  • [5] Tyler Allen and Rong Ge. In-depth analyses of unified virtual memory system for gpu accelerated computing. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15, 2021.
  • [6] Alberto Cano. A survey on graphic processing unit computing for large-scale data mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(1):e1232, 2018.
  • [7] CXL Consortium. Cxl 3.0 specification. https://computeexpresslink.org/cxl-specification/.
  • [8] Niv Dayan, Moshe Twitto, Yuval Rochman, Uri Beitler, Itai Ben Zion, Edward Bortnikov, Shmuel Dashevsky, Ofer Frishman, Evgeni Ginzburg, Igal Maly, Avraham (Poza) Meir, Mark Mokryn, Iddo Naiss, and Noam Rabinovich. The end of moore’s law and the rise of the data processor. Proc. VLDB Endow., 14(12):2932–2944, jul 2021.
  • [9] Donghyun Gouk, Miryeong Kwon, Hanyeoreum Bae, Sangwon Lee, and Myoungsoo Jung. Memory pooling with cxl. IEEE Micro, 43(2):48–57, 2023.
  • [10] Boncheol Gu, Andre S Yoon, Duck-Ho Bae, Insoon Jo, Jinyoung Lee, Jonghyun Yoon, Jeong-Uk Kang, Moonsang Kwon, Chanho Yoon, Sangyeun Cho, et al. Biscuit: A framework for near-data processing of big data workloads. ACM SIGARCH Computer Architecture News, 44(3):153–165, 2016.
  • [11] Aayush Gupta, Youngjae Kim, and Bhuvan Urgaonkar. Dftl: a flash translation layer employing demand-based selective caching of page-level address mappings. Acm Sigplan Notices, 44(3):229–240, 2009.
  • [12] Minho Ha, Junhee Ryu, Jungmin Choi, Kwangjin Ko, Sunwoong Kim, Sungwoo Hyun, Donguk Moon, Byungil Koh, Hokyoon Lee, Myoungseo Kim, et al. Dynamic capacity service for improving cxl pooled memory efficiency. IEEE Micro, 43(2):39–47, 2023.
  • [13] Mark Harris. Unified memory for cuda beginners — nvidia technical blog. https://developer.nvidia.com/blog/unified-memory-cuda-beginners/, 2017.
  • [14] Junsu Im, Jinwook Bae, Chanwoo Chung, Sungjin Lee, et al. Pink: High-speed in-storage key-value store with bounded tails. In 2020 USENIX Annual Technical Conference (USENIX ATC 20), pages 173–187, 2020.
  • [15] Song Jiang, Lei Zhang, XinHao Yuan, Hao Hu, and Yu Chen. S-ftl: An efficient address translation for flash memory by exploiting spatial locality. In 2011 IEEE 27th Symposium on Mass Storage Systems and Technologies (MSST), pages 1–12. IEEE, 2011.
  • [16] Yanqin Jin, Hung-Wei Tseng, Yannis Papakonstantinou, and Steven Swanson. Kaml: A flexible, high-performance key-value ssd. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 373–384. IEEE, 2017.
  • [17] Myoungsoo Jung. Hello bytes, bye blocks: Pcie storage meets compute express link for memory expansion (cxl-ssd). In Proceedings of the 14th ACM Workshop on Hot Topics in Storage and File Systems, HotStorage ’22, page 45–51, New York, NY, USA, 2022. Association for Computing Machinery.
  • [18] Yangwook Kang, Rekha Pitchumani, Pratik Mishra, Yang-suk Kee, Francisco Londono, Sangyoon Oh, Jongyeol Lee, and Daniel DG Lee. Towards building a high-performance, scale-in key-value storage system. In Proceedings of the 12th ACM International Conference on Systems and Storage, pages 144–154, 2019.
  • [19] Kyungsan Kim, Hyunseok Kim, Jinin So, Wonjae Lee, Junhyuk Im, Sungjoo Park, Jeonghyeon Cho, and Hoyoung Song. Smt: Software-defined memory tiering for heterogeneous computing systems with cxl memory expander. IEEE Micro, 43(2):20–29, 2023.
  • [20] Yoona Kim, Inhyuk Choi, Juhyung Park, Jaeheon Lee, Sungjin Lee, and Jihong Kim. Integrated host-ssd mapping table management for improving user experience of smartphones. In 21st USENIX Conference on File and Storage Technologies (FAST 23), pages 441–456, 2023.
  • [21] Miryeong Kwon, Seungjun Lee, Hyunkyu Choi, Jooyoung Hwang, and Myoungsoo Jung. Vigil-kv: Hardware-software co-design to integrate strong latency determinism into log-structured merge key-value stores. In 2022 USENIX Annual Technical Conference (USENIX ATC 22), pages 755–772, 2022.
  • [22] Youngeun Kwon and Minsoo Rhu. Beyond the memory wall: A case for memory-centric hpc system for deep learning. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 148–161. IEEE, 2018.
  • [23] Huaicheng Li, Daniel S Berger, Lisa Hsu, Daniel Ernst, Pantea Zardoshti, Stanko Novakovic, Monish Shah, Samir Rajadnya, Scott Lee, Ishwar Agarwal, et al. Pond: Cxl-based memory pooling systems for cloud platforms. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 574–587, 2023.
  • [24] Xiaqing Li, Guangyan Zhang, H Howie Huang, Zhufan Wang, and Weimin Zheng. Performance analysis of gpu-based convolutional neural networks. In 2016 45th International conference on parallel processing (ICPP), pages 67–76. IEEE, 2016.
  • [25] Shengwen Liang, Ying Wang, Youyou Lu, Zhe Yang, Huawei Li, and Xiaowei Li. Cognitive ssd: A deep learning engine for in-storage data retrieval. In 2019 USENIX Annual Technical Conference (USENIX ATC 19), pages 395–410, 2019.
  • [26] Yanghua Peng, Yixin Bao, Yangrui Chen, Chuan Wu, and Chuanxiong Guo. Optimus: an efficient dynamic resource scheduler for deep learning clusters. In Proceedings of the Thirteenth EuroSys Conference, pages 1–14, 2018.
  • [27] Zaid Qureshi, Vikram Sharma Mailthody, Isaac Gelado, Seungwon Min, Amna Masood, Jeongmin Park, Jinjun Xiong, C. J. Newburn, Dmitri Vainbrand, I-Hsin Chung, Michael Garland, William Dally, and Wen-mei Hwu. Gpu-initiated on-demand high-throughput storage access in the bam system architecture. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS 2023, page 325–339, New York, NY, USA, 2023. Association for Computing Machinery.
  • [28] Debendra Das Sharma. Compute express link®: An open industry-standard interconnect enabling heterogeneous data-centric computing. In 2022 IEEE Symposium on High-Performance Interconnects (HOTI), pages 5–12. IEEE, 2022.
  • [29] Jinghan Sun, Shaobo Li, Yunxin Sun, Chao Sun, Dejan Vucinic, and Jian Huang. Leaftl: A learning-based flash translation layer for solid-state drives. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 442–456, 2023.
  • [30] Zehuan Wang, Yingcan Wei, Minseok Lee, Matthias Langer, Fan Yu, Jie Liu, Shijie Liu, Daniel G Abel, Xu Guo, Jianbing Dong, et al. Merlin hugectr: Gpu-accelerated recommender system training and inference. In Proceedings of the 16th ACM Conference on Recommender Systems, pages 534–537, 2022.
  • [31] Mark Wilkening, Udit Gupta, Samuel Hsia, Caroline Trippel, Carole-Jean Wu, David Brooks, and Gu-Yeon Wei. Recssd: near data processing for solid state drive based recommendation inference. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 717–729, 2021.
  • [32] Haoyang Zhang, Yirui Zhou, Yuqi Xue, Yiqi Liu, and Jian Huang. G10: Enabling an efficient unified gpu memory and storage architecture with smart tensor migrations. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’23, page 395–410, New York, NY, USA, 2023. Association for Computing Machinery.