A compute-in-memory architecture and system-technology codesign simulator based on 3D NAND flash

ZHENG Hao, LIU Huiwen, FANG Yuxuan, FAN Dongyu, HAN Yuhui, HOU Chunyuan, LIU Wei, XIA Zhiliang, HUO Zongliang

  • The rapid advancement of large language models (LLMs) such as ChatGPT has imposed unprecedented demands on hardware in terms of computational power, memory capacity, and energy efficiency. Compute-in-memory (CIM) technology, which integrates computing directly into memory arrays, is a promising way to overcome the data-movement bottleneck of traditional von Neumann architectures, significantly reduce power consumption, and achieve large-scale parallel processing. Among non-volatile memory candidates, 3D NAND flash stands out for its mature manufacturing process, ultrahigh density, and cost-effectiveness, making it a strong contender for commercial CIM deployment and local inference of large models. Despite these advantages, most existing research on 3D NAND-based CIM remains at an academic level, focusing on theoretical designs or small-scale prototypes, with little attention paid to system-level architecture design and functional validation using product-grade 3D NAND chips for LLM applications. To address this gap, we propose a CIM architecture based on 3D NAND flash that uses a source-line (SL) slicing technique to partition the array and perform parallel computation at minimal manufacturing cost. The architecture is complemented by an efficient mapping algorithm and a pipelined dataflow, enabling system-level simulation and rapid industrial iteration. We develop a PyTorch-based behavioral simulator for LLM inference on the proposed hardware and evaluate the influence of current distribution and quantization on system performance.
Our design supports INT4/INT8 quantization, employs dynamic weight storage logic to minimize voltage switching overhead, and is further optimized through hierarchical pipelining to maximize throughput under hardware constraints. Simulation results show that the simulated 3D NAND compute-in-memory chip reaches a generation speed of 20 tokens/s with an energy efficiency of 5.93 TOPS/W on GPT-2-124M, and 8.5 tokens/s with 7.17 TOPS/W on GPT-2-355M, while maintaining system-level reliability for on-state current distributions with σ < 2.5 nA; in INT8 mode, quantization error is the dominant accuracy bottleneck. Compared with previous CIM solutions, our architecture supports larger model loads, higher computational precision, and significantly reduced power consumption, as evidenced by comprehensive benchmarking. The SL slicing technique keeps array wastage below 3%, while hybrid wafer bonding integrates high-density ADC/TIA circuits to improve hardware resource utilization. This work represents the first system-level simulation of LLM inference on product-grade 3D NAND CIM hardware, providing a standardized and scalable reference for future commercialization. The complete simulation framework is released on GitHub to facilitate further research and development. Future work will focus on device-level optimization of 3D NAND and iterative improvement of the simulator algorithm.
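To make the reliability statement more concrete, the following is a minimal sketch (not the released simulator) of how cell-to-cell variation in the on-state (open-state) current can be injected into a 1-bit current-summing read. The mean current `I_ON_NA`, the number of summed cells, the idealized ADC, and all function names are illustrative assumptions; only the idea of a Gaussian on-state current distribution with standard deviation σ comes from the abstract.

```python
import random

I_ON_NA = 50.0  # assumed mean on-state cell current in nA (hypothetical value)

def summed_current(bits, weights, sigma_na, rng):
    """Sum the current of all cells whose input bit and weight bit are both 1."""
    total = 0.0
    for x, w in zip(bits, weights):
        if x and w:
            # Each conducting cell contributes a Gaussian-distributed current.
            total += rng.gauss(I_ON_NA, sigma_na)
    return total

def read_count(current_na):
    """Idealized ADC: convert the summed current to a number of conducting cells."""
    return max(0, round(current_na / I_ON_NA))

def error_rate(sigma_na, n_cells=64, trials=2000, seed=0):
    """Fraction of reads whose decoded count differs from the exact bit count."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        bits = [rng.randint(0, 1) for _ in range(n_cells)]
        weights = [rng.randint(0, 1) for _ in range(n_cells)]
        exact = sum(x & w for x, w in zip(bits, weights))
        if read_count(summed_current(bits, weights, sigma_na, rng)) != exact:
            errors += 1
    return errors / trials
```

Sweeping `sigma_na` in `error_rate` gives a rough feel for why a bound on σ matters: the spread of the summed current grows with the square root of the number of conducting cells, so it eventually exceeds half of one ADC step.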
  • Figure 1.  Schematic of the GPT-2-124M model algorithm: (a) overall algorithm flow of GPT-2-124M; (b) masked multi-head attention computation flow; (c) feed-forward fully connected layer computation flow.

    Figure 2.  Vector-matrix multiplication in 3D NAND flash memory and its circuit schematic.

    Figure 3.  The 3D NAND-SS architecture: (a) system architecture diagram; (b) controller data and instruction flow; (c) 3D NAND-SS array diagram.

    Figure 4.  Top and oblique views of the SL partitions formed by segmenting the source-line N-type polysilicon region in a single block.

    Figure 5.  The 3D NAND-SS input and weight mapping scheme: both inputs and weights are quantized into multiple 1-bit data arranged along the horizontal and vertical directions.

    Figure 6.  Array schematic of the 3D NAND-SS computation process.

    Figure 7.  The 3D NAND-SS computation pipeline design for the case in which multiple TIAs share a multiplexed ADC.

    Figure 8.  (a) Token probabilities of the model under different quantization bit widths (prompt: “How is the weather today?”); (b) probability distribution of the output differences of the GPT-2-124M model under different on-state current distributions.

    Figure 9.  Comparison of computation errors in the MLP down-projection layer under asymmetric and symmetric quantization.

    Figure 10.  (a) Time breakdown for generating 10 tokens on the 3D NAND-SS architecture; (b) power consumption breakdown for generating 10 tokens on the 3D NAND-SS architecture.
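The mapping described for Figure 5 decomposes inputs and weights into 1-bit slices. As a rough illustration, assuming unsigned integers and an ideal read-out (signed values and the actual array layout are not modeled, and the function names are ours rather than the authors'), the sketch below shows that counting 1-bit AND products per pair of bit planes and recombining them by shift-and-add reproduces the integer dot product.

```python
def bit_planes(values, n_bits):
    """Split unsigned integers into n_bits binary planes, least significant bit first."""
    return [[(v >> b) & 1 for v in values] for b in range(n_bits)]

def bit_sliced_dot(inputs, weights, n_bits=8):
    """Dot product assembled from 1-bit partial sums by shift-and-add."""
    x_planes = bit_planes(inputs, n_bits)
    w_planes = bit_planes(weights, n_bits)
    total = 0
    for i, x_plane in enumerate(x_planes):
        for j, w_plane in enumerate(w_planes):
            # One idealized array read: count cells where both bits are 1.
            partial = sum(x & w for x, w in zip(x_plane, w_plane))
            # Weight the partial sum by the significance of the two bit planes.
            total += partial << (i + j)
    return total

inputs = [3, 200, 17, 255]
weights = [9, 1, 128, 254]
assert bit_sliced_dot(inputs, weights) == sum(x * w for x, w in zip(inputs, weights))
```

With 8-bit inputs and 8-bit weights this takes 64 partial sums per output, which is one reason the number of bit planes (INT4 versus INT8) bears directly on read count and latency.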

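    Table 1.  Selected parameters of the simulation platform.

    | Parameter | Function | Parameter | Function |
    | --- | --- | --- | --- |
    | Quantization bits | Quantization bit width | Block setup time | Timing constant |
    | Current mean/Scale | Mean/standard deviation of the device on-state current distribution | WL switch time | Timing constant |
    | Blocks/Operation | Number of blocks per compute operation | TSG switch time | Timing constant |
    | Max current sum | Number of currents summed per compute operation | BL switch time | Timing constant |
    | Symmetric mode | Whether symmetric quantization is used | TIA conversion time | Timing constant |
    | ADC multiplexing factor | ADC multiplexing factor | ADC conversion time | Timing constant |
    | X path current | X-path (horizontal) current | Planes/Die | Hardware constant |
    | Y path current | Y-path (vertical) current | Layers/Die | Hardware constant |
    | Ve | Voltage | Blocks/Plane | Hardware constant |
    | Background current | Background current | TSGs/Block | Hardware constant |
    | Num of TIAs | Number of TIAs | Bit lines/Plane | Hardware constant |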
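Table 1 lists a quantization bit width and a symmetric-mode switch among the simulator parameters. The sketch below is a generic illustration of what such a switch typically selects, not the simulator's own code: symmetric quantization uses a zero-centred scale, while asymmetric quantization uses a scale plus an offset. All function names are assumptions made for this example.

```python
def quantize_symmetric(values, n_bits=8):
    """Quantize to signed integers with a zero-centred scale, then dequantize."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid a zero scale
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [qi * scale for qi in q]

def quantize_asymmetric(values, n_bits=8):
    """Quantize to unsigned integers using a scale and an offset, then dequantize."""
    levels = 2 ** n_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # avoid a zero scale
    q = [max(0, min(levels, round((v - lo) / scale))) for v in values]
    return [qi * scale + lo for qi in q]

def max_abs_error(original, restored):
    """Largest absolute round-trip error over all elements."""
    return max(abs(a - b) for a, b in zip(original, restored))
```

For a one-sided range such as `[0.0, 4.0]`, the asymmetric step `(hi - lo) / 255` is roughly half the symmetric step `max|v| / 127`, so the worst-case rounding error is roughly halved; for ranges centred on zero the two schemes behave almost identically.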

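    Table 2.  3D NAND-SS hardware configuration.

    | Hardware parameter | Value | Hardware parameter | Value |
    | --- | --- | --- | --- |
    | Planes per chip | 4 | Layers per chip | 32* |
    | Blocks per plane | 216 | Vertical slicing count | 216 |
    | TSGs per block | 10 | Horizontal slicing count | 1024 |
    | BLs per plane | 131072 | ADC maximum resolution | 128 |

    * The layer count is reduced to simplify the simulation; the actual product has 128 layers.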

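    Table 3.  GPT-2-124M model parameters.

    | Layer | Compute hardware | Parameter shape | Parameter size (INT8) |
    | --- | --- | --- | --- |
    | Embedding layer | CPU/GPU | (50256, 768) | |
    | QKV projection layer | 3D NAND-SS | (768, 2304) | 13.5 MB |
    | Attention matrix computation | CPU/GPU | (sequence length, 768) | |
    | Attention matrix projection | 3D NAND-SS | (768, 768) | 4.5 MB |
    | MLP up-projection layer | 3D NAND-SS | (768, 3072) | 18 MB |
    | Activation function | CPU/GPU | | |
    | MLP down-projection layer | 3D NAND-SS | (3072, 768) | 18 MB |
    | MLP dequantization layer | 3D NAND-SS | (3072, 768) | 18 MB |
    | Normalization | CPU/GPU | | |
    | Residual connection | CPU/GPU | | |
    | Model head | CPU/GPU | (768, 50256) | |

    Note: Only the parameters of a single attention module are shown. In the actual algorithm the attention module is usually stacked in multiple layers; the GPT-2-124M model has 12 such layers.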
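As a quick arithmetic check on Table 3, the sketch below recomputes the storage footprint of the layers mapped to 3D NAND-SS from their shapes. The listed sizes coincide numerically with rows × columns × 8 bits divided by 2^20, which would be consistent with counting 1-bit cells for INT8 weights; that reading is our inference rather than a statement from the source, and the dictionary keys are names chosen for this example.

```python
# Shapes of the weight matrices that Table 3 assigns to 3D NAND-SS.
LAYER_SHAPES = {
    "qkv_projection": (768, 2304),
    "attention_projection": (768, 768),
    "mlp_up_projection": (768, 3072),
    "mlp_down_projection": (3072, 768),
}

def one_bit_cells(shape, n_bits=8):
    """Number of 1-bit cells needed if every weight is stored as n_bits separate bits."""
    rows, cols = shape
    return rows * cols * n_bits

def in_mebi(count):
    """Express a raw count in units of 2**20."""
    return count / 2 ** 20

footprint = {name: in_mebi(one_bit_cells(shape)) for name, shape in LAYER_SHAPES.items()}
per_module = sum(footprint.values())   # one attention module, weight matrices only
all_modules = 12 * per_module          # GPT-2-124M stacks 12 such modules
```

Under this reading the four weight matrices of one module occupy 54 × 2^20 cells, and the 12 stacked modules occupy 648 × 2^20 cells, before counting the dequantization layer listed in the table.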

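    Table 4.  Simulation parameters for 3D NAND-SS computation time.

    | Parameter | Value | Parameter | Value |
    | --- | --- | --- | --- |
    | Block setup time/μs | $7 b_{\text{num}}$ | X-path current/mA | 96.732 |
    | BL switch time/μs | 13 | Y-path current/nA | 150 |
    | WL switch time/μs | 2 | Vcc/V | 2.5 |
    | TSG switch time/μs | 0.8 | ADC+TIA power/mW | 0.5 |
    | TIA conversion time/μs | 0.25 | | |
    | ADC conversion time/μs | 0.002 | | |

    Note: The X-path current is the average current during the time required to establish all WL voltages of one block in a single plane; the Y-path current is the average current during the time required to establish one BL in a single plane.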
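The constants in Table 4 can be combined into simple back-of-the-envelope estimates. The sketch below is only an illustration under stated assumptions: `b_num` is read as the number of blocks set up per operation, the read-out stage is assumed to be paced by the slower of the TIA conversion and the serialized ADC conversions, and the X-path power is assumed to be Vcc times the X-path current. None of these models is confirmed by the source, and the function names are ours.

```python
# Constants taken from Table 4.
T_BLOCK_SETUP_PER_BLOCK_US = 7.0   # block setup time is listed as 7 * b_num
T_TIA_US = 0.25
T_ADC_US = 0.002
VCC_V = 2.5
I_X_MA = 96.732

def block_setup_time_us(b_num):
    """Block setup time in microseconds for b_num blocks (assumed meaning of b_num)."""
    return T_BLOCK_SETUP_PER_BLOCK_US * b_num

def readout_stage_time_us(adc_mux):
    """Assumed pipeline model: TIAs convert in parallel while one ADC serves
    adc_mux TIAs in turn, so the slower of the two steps sets the pace."""
    return max(T_TIA_US, adc_mux * T_ADC_US)

def x_path_power_mw():
    """Assumed X-path power: supply voltage times average X-path current (V * mA = mW)."""
    return VCC_V * I_X_MA

def adc_mux_break_even():
    """Largest multiplexing factor for which the ADC is not the slower step."""
    return round(T_TIA_US / T_ADC_US)
```

Under these assumptions an ADC could be shared by up to about 125 TIAs before its serialized conversions take longer than one TIA conversion, which gives a rough sense of why multiplexing ADCs across TIAs can save area without necessarily lengthening the read-out stage.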

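    Table 5.  Benchmark.

    | Device technology node | 32 nm 3D NAND [11] | 40 nm 3D NAND-SS | 40 nm 3D NAND-SS |
    | --- | --- | --- | --- |
    | ADC precision/bit | 7 | 7 | 7 |
    | Cell precision/bit | 1 | 1 | 1 |
    | Area/mm² | 17.91 | 40 | 40 |
    | Capacity utilization/% | 33.5 @INT8 | 17 @INT8 | 60 @INT8 |
    | Computing power/TOPS | 0.0018 | 4.57 | 4.57 |
    | Energy efficiency/(TOPS·W⁻¹) | 12.95 @INT8 | 5.93 @INT8 | 7.17 @INT8 |
    | Workload model | ResNet-18 | GPT-2-124M | GPT-2-355M |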
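The energy-efficiency figures in Table 5 can be converted to energy per operation, since 1 TOPS/W equals 10^12 operations per joule, that is, 1 pJ per operation. The sketch below does this conversion and also derives an implied power by dividing throughput by efficiency; the latter assumes both numbers refer to the same operating point, which the source does not state. The dictionary labels and function names are ours.

```python
def picojoule_per_op(tops_per_watt):
    """Energy per operation in pJ: 1 TOPS/W corresponds to 1 pJ per operation."""
    return 1.0 / tops_per_watt

def implied_power_w(tops, tops_per_watt):
    """Power in watts, assuming throughput and efficiency share one operating point."""
    return tops / tops_per_watt

# (throughput in TOPS, efficiency in TOPS/W) as listed in Table 5.
BENCHMARK = {
    "32 nm 3D NAND [11], ResNet-18": (0.0018, 12.95),
    "40 nm 3D NAND-SS, GPT-2-124M": (4.57, 5.93),
    "40 nm 3D NAND-SS, GPT-2-355M": (4.57, 7.17),
}

for name, (tops, efficiency) in BENCHMARK.items():
    energy = picojoule_per_op(efficiency)
    power = implied_power_w(tops, efficiency)
    print(f"{name}: {energy:.3f} pJ/op, {power:.6f} W implied")
```

Read this way, the comparison in the table is between designs operating at very different throughput levels, so energy per operation and absolute power are best considered together.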
  • [1] Singh Parihar S, Kumar S, Chatterjee S, Pahwa G, Singh Chauhan Y, Amrouch H 2025 IEEE J. Explor. Solid-State Comput. Devices Circuits 11 34
    [2] Molom-Ochir T, Taylor B, Li H, Chen Y R 2025 IEEE Trans. Circuits Syst. I 72 3971
    [3] Wu B, Lv X R, Yu T Y, Chen K, Liu W Q 2025 IEEE Nanotechnol. Mag. 3 19
    [4] Li H W, Yao E Y, Qin P, Jiang S 2025 IEEE Trans. Magn. 61 3401306
    [5] Khwa W S, Wen T H, Hsu H H, Huang W H, Chang Y C, Chiu T C, Ke Z E, Chin Y H, Wen H J, Hsu W T, Lo C C, Liu R S, Hsieh C C, Tang K T, Ho M S, Lele A S, Teng S H, Chou C C, Chih Y D, Chang T Y J, Chang M F 2025 Nature 639 617
    [6] Sharma V, Zhang X, Dhakad N S, Kim T T H 2025 IEEE Trans. Circuits Syst. I 72 5696
    [7] Liu S Q, Wei S T, Yao P, Wu D, Jie L, Pan S N, Tang J S, Gao B, Qian H, Wu H Q 2025 J. Semicond. 46 062304
    [8] Chang S H, Yen R H, Liu C N 2025 ACM J. Emerg. Technol. Comput. Syst. 21 4
    [9] Zhang Y Q, Wang J J, Lü Z Y, Han S T 2022 Acta Phys. Sin. 71 148502 (in Chinese)
    [10] Shim W, Yu S 2021 IEEE J. Explor. Solid-State Comput. Devices Circuits 7 1
    [11] Hong Y, Kim M, Kim C 2025 techrxiv: 174439324.42202505
    [12] Shim W, Yu S M 2021 IEEE J. Explor. Solid-State Comput. Devices Circuits 7 61
    [13] Shim W, Yu S M 2021 IEEE Electron Device Lett. 42 160
    [14] Chen Y Y, He Y H, Miao X S, Yang D H 2022 Acta Phys. Sin. 71 210702 (in Chinese)
    [15] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł, Polosukhin I 2017 Advances in Neural Information Processing Systems (Curran Associates, Inc) p5998
    [16] Hanna M, Liu O, Variengien A 2023 Adv. Neural Inf. Process. Syst. 36 76033
    [17] Lue H T, Hsu P K, Wei M L, Yeh T H, Du P Y, Chen W C, Wang K C, Lu C Y 2019 2019 IEEE International Electron Devices Meeting (IEDM) San Francisco, USA, December 9–11, 2019 p38.1.1
    [18] Kim M, Liu M, Everson L, Park G, Jeon Y, Kim S, Lee S, Song S, Kim C H 2019 2019 IEEE International Electron Devices Meeting (IEDM) San Francisco, USA, December 9–11, 2019 p38.3.1
    [19] Kang M, Kim H, Shin H, Sim J, Kim K, Kim L S 2022 IEEE Trans. Comput. 71 1291
    [20] Lee S T, Yeom G, Yoo H, Kim H S, Lim S, Bae J H, Park B G, Lee J H 2021 IEEE Trans. Electron Devices 68 3365
    [21] Lee S T, Lee J H 2020 Front. Neurosci. 14 571292
    [22] Wong R, Kim N, Higgs K, Agarwal S, Ipek E, Ghose S, Feinberg B 2024 arXiv: 2403.06938
    [23] Hsieh C C, Lue H T, Li Y C, Hung S N, Hung C H, Wang K C, Lu C Y 2023 2023 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits) Kyoto, Japan, June 11–16, 2023 p1

Publishing process
  • Received Date:  08 July 2025
  • Accepted Date:  01 October 2025
  • Available Online:  10 October 2025