一种基于3D NAND存储器的存算一体架构及其系统技术协同优化仿真

郑好 刘慧雯 方语萱 范冬宇 韩玉辉 侯春源 刘威 夏志良 霍宗亮


A compute-in-memory architecture and system-technology codesign simulator based on 3D NAND flash

ZHENG Hao, LIU Huiwen, FANG Yuxuan, FAN Dongyu, HAN Yuhui, HOU Chunyuan, LIU Wei, XIA Zhiliang, HUO Zongliang
  • 随着ChatGPT等大语言模型的发展, 产业界对硬件的算力、容量和功耗提出了新的需求. 存算一体(compute-in-memory, CIM)技术相较于传统近存计算, 减少了数据搬移, 显著降低功耗. 而在众多存储器中, 3D NAND闪存因其成熟的工艺制造技术和超高容量, 是最有可能实现大模型本地部署的候选方案. 然而, 目前针对3D NAND闪存CIM芯片的研究大多停留在学术研究阶段, 未基于产品级3D NAND闪存芯片进行系统性的CIM架构设计和大模型功能验证. 对此, 本文搭建了基于PyTorch框架的大语言模型仿真器平台来评估系统架构的性能, 并提出了一种基于源线背面切分工艺的通用3D NAND架构. 该架构通过改动3D NAND的源线制造工艺以支持CIM计算, 工艺成本极低, 可供产业界快速迭代, 并完善了相应的映射算法和流水线设计. 最后通过仿真器平台对所提出的架构在电流分布和量化的影响下进行了性能评估, 仿真结果表明所设计的产品级3D NAND芯片可以在GPT-2-124M大模型上做到20 tokens/s的生成速度和5.93 TOPS/W的能效比, 在GPT-2-355M大模型上做到8.5 tokens/s的生成速度和7.17 TOPS/W的能效比.
    The rapid advancement of large language models (LLMs) such as ChatGPT has imposed unprecedented demands on hardware in terms of computational power, memory capacity, and energy efficiency. Compute-in-memory (CIM) technology, which integrates computation directly into memory arrays, is a promising way to overcome the data-movement bottleneck of traditional von Neumann architectures, significantly reduce power consumption, and achieve large-scale parallel processing. Among non-volatile memory candidates, 3D NAND flash stands out for its mature manufacturing process, ultrahigh density, and cost-effectiveness, making it a strong contender for commercial CIM deployment and local inference of large models.

    Despite these advantages, most existing research on 3D NAND-based CIM remains at an academic level, focusing on theoretical designs or small-scale prototypes, with little attention paid to system-level architecture design and functional validation on product-grade 3D NAND chips for LLM applications. To address this gap, we propose a CIM architecture based on 3D NAND flash that uses a source-line (SL) slicing technique to partition the array and perform parallel computation at minimal manufacturing cost. The architecture is complemented by an efficient mapping algorithm and a pipelined dataflow, enabling system-level simulation and rapid industrial iteration.

    We develop a PyTorch-based behavioral simulator for LLM inference on the proposed hardware and use it to evaluate how current distribution and quantization affect system performance. The design supports INT4/INT8 quantization, employs dynamic weight-storage logic to minimize voltage-switching overhead, and is further optimized through hierarchical pipelining to maximize throughput under hardware constraints.

    Simulation results show that the product-grade 3D NAND compute-in-memory chip reaches a generation speed of 20 tokens/s with an energy efficiency of 5.93 TOPS/W on GPT-2-124M, and 8.5 tokens/s with 7.17 TOPS/W on GPT-2-355M, while maintaining system-level reliability for open-state current distributions with σ < 2.5 nA. In INT8 mode, quantization error is the dominant accuracy bottleneck.

    Compared with previous CIM solutions, the benchmarking indicates that our architecture supports larger model loads and higher computational precision with significantly reduced power consumption. The SL slicing technique keeps array wastage below 3%, while hybrid wafer bonding integrates high-density ADC/TIA circuits to improve hardware resource utilization.

    This work represents the first system-level simulation of LLM inference on product-grade 3D NAND CIM hardware, providing a standardized and scalable reference for future commercialization. The complete simulation framework is released on GitHub to facilitate further research and development. Future work will focus on device-level optimization of 3D NAND and iterative improvement of the simulator algorithm.
  • 图 1  GPT-2-124M 模型算法示意图 (a) GPT-2-124M 整体算法流程; (b) 掩码多头注意力计算流程; (c) 前向全连接层计算流程

    Fig. 1.  Schematic diagram of the GPT-2-124M model algorithm: (a) Overall algorithm flow of GPT-2-124M; (b) masked multi-head attention computation flow; (c) feed-forward fully connected layer computation flow.

    图 2  3D NAND闪存的向量矩阵乘法操作及其电路示意图

    Fig. 2.  Vector-matrix multiplication in 3D NAND flash memory and its circuit schematic.

    图 3  3D NAND-SS架构完整示意图 (a) 系统架构示意图; (b) 控制器数据指令流; (c) 3D NAND-SS阵列示意图

    Fig. 3.  The 3D NAND-SS architecture: (a) System architecture diagram; (b) controller data and instruction flow; (c) 3D NAND-SS array diagram.

    图 4  单个Block的源线N型多晶硅区域经切分后形成的SL分区俯视图/斜视图

    Fig. 4.  Top and oblique views of the SL partitions formed by segmenting the source-line N-type polysilicon region in a single block.

    图 5  3D NAND-SS 输入和权重映射方案, 输入和权重均被量化成多个1 bit数据沿着横纵两个方向排布

    Fig. 5.  The 3D NAND-SS input and weight mapping scheme: Both inputs and weights are quantized into multiple 1-bit data arranged in horizontal and vertical directions.
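
The mapping idea in Fig. 5 (inputs and weights both decomposed into 1-bit data) can be illustrated with a minimal bit-sliced vector-matrix multiplication. This is only a sketch under stated assumptions, not the authors' mapping algorithm: values are treated as unsigned integers, the function names are hypothetical, and each 1-bit partial sum stands in for the summed bit-line current of one array read.

```python
def bit_sliced_vmm(inputs, weights, in_bits=8, w_bits=8):
    """Vector-matrix product built only from 1-bit x 1-bit partial sums.

    inputs:  list of K unsigned integers (assumed unsigned for simplicity)
    weights: K x N matrix (list of rows) of unsigned integers
    """
    n_rows, n_cols = len(weights), len(weights[0])
    result = [0] * n_cols
    for i in range(in_bits):                      # one input bit plane at a time
        x_plane = [(x >> i) & 1 for x in inputs]
        for j in range(w_bits):                   # one weight bit plane at a time
            for col in range(n_cols):
                # Count cells that are both selected (input bit = 1) and ON
                # (weight bit = 1); this plays the role of the summed current.
                count = sum(
                    x_plane[row] & ((weights[row][col] >> j) & 1)
                    for row in range(n_rows)
                )
                result[col] += count << (i + j)   # shift-and-add recombination
    return result


def reference_vmm(inputs, weights):
    """Plain integer vector-matrix product used as a cross-check."""
    return [
        sum(x * row[col] for x, row in zip(inputs, weights))
        for col in range(len(weights[0]))
    ]


if __name__ == "__main__":
    x = [3, 5, 255]
    w = [[1, 200], [17, 0], [255, 9]]
    print(bit_sliced_vmm(x, w), reference_vmm(x, w))
```

The two printed vectors agree because every product x·w is the sum of its 1-bit partial products weighted by 2^(i+j). How signed values and zero points are handled in the real design is not stated on this page.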

    图 6  3D NAND-SS 计算过程阵列示意图

    Fig. 6.  Array schematic of the 3D NAND-SS computation process.

    图 7  3D NAND-SS计算流水线设计, 处理多TIA复用ADC的情况

    Fig. 7.  The 3D NAND-SS computation pipeline design for the case where multiple TIAs share one multiplexed ADC.

    图 8  (a) 不同量化数下模型token概率(提示词: “How is the weather today?”); (b) 不同开态电流分布下GPT-2-124M模型输出差值概率分布

    Fig. 8.  (a) Token probabilities of the model under different quantization bit widths (prompt: “How is the weather today?”); (b) probability distribution of the GPT-2-124M output differences under different open-state current distributions.
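
To give a feel for why the spread of the open-state current matters, the following Monte Carlo sketch sums the currents of several ON cells, each drawn from a Gaussian distribution, and checks whether the number of ON cells can still be recovered by rounding. It is an illustrative toy model, not the paper's simulator: the 50 nA mean, the 100 ON cells, and the function name are all hypothetical values chosen for the example.

```python
import random


def decode_error_rate(n_on, mean_na, sigma_na, trials=2000, seed=0):
    """Fraction of trials in which the summed current decodes to a wrong count.

    n_on:     number of ON cells contributing to one summed current (assumed)
    mean_na:  mean open-state current per cell in nA (hypothetical value)
    sigma_na: standard deviation of the open-state current in nA
    """
    rng = random.Random(seed)          # fixed seed keeps the sketch repeatable
    errors = 0
    for _ in range(trials):
        total = sum(rng.gauss(mean_na, sigma_na) for _ in range(n_on))
        decoded = round(total / mean_na)   # ideal readout: nearest integer count
        if decoded != n_on:
            errors += 1
    return errors / trials


if __name__ == "__main__":
    for sigma in (0.0, 0.5, 2.5, 10.0):
        print(sigma, decode_error_rate(100, 50.0, sigma))
```

In this toy model the standard deviation of the summed current grows with the square root of the number of ON cells, so the tolerable σ depends on how many currents are summed per operation and on the ADC resolution, neither of which is modeled in detail here.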

    图 9  采用非对称量化和对称量化下的多层感知机下投影层计算误差对比图

    Fig. 9.  Comparison of computation errors in the MLP down-projection layer under asymmetric and symmetric quantization.
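
The comparison in Fig. 9 can be illustrated with a small, self-contained sketch of the two quantization schemes. It assumes textbook per-tensor INT8 quantizers and a synthetic, mostly positive input range that loosely resembles post-activation values; the function names and test data are hypothetical and are not taken from the paper.

```python
def quantize_symmetric(values, bits=8):
    """Symmetric quantization: zero maps to code 0, range is [-max|x|, +max|x|]."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    dequantized = [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]
    return dequantized, scale


def quantize_asymmetric(values, bits=8):
    """Asymmetric quantization: the code range covers [min(x), max(x)] via a zero point."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels
    zero_point = round(-lo / scale)
    dequantized = []
    for v in values:
        q = max(0, min(levels, round(v / scale) + zero_point))
        dequantized.append((q - zero_point) * scale)
    return dequantized, scale


def mean_abs_error(reference, approx):
    return sum(abs(a - b) for a, b in zip(reference, approx)) / len(reference)


if __name__ == "__main__":
    # Synthetic skewed data in [-0.17, 3.0]; purely illustrative.
    data = [-0.17 + i * 3.17 / 999 for i in range(1000)]
    sym, sym_scale = quantize_symmetric(data)
    asym, asym_scale = quantize_asymmetric(data)
    print("symmetric :", sym_scale, mean_abs_error(data, sym))
    print("asymmetric:", asym_scale, mean_abs_error(data, asym))
```

For skewed data the symmetric scheme spends almost half of its codes on values that never occur, so its step size is roughly twice that of the asymmetric scheme. Whether this matches the exact quantizers used in the simulator is not stated on this page.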

    图 10  (a) 3D NAND-SS架构生成10 Tokens的时间组成; (b) 3D NAND-SS架构生成10 Tokens的功耗组成

    Fig. 10.  (a) Time composition for generating 10 Tokens in the 3D NAND-SS architecture; (b) power composition for generating 10 Tokens in the 3D NAND-SS architecture.

    表 1  仿真平台部分参数表

    Table 1.  Partial parameter table of the simulation platform.

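    | Parameter | Function | Parameter | Function |
    |---|---|---|---|
    | Quantization bits | Quantization bit width | Block setup time | Time constant |
    | Current mean / Scale | Mean / standard deviation of the device open-state current distribution | WL switch time | Time constant |
    | Blocks/Operation | Number of blocks per compute operation | TSG switch time | Time constant |
    | Max current sum | Number of currents summed per compute operation | BL switch time | Time constant |
    | Symmetric mode | Whether symmetric quantization is used | TIA conversion time | Time constant |
    | ADC multiplexing factor | ADC multiplexing factor | ADC conversion time | Time constant |
    | X path current | X-path (horizontal) current | Planes/Die | Hardware constant |
    | Y path current | Y-path (vertical) current | Layers/Die | Hardware constant |
    | Ve | Voltage | Blocks/Plane | Hardware constant |
    | Background current | Background current | TSGs/Block | Hardware constant |
    | Num of TIAs | Number of TIAs | Bit lines/Plane | Hardware constant |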

    表 2  3D NAND-SS 硬件参数

    Table 2.  3D NAND-SS hardware configuration.

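    | Hardware parameter | Value | Hardware parameter | Value |
    |---|---|---|---|
    | Planes per chip | 4 | Layers per chip | 32* |
    | Blocks per plane | 216 | Vertical partitions | 216 |
    | TSGs per block | 10 | Horizontal partitions | 1024 |
    | BLs per plane | 131072 | ADC maximum resolution | 128 |

    * The layer count is reduced to simplify the simulation; the actual product has 128 layers.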

    表 3  GPT-2-124M模型参数

    Table 3.  GPT-2-124M model parameters.

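    | Layer | Compute hardware | Parameter shape | Parameter size (INT8) |
    |---|---|---|---|
    | Embedding layer | CPU/GPU | (50256, 768) | |
    | QKV projection layer | 3D NAND-SS | (768, 2304) | 13.5 MB |
    | Attention matrix computation | CPU/GPU | (sequence length, 768) | |
    | Attention matrix projection | 3D NAND-SS | (768, 768) | 4.5 MB |
    | MLP up-projection layer | 3D NAND-SS | (768, 3072) | 18 MB |
    | Activation function | CPU/GPU | | |
    | MLP down-projection layer | 3D NAND-SS | (3072, 768) | 18 MB |
    | MLP dequantization layer | 3D NAND-SS | (3072, 768) | 18 MB |
    | Normalization | CPU/GPU | | |
    | Residual connection | CPU/GPU | | |
    | Model head | CPU/GPU | (768, 50256) | |

    Note: Only the parameter size of a single attention block is shown. In the actual algorithm the attention blocks are stacked; the GPT-2-124M model has 12 of them.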
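
As a quick arithmetic check on the sizes in Table 3, the listed values can be reproduced if each INT8 weight is assumed to occupy eight 1-bit cells and the total is expressed in units of 2^20 bits. That interpretation is an inference from the numbers, not something stated on this page, and the function name below is hypothetical.

```python
def storage_in_2_20_bits(rows, cols, bits_per_weight=8):
    """Storage of a (rows x cols) weight matrix in units of 2**20 bits.

    Assumes one 1-bit cell per weight bit (Cell precision of 1 bit).
    """
    return rows * cols * bits_per_weight / 2 ** 20


if __name__ == "__main__":
    layers = {
        "QKV projection": (768, 2304),
        "Attention projection": (768, 768),
        "MLP up-projection": (768, 3072),
        "MLP down-projection": (3072, 768),
    }
    for name, (rows, cols) in layers.items():
        print(f"{name}: {storage_in_2_20_bits(rows, cols)}")
```

For example, 768 × 2304 × 8 = 14,155,776 bits, which is exactly 13.5 × 2^20, matching the QKV projection row.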

    表 4  3D NAND-SS计算时间仿真参数

    Table 4.  Simulation parameters for 3D NAND-SS computation time.

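    | Parameter | Value | Parameter | Value |
    |---|---|---|---|
    | Block setup time/μs | $7b_{\text{num}}$ | X-path current/mA | 96.732 |
    | BL switch time/μs | 13 | Y-path current/nA | 150 |
    | WL switch time/μs | 2 | Vcc/V | 2.5 |
    | TSG switch time/μs | 0.8 | ADC+TIA power/mW | 0.5 |
    | TIA conversion time/μs | 0.25 | | |
    | ADC conversion time/μs | 0.002 | | |

    Note: The X-path current is the average current over the time needed to establish all WL voltages of one block in a single plane; the Y-path current is the average current over the time needed to establish one BL in a single plane.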
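
The TIA and ADC conversion times in Table 4 suggest why one ADC can serve many TIAs. The sketch below uses a deliberately simple model that is an assumption of this example, not the paper's pipeline: all TIAs convert in parallel, a single shared ADC digitizes their outputs one after another, and the ADC work can overlap with the next TIA conversion. Times are in integer nanoseconds (0.25 μs = 250 ns, 0.002 μs = 2 ns) to avoid floating-point rounding; both function names are hypothetical.

```python
def max_hidden_mux_factor(tia_ns, adc_ns):
    """Largest number of TIAs whose serial ADC readout fits in one TIA conversion."""
    return tia_ns // adc_ns


def readout_time_ns(n_tia, tia_ns, adc_ns):
    """Unpipelined time for one parallel TIA conversion plus n_tia serial ADC conversions."""
    return tia_ns + n_tia * adc_ns


if __name__ == "__main__":
    TIA_NS, ADC_NS = 250, 2   # from Table 4, converted to nanoseconds
    print(max_hidden_mux_factor(TIA_NS, ADC_NS))
    print(readout_time_ns(4, TIA_NS, ADC_NS))
```

Under these assumptions up to 125 TIAs could share one ADC before the ADC becomes the slower stage. The multiplexing factor actually used in the design is a simulator parameter (Table 1) whose value is not given on this page.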

    表 5  综合对比

    Table 5.  Benchmark.

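    | Device technology node | 32 nm 3D NAND[11] | 40 nm 3D NAND-SS | 40 nm 3D NAND-SS |
    |---|---|---|---|
    | ADC precision/bit | 7 | 7 | 7 |
    | Cell precision/bit | 1 | 1 | 1 |
    | Area/mm² | 17.91 | 40 | 40 |
    | Capacity utilization/% | 33.5 @INT8 | 17 @INT8 | 60 @INT8 |
    | Compute throughput/TOPS | 0.0018 | 4.57 | 4.57 |
    | Energy efficiency/(TOPS·W⁻¹) | 12.95 @INT8 | 5.93 @INT8 | 7.17 @INT8 |
    | Workload model | ResNet-18 | GPT-2-124M | GPT-2-355M |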
  • [1] Singh Parihar S, Kumar S, Chatterjee S, Pahwa G, Singh Chauhan Y, Amrouch H 2025 IEEE J. Explor. Solid-State Comput. Devices Circuits 11 34
    [2] Molom-Ochir T, Taylor B, Li H, Chen Y R 2025 IEEE Trans. Circuits Syst. I 72 3971
    [3] Wu B, Lv X R, Yu T Y, Chen K, Liu W Q 2025 IEEE Nanotechnol. Mag. 3 19
    [4] Li H W, Yao E Y, Qin P, Jiang S 2025 IEEE Trans. Magn. 61 3401306
    [5] Khwa W S, Wen T H, Hsu H H, Huang W H, Chang Y C, Chiu T C, Ke Z E, Chin Y H, Wen H J, Hsu W T, Lo C C, Liu R S, Hsieh C C, Tang K T, Ho M S, Lele A S, Teng S H, Chou C C, Chih Y D, Chang T Y J, Chang M F 2025 Nature 639 617
    [6] Sharma V, Zhang X, Dhakad N S, Kim T T H 2025 IEEE Trans. Circuits Syst. I 72 5696
    [7] Liu S Q, Wei S T, Yao P, Wu D, Jie L, Pan S N, Tang J S, Gao B, Qian H, Wu H Q 2025 J. Semicond. 46 062304
    [8] Chang S H, Yen R H, Liu C N 2025 ACM J. Emerg. Technol. Comput. Syst. 21 4
    [9] Zhang Y Q, Wang J J, Lyv Z Y, Han S T 2022 Acta Phys. Sin. 71 148502 (in Chinese) [张宇琦, 王俊杰, 吕子玉, 韩素婷 2022 Acta Phys. Sin. 71 148502]
    [10] Shim W, Yu S 2021 IEEE J. Explor. Solid-State Comput. Devices Circuits 7 1
    [11] Hong Y, Kim M, Kim C 2025 techrxiv: 174439324.42202505
    [12] Shim W, Yu S M 2021 IEEE J. Explor. Solid-State Comput. Devices Circuits 7 61
    [13] Shim W, Yu S M 2021 IEEE Electron Device Lett. 42 160
    [14] Chen Y Y, He Y H, Miao X S, Yang D H 2022 Acta Phys. Sin. 71 210702 (in Chinese) [陈阳洋, 何毓辉, 缪向水, 杨道虹 2022 Acta Phys. Sin. 71 210702]
    [15] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł, Polosukhin I 2017 Advances in Neural Information Processing Systems (Curran Associates, Inc.) p5998
    [16] Hanna M, Liu O, Variengien A 2023 Adv. Neural Inf. Process. Syst. 36 76033
    [17] Lue H T, Hsu P K, Wei M L, Yeh T H, Du P Y, Chen W C, Wang K C, Lu C Y 2019 2019 IEEE International Electron Devices Meeting (IEDM) San Francisco, USA, December 9–11, 2019 p38.1.1
    [18] Kim M, Liu M, Everson L, Park G, Jeon Y, Kim S, Lee S, Song S, Kim C H 2019 2019 IEEE International Electron Devices Meeting (IEDM) San Francisco, USA, December 9–11, 2019 p38.3.1
    [19] Kang M, Kim H, Shin H, Sim J, Kim K, Kim L S 2022 IEEE Trans. Comput. 71 1291
    [20] Lee S T, Yeom G, Yoo H, Kim H S, Lim S, Bae J H, Park B G, Lee J H 2021 IEEE Trans. Electron Devices 68 3365
    [21] Lee S T, Lee J H 2020 Front. Neurosci. 14 571292
    [22] Wong R, Kim N, Higgs K, Agarwal S, Ipek E, Ghose S, Feinberg B 2024 arXiv: 2403.06938
    [23] Hsieh C C, Lue H T, Li Y C, Hung S N, Hung C H, Wang K C, Lu C Y 2023 2023 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits) Kyoto, Japan, June 11–16, 2023 p1

Publication history
  • Received:  2025-07-08
  • Revised:  2025-10-01
  • Published online:  2025-10-10
