

2020 IEEE International Solid-State Circuits Conference

ISSCC 2020 / SESSION 33 / NON-VOLATILE DEVICES FOR FUTURE ARCHITECTURE / 33.2

33.2 A Fully Integrated Analog ReRAM Based 78.4TOPS/W Compute-In-Memory Chip with Fully Parallel MAC Computing

Qi Liu1, Bin Gao1, Peng Yao1, Dong Wu1, Junren Chen1, Yachuan Pang1, Wenqiang Zhang1, Yan Liao1, Cheng-Xin Xue2, Wei-Hao Chen2, Jianshi Tang1, Yu Wang1, Meng-Fan Chang2, He Qian1, Huaqiang Wu1

1 Tsinghua University, Beijing, China
2 National Tsing Hua University, Hsinchu, Taiwan

Non-volatile memory (NVM) based computing-in-memory (CIM) shows significant advantages in handling deep learning tasks for artificial intelligence (AI) applications. To overcome the decreasing cost effectiveness of transistor scaling and the intrinsic inefficiency of data shuttling in the von Neumann architecture, CIM has been proposed to realize high-speed, low-power systems with parallel multiply-accumulate (MAC) computing [1][2]. However, current demonstrations are mainly based on a single macro and offer limited computing parallelism. A fully integrated CIM chip with a complete neural network model is still missing. The major challenges are: (1) the IR drop and transient errors that arise when carrying out MAC operations in non-volatile memory arrays decrease the computing accuracy and further limit the parallelism; (2) the interface blocks between different arrays are inefficient because of the power overhead of the A/D and D/A converters (shown in Fig. 33.2.1). To address these challenges, this work proposes: (1) a sign-weighted 2T2R (SW-2T2R) array that reduces the IR drop by decreasing the accumulated SL current (I_SL), and eventually boosts the computing parallelism; (2) a low-power interface design with a resolution-adjustable LPAR-ADC that enables a flexible tradeoff between system accuracy and power consumption. In this manner, this work implements a fully integrated 784-100-10 MLP model on an integrated CIM chip with 158.8kb of analog ReRAM. The chip achieves high recognition accuracy (94.4%) on the MNIST database, high inference speed (77 μs/Image), and 78.4 TOPS/W peak energy efficiency. The CMOS circuits are fabricated in a 130nm process.

Figure 33.2.2 presents the algorithm, structure and workflow diagram of the proposed ReRAM-based CIM chip. This work realizes a two-layer perceptron model, which consists of two fully connected weight arrays and three neuron layers. Accordingly, the chip is composed of a SW-2T2R array, a 1T1R array, input/output buffers, LPAR-ADCs, etc. In the SW-2T2R array, the positive weight and negative weight of a differential device pair are connected to the same output column, which differs from Ref. [2] and [3]. An x-bit signed weight (1-bit sign, (x-1)-bit data) is stored in a SW-2T2R cell. During n parallel MAC operations, n LPAR-ADCs clamp the SLs to a fixed voltage (V_SL_CLP) and convert the SL currents to digital outputs. The SL current is the accumulated result of all the SW-2T2R cell currents on the same column. Each MAC operation evaluates the product of one m-dimensional 1-bit input vector and an x-bit signed-weight vector. The output of each LPAR-ADC is stored in registers and fed simultaneously to the next ReRAM array as input data. The resolution of the LPAR-ADC is adjusted by changing the sampling clock frequency. This flexible configuration of the interface block helps balance system accuracy against power consumption. The output of the second ReRAM array, i.e. MAC2-OUT, is sampled by counters and stored in output buffers (shown in Fig. 33.2.2). If the resolution of the 1st-stage ADC is configured as N1 bits, it generates 2^N1 pulses. Similarly, if the resolution of the 2nd-stage ADC is set to N2 bits, the 2nd-stage ADC generates 2^N2 pulses for each pulse of the 1st-stage output. Thus, the inference of one image takes at least 2^(N1+N2) cycles. A higher ADC resolution can lead to better system recognition accuracy, at the cost of more energy and longer latency.
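The MAC operation and the cycle count described above can be illustrated with a minimal Python sketch. It is a purely functional model, not the chip's circuit behavior: the function names (`mac`, `parallel_mac`, `min_inference_cycles`) and the small example weights are hypothetical and chosen only for illustration.

```python
def mac(inputs, weights):
    """Dot product of a 1-bit input vector with a signed integer weight vector."""
    if len(inputs) != len(weights):
        raise ValueError("input and weight vectors must have the same length")
    if any(x not in (0, 1) for x in inputs):
        raise ValueError("inputs must be 1-bit values (0 or 1)")
    return sum(x * w for x, w in zip(inputs, weights))


def parallel_mac(inputs, weight_columns):
    """n MACs, one per output column, all sharing the same input vector."""
    return [mac(inputs, column) for column in weight_columns]


def min_inference_cycles(n1_bits, n2_bits):
    """Lower bound on cycles per image: 2^N1 first-stage pulses, each
    expanded into 2^N2 second-stage pulses."""
    return (2 ** n1_bits) * (2 ** n2_bits)


if __name__ == "__main__":
    # Hypothetical 4-input, 2-column example with signed integer weights.
    columns = [[3, -2, -1, 0], [-1, 1, 2, -3]]
    print(parallel_mac([1, 0, 1, 1], columns))
    print(min_inference_cycles(2, 8))
```

On the chip the n column results are produced in the analog domain at the same time; the list comprehension here only models the arithmetic, not the parallelism.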

Figure 33.2.3 presents the structure, operating timing diagram and truth table of the SW-2T2R array. In a SW-2T2R cell, two ReRAMs represent the positive and negative weights by using opposite voltage polarities during the inference stage. If V_SL = V_CLP, V_BLP = V_CLP - V_READ and V_BLN = V_CLP + V_READ, then G_POS and G_NEG represent the positive and negative weight, respectively. The equivalent weight of the SW-2T2R cell is W_CELL (= G_POS - G_NEG), which can be positive, negative or zero. The SL current attributed to this weight pair equals the difference between the currents flowing through the positive cell and the negative cell. This current is proportional to W_CELL according to I_CELL = V_READ * W_CELL. The 2T2R structure is designed to improve CIM accuracy by reducing the IR drop in two ways: (a) if G_POS = G_NEG, I_CELL is reduced to zero; (b) the currents through the positive and negative weights on the same column cancel out locally. ReRAM precision determines the weight precision of a SW-2T2R cell, as shown in the truth tables of Fig. 33.2.2. The ReRAM-based weight representation is defined according to the intermediate states of the device. If a single ReRAM device works as a 1-bit (2 device levels) or 2-bit (4 device levels) cell, the weight precision of a SW-2T2R cell is signed quasi-2-bit (3-level) or signed quasi-3-bit (7-level), respectively. According to off-chip tests on the same ReRAM stacks, the device conductance can be tuned continuously. However, the on-chip ADC resolution limits the on-chip ReRAM conductance to at most 256 quantized states.
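As a hedged illustration of the cell equation I_CELL = V_READ * (G_POS - G_NEG) and of the 3-level and 7-level weight counts, the sketch below models conductances as idealized, equally spaced integer levels. That equal spacing, the function names and the example values are assumptions made for this example; real device levels need not be evenly spaced.

```python
from itertools import product


def cell_current(g_pos, g_neg, v_read):
    """Net SL current of one SW-2T2R cell: I_CELL = V_READ * (G_POS - G_NEG)."""
    return v_read * (g_pos - g_neg)


def column_current(inputs, g_pos_col, g_neg_col, v_read):
    """Net SL current of one column; only rows with a 1-bit input of 1 contribute."""
    return sum(
        x * cell_current(gp, gn, v_read)
        for x, gp, gn in zip(inputs, g_pos_col, g_neg_col)
    )


def weight_levels(device_levels):
    """Distinct W_CELL = G_POS - G_NEG values when each device takes the
    idealized integer levels 0 .. device_levels - 1 (an assumption)."""
    return sorted({p - n for p, n in product(range(device_levels), repeat=2)})


if __name__ == "__main__":
    print(weight_levels(2))       # 1-bit devices
    print(len(weight_levels(4)))  # 2-bit devices
    # A +3 cell and a -3 cell on the same column cancel locally.
    print(column_current([1, 1, 0], [3, 0, 2], [0, 3, 1], 1))
```

With two levels per device the differences are -1, 0 and +1 (3 levels), and with four levels they run from -3 to +3 (7 levels), which matches the quasi-2-bit and quasi-3-bit counts in the text.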

Figure 33.2.4 shows the structure and timing diagram of the LPAR-ADC. The LPAR-ADC is composed of three sub-modules: an integrator, a comparator and a segmented-capacitor DAC (SC-DAC). The integrator consists of an operational amplifier (OPA) and an integrating capacitor. The integrator clamps the SL to V_CLP and converts the SL current to an analog voltage signal. The SC-DAC generates a ramp voltage signal from V_CLP to V_DD. The comparator compares the ramp voltage signal with the integrated voltage signal. The ADC workflow includes three phases: (1) PH1: Reset the ADC by keeping the RST_integ/EN_integ switches 'ON' and the EN_DAC/EN_comp switches 'OFF'. In this phase, SL and OUT_integ are clamped to V_CLP, and V_RAMP remains at its initial voltage, i.e., V_CLP. (2) PH2: Sample I_SL. In this phase, the RST_integ switch is turned off and C_integ is connected to the SL current. Charges from the SL and the OPA accumulate on the capacitor, and the voltage of OUT_integ changes accordingly. EN_comp is enabled to pre-charge the comparator. (3) PH3: MAC-OUT. In this phase, EN_integ switches off to cut off the SL current read path, while the voltage of OUT_integ is maintained. The SC-DAC starts to count and generates a ramp voltage signal. The comparator compares the output of the integrator with that of the SC-DAC and generates a spike pulse. As illustrated in Fig. 33.2.2, the power consumption of the LPAR-ADC is controlled by the reference current source of the integrator and comparator, and the ADC resolution is configured by setting the frequency of the sampling clock. It is worth mentioning that the integration and quantization method can filter out current overshoot and fluctuation by averaging the accumulated I_SL over the integrating period. The quantized output minimizes the transient errors.
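The three phases can be mimicked with a small behavioral model in Python. This is a sketch under stated assumptions, not the authors' circuit: it assumes an ideal integrator whose output rises from V_CLP by Q/C, an ideal staircase ramp with 2^N equal steps from V_CLP to V_DD, and an output code equal to the index of the first ramp step that reaches the integrated voltage. The function name and all component values are hypothetical.

```python
def lpar_adc_code(i_sl_samples, dt, c_integ, v_clp, v_dd, n_bits):
    """Behavioral integrate-then-ramp conversion (idealized, hypothetical model)."""
    # PH1: reset, the integrator output starts at the clamp voltage.
    v_int = v_clp

    # PH2: integrate the SL current; summing the samples averages out
    # short overshoots and fluctuations over the integration window.
    charge = sum(i_sl_samples) * dt
    v_int += charge / c_integ
    v_int = min(max(v_int, v_clp), v_dd)  # keep within the ramp range

    # PH3: step the ramp from V_CLP towards V_DD and report the first
    # step at which the ramp reaches the held integrator voltage.
    steps = 2 ** n_bits
    lsb = (v_dd - v_clp) / steps
    for code in range(steps):
        v_ramp = v_clp + (code + 1) * lsb
        if v_ramp >= v_int:
            return code
    return steps - 1


if __name__ == "__main__":
    # Ten 10 ns samples of 7 uA on a 1 pF capacitor give about 0.7 V of swing.
    samples = [7e-6] * 10
    print(lpar_adc_code(samples, 1e-8, 1e-12, 1.0, 3.0, 2))
```

In this model a higher `n_bits` means more ramp steps per conversion, which is one way to see why resolution trades off against conversion time and energy.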

Figure 33.2.5 shows the experimental results on access time, power consumption, accuracy and speed during inference on the MNIST dataset. With V_DD = 4.2V and V_READ = 0.2V, the MAC-OUT access time is 51.1ns. The test results show that the power consumption of the SW-2T2R CIM chip is 1.9× lower than that of a 1T1R version. All the data are obtained from the same CIM chip with different weight structures. In addition, the test results show that the recognition accuracy on the MNIST dataset is positively correlated with the ADC resolution of both stages, while the inference speed is positively correlated with the resolution of the 1st-stage ADC. When the resolution of the 1st/2nd-stage ADC is configured as 2bit/8bit, the recognition accuracy is ~92% and the inference speed is 77 μs/Image.
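As a rough consistency check, the 2bit/8bit configuration can be combined with the 2^(N1+N2) lower bound on cycles stated earlier. The per-cycle figure below is inferred from those two numbers under the assumption of a uniform cycle length; it is not a value reported in the paper.

```latex
\[
2^{N_1+N_2} = 2^{2+8} = 1024 \ \text{cycles (minimum)}, \qquad
T_{\text{cycle}} \le \frac{77\ \mu\text{s}}{1024} \approx 75.2\ \text{ns}, \qquad
\frac{1}{77\ \mu\text{s}} \approx 1.3 \times 10^{4}\ \text{images/s}.
\]
```

The implied bound of about 75.2 ns per cycle is compatible with the measured MAC-OUT access time of 51.1 ns.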

Figure 33.2.6 shows the test system and software interface used when testing the CIM chip. The test system includes an FPGA board, a test chip board and a host computer. The FPGA moves data and commands between the host computer and the CIM chip. The simulation result of the 784-100-10 fully connected NN shows that the SW-2T2R structure can effectively reduce the accuracy loss due to the IR drop. Using 3-bit signed weights, the test result reaches an accuracy of 93.4%, which is ~2% lower than the simulation result. The comparison between this work and prior works is summarized. This work achieves better performance in a less advanced technology in terms of peak energy efficiency, MNIST recognition accuracy, ADC resolution and inference speed.

Figure 33.2.7 shows the die photomicrograph, the layout of a SW-2T2R cell, and a feature table. In summary, this work implements a 158.8kb ReRAM CIM chip in a 130nm CMOS process. For the first time, a CIM chip is fully integrated for a complete multi-layer NN model; it recognizes MNIST images at a high speed of 77 μs/Image, with 78.4 TOPS/W peak energy efficiency and 94.4% test accuracy.
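The 158.8kb capacity can be related to the 784-100-10 model with simple arithmetic. The sketch below counts the weights of the two fully connected layers and shows that two devices per weight would give exactly 158,800 devices (taking k as 1000). The source does not state how the weights of the 1T1R array are encoded, so the two-devices-per-weight reading is an assumption and this is only a plausibility check.

```python
# Layer widths of the 784-100-10 MLP described in the paper.
layers = (784, 100, 10)

# One weight per connection between consecutive layers.
weights_per_array = [a * b for a, b in zip(layers, layers[1:])]
total_weights = sum(weights_per_array)

# Assumption for illustration: every weight occupies two ReRAM devices.
devices_if_two_per_weight = 2 * total_weights

if __name__ == "__main__":
    print(weights_per_array)
    print(total_weights)
    print(devices_if_two_per_weight)
```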

Acknowledgements:
This work is supported in part by the National Natural Science Foundation of China (61851404), National Key R&D Program of China (2016YFA0201801), Beijing Municipal Science and Technology Project (Z191100007519008), Huawei Project (YBN2019075015), Tsinghua and National Tsinghua joint project, and Beijing Innovation Center for Future Chips (ICFC).

References:
[1] W.-H. Chen et al., "A 65nm 1Mb Nonvolatile Computing-In-Memory ReRAM Macro with Sub-16ns Multiply-and-Accumulate for Binary DNN AI Edge Processors," ISSCC, pp. 494-496, Feb. 2018.
[2] R. Mochida et al., "A 4M Synapses Integrated Analog ReRAM Based 66.5 TOPS/W Neural-Network Processor with Cell Current Controlled Writing and Flexible Network Architecture," VLSI, pp. 175-176, 2018.
[3] C.-X. Xue et al., "A 1Mb Multibit ReRAM Computing-In-Memory Macro with 14.6ns Parallel MAC Computing Time for CNN Based AI Edge Processors," ISSCC, pp. 388-390, Feb. 2019.

978-1-7281-3205-1/20/$31.00 ©2020 IEEE


ISSCC 2020 / February 19, 2020 / 2:00 PM

Figure 33.2.1: Computing-in-memory for AI applications.

Figure 33.2.2: Proposed fully-integrated computing-in-memory chip.

Figure 33.2.3: Signed-weighted 2T2R array.

Figure 33.2.4: Structure and timing diagram of LPAR-ADC.

Figure 33.2.5: Measurement results during inference on MNIST dataset.

Figure 33.2.6: Test system and comparison table.


ISSCC 2020 PAPER CONTINUATIONS

Figure 33.2.7: Die micrograph with SW-2T2R layout and feature table.

Figure 33.2.S1: Cross-sectional TEM image and device characteristics of ReRAM cells.

Figure 33.2.S2: Comparison diagram between this work and prior works.

Figure 33.2.S3: Wafer micrograph, 2bit-ReRAM mapping and 2bit-input accuracy.