A Survey of Spintronic Architectures for
Processing-in-Memory and Neural Networks
Sumanth Umesh* and Sparsh Mittal†
*IIT Jodhpur, †IIT Hyderabad.
E-mail: sumanth.2@iitj.ac.in, sparsh@iith.ac.in.
Abstract
The rising overheads of data movement and the limitations of general-purpose processing architectures have led to a surge of interest in the "processing-in-memory" (PIM) approach and "neural network" (NN) architectures. Spintronic memories facilitate efficient implementation of the PIM approach and NN accelerators, and offer several advantages over conventional memories. In this paper, we present a survey of spintronic architectures for PIM and NNs. We organize the works based on their main attributes to underscore their similarities and differences. This paper will be useful for researchers in the areas of artificial intelligence, hardware architecture, chip design and memory systems.
Index Terms
Review; “spin transfer torque RAM”, “spin orbit torque”, “domain wall memory”, “processing-in-memory”,
“machine learning”, “neural networks”
1 INTRODUCTION
As conventional von-Neumann style processors get progressively restricted by the data-movement
overheads [1], the use of the processing-in-memory (PIM) approach has become not merely attractive but imperative. Further, as machine learning algorithms are applied to cognitive tasks of ever-increasing complexity, their memory and computation demands are escalating rapidly. Since traditional
processors are unable to meet these requirements, design of domain-specific accelerators has become
essential. These factors and trends call for research into novel memory technologies, architectures and
design approaches.
Spintronic memories allow performing computations such as arithmetic and logic operations inside
memory. Also, they allow efficient modeling of neurons and synapses, which makes them useful for accelerating neural networks [2]. These properties, along with the near-zero standby power and high
density of spintronic memories make them promising candidates for architecting future memory systems
and even computing systems.
Use of spintronic memories, however, also presents key challenges. Compared to SRAM and DRAM,
spintronic memories have higher latency and write energy. Also, most of the existing proposals have
implemented simple neuron models, such as a neuron producing a "binary output" based on the sign of the input. However, NN architectures aimed at solving complex cognitive tasks require more realistic neuron models [2]. Further, since some spin neuron-synapse units cannot be connected through
spin-signaling [3], they need to be connected using CMOS (complementary metal-oxide semiconductor)
based charge-signaling. Evidently, design of spintronic accelerators for PIM and NN is challenging and
yet rewarding. Several circuit, microarchitecture and system-level techniques have been recently proposed towards this end.

• Sumanth worked on this paper while working as an intern at IIT Hyderabad. Support for this work was provided by the Science and Engineering Research Board (SERB), India, award number ECR/2017/000622.
Contributions: In this paper, we present a survey of spintronic accelerators for PIM and NNs. Figure 1 summarizes the contents of this paper. Section 2 provides a background on key concepts and a classification of research works on key parameters. Sections 3 and 4 present techniques for designing logic and arithmetic units, respectively. Section 5 discusses spintronic accelerators for a range of application
domains. In these sections, we focus on qualitative insights and not on quantitative results.
Paper organization:
§2 Background and motivation: §2.1 Magnetic tunneling junction; §2.2 VCMA and VCMA assisted STT devices; §2.3 Domain wall memory devices; §2.4 Skyrmions and skyrmion based racetracks; §2.5 Spintronics v/s all-spin logic; §2.6 Complete logic set; §2.7 Classification
§3 Spintronic logic units: §3.1 Bitwise logic; §3.2 Programmable switch and logic element; §3.3 Multiplexer and encoder; §3.4 Random number generator
§4 Spintronic arithmetic units: §4.1 Adder designs; §4.2 Approximate adder designs; §4.3 Multiplier designs; §4.4 Majority gate based designs; §4.5 LUT designs
§5 Spintronic accelerators for various applications: §5.1 Neuromorphic computing; §5.2 Image processing; §5.3 Data encryption; §5.4 Associative computing
§6 Conclusion and future outlook

Fig. 1. Organization of the paper
Finally, Section 6 concludes this paper with a discussion of future challenges. This paper will be
useful for researchers interested in the confluence of machine learning, hardware architecture and memory
architectures. Table 1 shows the acronyms used in this paper. Input and output carry are shown as Ci and Co, respectively.
TABLE 1
Acronyms used frequently in this paper

ADC: analog-to-digital converter
AES: advanced encryption standard
ANN/CNN: artificial/convolutional neural network
CMOL: hybrid CMOS/nanowire/MOLecular
CMOS/pMOS/nMOS: complementary/P-type/N-type metal-oxide semiconductor
DPU: digital processing unit
DRAM: dynamic random access memory
DW: domain wall
DWM: domain wall memory
LSB/MSB: least/most significant bit
LSV: lateral spin valve
LUT: look-up table
MAC: multiply and accumulate
MCA: memristive crossbar array
MJG: majority gate
MTJ: magnetic tunnel junction
MUX/DEMUX: multiplexer/demultiplexer
NMC: nano-magnetic channel
NMOS: N-type metal-oxide-semiconductor logic
NVM: non-volatile memory
PIM: processing in memory
RRAM: resistive RAM
SA: sense amplifier
SRAM: static random access memory
TCAM: ternary content addressable memory
VCMA: voltage-controlled magnetic anisotropy

2 BACKGROUND AND MOTIVATION
We now discuss relevant concepts and refer the reader to prior works for a background on NVMs [4–7].
2.1 Magnetic tunneling junction
An MTJ is a device consisting of two "ferromagnetic layers" separated by a thin metal-oxide tunneling layer [8, 9]. The relative spin orientation of the two "ferromagnetic layers" is leveraged to store binary data. The layers can be in two possible orientations: one where both layers have the same, or parallel, spins and the other where the layers have opposing, or antiparallel, spins, as shown in Figure 2(a). In the parallel orientation, the tunneling effect occurs in the oxide layer, resulting in low resistance across the MTJ. In the anti-parallel orientation, the tunneling of electrons across the oxide layer is hindered, resulting in high resistance. These two resistance states denote the binary logic states 'high' and 'low'. In an MTJ, the orientation of one of the "ferromagnetic layers" is fixed and this layer is termed the 'reference'
or ‘fixed’ layer. The second “ferromagnetic layer” is left free to change orientation and it is termed the
‘free’ layer. Altering the orientation of the free-layer provides a switching mechanism that toggles logic
states, similar to that in a transistor.
Fig. 2. (a) "Parallel" and "anti-parallel" orientations of MTJs (b) STT-MTJ (c) SOT-MTJ
The switching mechanisms are mainly of two types, “spin transfer torque” (STT) and “spin orbit
torque” (SOT). In case of STT switching [8], an unpolarized current is passed through the fixed layer whose
spin imparts an angular momentum which results in a spin-polarized current. This current, when passed
through the free layer, transfers its angular momentum resulting in a change in free layer’s orientation.
MTJs switched using the STT effect are termed STT-MTJs and they are shown in Figure 2(b). In the case
of SOT switching [10], the free layer is attached to a strip of heavy metal. In order to write into the MTJ,
“spin Hall effect” is leveraged where an unpolarized current through the “heavy metal layer” results in a
“spin-polarized current” in a direction perpendicular to that of the unpolarized current. The spin current
so produced transfers its angular momentum to the free layer, resulting in the switching action. MTJs switched using the "spin Hall effect" are termed SOT-MTJs and they are shown in Figure 2(c).
Figure 3 shows standard STT-RAM and SOT-RAM bit-cells. An STT-RAM bit-cell makes use of the
same set of terminals across the ferromagnetic layers for both read and write operations. On the other
hand, an SOT-MTJ has separate sets of terminals for read and write operations. Although the SOT-RAM bit-cell requires extra terminals for its operation, it has the advantage that the write and read operations can be optimized independently.
Fig. 3. (a) STT-RAM bit-cell and array (b) SOT-RAM bit-cell and array
Challenges of STT-RAM and SOT-RAM: Compared to SRAM, STT-RAM has higher write latency and
energy [11]. With ongoing feature-size scaling, its "sensing margin" reduces further [12]. Scaling also leads to a decrease in the critical current required for switching, which reduces the write energy overhead [13]. However, the read current does not scale much; hence, the write current approaches the read current, leading to the phenomenon of "read disturbance" [12, 14]. Both STT-RAM and SOT-RAM suffer from thermal instability, which leads to retention failures. The instability increases with scaling and poses a challenge to the use of both memories. By virtue of using separate paths for read and write, SOT-RAM does not suffer from "read disturbance"; however, due to this, its bit-cell area is higher than that of STT-RAM.
2.2 VCMA and VCMA assisted STT devices
VCMA-switched MTJs rely on voltage for switching rather than the current used by STT-MTJs and SOT-MTJs. VCMA MTJs have thicker oxide layers that act as capacitors [15, 16]. When a voltage pulse is applied across the terminals, charge accumulates at the oxide-ferromagnet interfaces, which, in turn, changes the occupancy of the atomic orbitals. The change in orbital occupancy, combined with the STT effect, induces a change in the magnetic anisotropy of the MTJ. However, for voltages greater than the threshold voltage, the orientation of the free layer oscillates and the final orientation depends on the duration of the voltage pulse. To eliminate this dependency on pulse duration, the VCMA-assisted STT mechanism is used: a sufficiently large voltage is applied to induce oscillation in the free layer, and a smaller voltage pulse is applied for a longer duration to generate the STT effect and stabilize the final orientation of the free layer [15]. The use of voltage rather than current for switching results in significantly reduced power consumption due to minimal Joule heating and Ohmic losses, which are a major concern in the case of STT-MTJs and SOT-MTJs. VCMA-MTJs also have higher packing densities than their STT and SOT counterparts.
2.3 Domain wall memory devices
A DWM device consists of a ferromagnetic nanowire in which a boundary between regions of opposing spin orientation forms a DW [17]. The DW
thus formed can be moved through use of spin-polarized currents. Similar to MTJs, domain wall devices
too can be operated using STT or SOT mechanisms. Here, instead of switching the orientation of free layer,
the STT and SOT techniques are used to displace the DW. Based on the mechanism used, the devices are
termed as STT-DWM or SOT-DWM devices.
Racetracks are made up of ferromagnetic nanowires of lengths sufficient to accommodate multiple domain walls [18]. Each racetrack possesses nanoscale notches that stabilize the domain walls. This allows
each racetrack to store multiple bits [4]. A read/write head is formed by placing a ferromagnetic layer
on top of the racetrack to form an MTJ. The data to be accessed is brought under the read/write MTJ
by shifting it through domain wall motion. The key challenge in use of DWM is the latency and energy
overhead of shift-operations [4].
2.4 Skyrmions and skyrmion based racetracks
Magnetic skyrmions are topologically stable field configurations that possess particle-like properties [19].
They are created as a result of the competing effects of Dzyaloshinskii-Moriya interactions, magnetic anisotropy and ferromagnetic exchange coupling in bulk ferromagnets and magnetic thin films [15, 20]. Skyrmions
have gained attention as candidates for racetracks due to their topological stability, low driving current
and small size. A racetrack with skyrmions instead of domain-walls stores data based on presence and
absence of skyrmions and not based on orientation of layers as in domain-wall racetracks. Such racetracks
would require a read head for skyrmion detection, a write head for skyrmion creation and a nanowire
for skyrmion motion along with CMOS based peripheral circuitry. Such skyrmion based racetracks can
outperform domain-wall based racetracks in terms of power consumption, packing density and robustness
[20].
2.5 Spintronics v/s all-spin logic
Spintronics refers to devices that use both CMOS and spin-based components. An example of a spintronic device is STT-RAM, which makes use of CMOS transistors and MTJs. These devices utilize both charge and spin-polarized currents: the spin-polarized currents are used for altering magnetic states, whereas the charge currents operate the transistors. By comparison, all-spin logic makes use of only the spin-polarized
components of the currents. An example of an all-spin logic device is a “lateral spin valve” [21]. Lateral
spin valves consist of ferromagnetic layers placed above conducting channels [21]. The input currents
through the metallic channels are spin polarized and exert STT effect on the ferromagnetic layers resulting
in a change in the magnetic orientation. The resultant sum of the spin polarized inputs is responsible for
switching of the ferromagnetic layers.
2.6 Complete logic set
A set of Boolean functions is said to form a complete logic set if all Boolean functions can be implemented
as a combination of the members of the logic set. The most commonly used complete logic sets are (1)
AND, OR and NOT (2) NAND (3) NOR (4) implication and NOT (5) majority gate.
Among these logic sets, the first three are grouped together under “reconfigurable” logic [22]. This is
due to the fact that PIM architectures relying on the technique of “reconfigurable logic” are capable of
implementing AND, OR, NOT, NAND and NOR with almost the same ease, i.e., such designs are free to
make use of at least three (AND, OR and NOT) of the aforementioned logic gates to implement Boolean
functions. Reconfiguration is accomplished by varying the reference voltage provided to SAs. On the other
hand, the PIM architectures relying on implication technique can use only IMP and NOT operations, while
majority gate based architectures can use only one gate.
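To make the completeness claim concrete, the following Python sketch (our illustration, not taken from any surveyed work) verifies that implication together with NOT suffices to express the other basic gates:

    # Completeness of the {implication, NOT} logic set: AND, OR and NAND
    # can all be rewritten using only IMP and NOT.
    def NOT(a: int) -> int:
        return 1 - a

    def IMP(a: int, b: int) -> int:
        # Material implication: a -> b is false only for a = 1, b = 0.
        return NOT(a) | b

    def OR(a: int, b: int) -> int:
        return IMP(NOT(a), b)        # (NOT a) -> b  ==  a OR b

    def AND(a: int, b: int) -> int:
        return NOT(IMP(a, NOT(b)))   # NOT(a -> NOT b)  ==  a AND b

    def NAND(a: int, b: int) -> int:
        return IMP(a, NOT(b))

    # Exhaustive check over all input combinations.
    for a in (0, 1):
        for b in (0, 1):
            assert OR(a, b) == (a | b)
            assert AND(a, b) == (a & b)
            assert NAND(a, b) == 1 - (a & b)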
2.7 Classification
Table 2 first organizes the projects according to the type of memory used by them. Then, it underscores
their optimization objective. Table 2 further shows the application domains of different spintronic
architectures. Finally, it shows the projects that perform comparative evaluation of spintronic architectures
with other platforms e.g., CPU, FPGA, etc.
TABLE 2
A classification based on memory technology, optimization goal, application domain and comparative evaluation

Memory technology used
- STT-RAM: [2, 22–43]
- SOT-RAM: [2, 44–52]
- VCMA: [16, 53, 54]
- DWM: [2, 3, 17, 45, 55–68]
- Skyrmion: [15, 20, 69–71]

Optimization objective
- Performance: nearly all
- Energy: [2, 3, 17, 21–24, 26, 28, 29, 32, 33, 35, 36, 42–44, 46–48, 50–52, 55–67]
- Reliability: [22, 27, 30, 31, 47, 60, 64, 66]

Application area
- Neuromorphic computing: [2, 3, 31, 32, 47, 48, 59–62, 68]
- Image processing: [24, 61, 66]
- Encryption: [29, 45, 50, 51, 64, 65, 67]
- Associative computing: [23]

Comparison of spintronic architectures with
- CPU: [24, 50, 51, 58, 59, 64–67]
- GPU: [24, 48]
- FPGA: [48]
- ASIC: [29, 48, 50, 51, 64, 65, 67]
- CMOS: [2, 17, 21, 26, 28, 32, 36, 41, 43, 46, 52, 61, 63, 66]
- CMOL: [29, 50, 51, 64, 65, 67]
3 SPINTRONIC LOGIC UNITS
In this section, we discuss spintronic PIM architectures for bitwise operations (Section 3.1), programmable switches and logic elements (Section 3.2), MUXes and encoders (Section 3.3) and random number generators
(Section 3.4). Table 3 classifies the PIM architectures for performing logic operations based on their design
features. It classifies the works as all-spin or spintronic logic. It then shows the bit-cell designs used by
different works.
Table 3 then shows the DWM device designs used in PIM accelerators. Further, it shows the circuit modifications performed by PIM architectures. The approaches used for configuring PIM architectures can be broadly divided into two groups. In the first approach, operations are configured by directly providing an appropriate voltage to the reference terminal of the SAs, whereas in the second approach, binary data is either fed dynamically or stored in the memory bit-cells. The former approach is analog in nature, while the latter is digital. The works that use these approaches are also highlighted in Table 3.

TABLE 3
A classification based on design features of PIM accelerators

All-spin logic and spintronic logic designs
- All-spin logic: [21, 42]
- Spintronic logic: nearly all others

Bit-cell designs used in PIM accelerators
- 1T-1MTJ STT-RAM bit-cell with four terminals: [24–27, 29, 53]
- 1T-1MTJ STT-RAM bit-cell with programmable and read-only data: [26]
- 2T-1MTJ STT-RAM bit-cell with four terminals: [23]
- 2T-1MTJ STT-RAM bit-cell with five terminals: [26]
- 2T-1MTJ SOT-RAM bit-cell with five terminals: [45–49]

DWM devices used in PIM accelerators
- Three-terminal DWM: [17, 36, 50, 55, 61, 62, 66, 68]
- Four-terminal DWM: [32, 64]
- Five-terminal DWM: [29]
- DWM racetracks: [18, 56–60, 63, 65, 67]

Circuit techniques/designs used for PIM operations
- Modified bit-cell structure: [25, 26]
- Modified peripheral circuitry: [24, 26, 29, 45]
- Configuration using reference voltages: [24, 26, 29, 45]
- Configuration using binary data: [25–27, 57]
Table 4 first shows the PIM operations performed by different research works. Then, it summarizes the
different logic sets used in PIM architectures. Further, it shows the works that use redundancy for various
objectives. Finally, it highlights the strategies for reducing write overhead.
TABLE 4
A classification of features and optimization strategies of PIM accelerators

PIM operations
- Basic logic operations: [22, 25–27, 29, 30, 45, 46, 53, 58]
- Programmable switch and logic element: [36]
- Multiplexer and demultiplexer: [39]
- Encoder and decoder: [63]

Logic set used for PIM operations
- Reconfigurable logic: [24–26, 29, 38, 41, 42, 45, 49]
- Implication and NOT logic: [27, 53]
- Majority gate logic: [3, 21, 30, 42, 50, 52, 66]

Use of redundancy
- Redundant MTJs to avoid the impact of variation [37] and to provide reliability in the majority voting circuit [30]
- Redundant bits to facilitate shifting in DWM: [58, 65]

Strategies for reducing write overhead
- Achieving writes through shift operations: [55, 57]
- Verify before shift: [57]
3.1 Bitwise logic
Jain et al. [26] propose three STT-RAM based PIM accelerators that perform logic, arithmetic and vector
operations. The first accelerator makes use of conventional STT-RAM arrays and modified peripheral
circuitry. The current flowing through the 'source line' of the STT-RAM array represents the summation of the values stored in the MTJs along the column. The modified peripheral circuitry consists of an additional external input to configure the logic operation, a decoder that sets the signals for the reference current value, a
reference generator that generates reference currents, and a row-decoder that can enable two wordlines simultaneously. They note that a vector operation produces a vector output, and accessing this output requires more than one access. Since reduction operations generally follow vector operations, they use a "reduce unit" which reduces the vector output to a scalar output so that it can be retrieved in a single access.
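The following toy Python model (our simplification, not the authors' circuit) captures the essence of such reference-based configuration: the same summed column current, compared against different reference values, yields different logic operations:

    # Behavioral model of reference-based reconfigurable logic: a SA compares
    # the summed "current" of the selected cells against a programmable
    # reference, so one array access can compute AND, OR or majority.
    def sense(cell_values, reference):
        # Each '1' cell contributes one unit of current on the source line.
        return 1 if sum(cell_values) > reference else 0

    for a in (0, 1):
        for b in (0, 1):
            assert sense([a, b], reference=1.5) == (a & b)  # AND: both must be 1
            assert sense([a, b], reference=0.5) == (a | b)  # OR: any 1 suffices
    assert sense([1, 0, 1], reference=1.5) == 1             # 3-input majority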
The second accelerator works by modifying the bit-cell structure so that it simultaneously stores 1 bit of programmable data and 1 bit of read-only data, as shown in Figure 4. The MTJs attached to BL0 and BL1 store
read-only ‘0’ and ‘1’, respectively. The peripheral circuitry comprises a “pre-charge circuit”, two SAs and
two “current sources”, as shown in Figures 4(a) and 4(b). By pre-charging the “bit-lines” to either reference
voltage or zero, the programmable and read-only data are accessed, respectively. The read-only bits of the
bit-cells are used to form LUT to implement in-memory transcendental functions like logarithmic, sigmoid
and trigonometric functions.
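As a behavioral illustration of this LUT approach (a minimal sketch whose input range and resolution are our own assumptions, not the authors' parameters), a transcendental function such as the sigmoid can be evaluated by table lookup instead of arithmetic:

    # Sigmoid via a precomputed lookup table: the stored samples play the
    # role of the read-only bits, and evaluation reduces to an indexed read.
    import math

    N = 256                 # number of LUT entries (assumed resolution)
    LO, HI = -8.0, 8.0      # assumed input range

    lut = [1.0 / (1.0 + math.exp(-(LO + (HI - LO) * i / (N - 1)))) for i in range(N)]

    def sigmoid_lut(x: float) -> float:
        i = round((x - LO) / (HI - LO) * (N - 1))
        return lut[min(max(i, 0), N - 1)]   # clamp out-of-range inputs

    assert abs(sigmoid_lut(0.7) - 1.0 / (1.0 + math.exp(-0.7))) < 0.01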
Fig. 4. (a) Bit-cell proposed by Jain et al. [26] which stores read-only and programmable data (b) Array structure with read-only and programmable bit-cells (c) 2T-1MTJ bit-cell. Here, BLL = bit logic line, LL = logic line and WLM = write or logic mode.
The third accelerator, shown in Figure 4(c), makes use of a 2T-1MTJ cell array. When “write or logic
mode” (WLM) is set to ‘high’, it works as a standard STT-RAM array that performs read and write
operations. When the WLM is set to ‘low’, it works as a PIM accelerator. Two bit-cells store the operands
and another bit-cell stores the output. Operand and output cells are enabled by the “bit logic line” (BLL)
and connected through the common “logic line” (LL). By appropriately configuring the input cells and
reference voltage, NOT, AND/NAND, OR/NOR and majority functions are implemented. They show
that their technique is more energy efficient than CMOS-PIM accelerators and provides comparable
throughput.
Mahmoudi et al. [27] propose an STT-RAM based PIM accelerator. The proposed architecture implements the implication and NOT logic operations, and all Boolean functions are implemented as a combination of these two operations. To perform the implication operation, a current is passed via the common
combination of these two operations. To perform implication operation, a current is passed via the common
“bit-line”. The bit-cells having the operands are selected by two unequal enabling voltages along their
word-lines. These unequal voltages lead to different channel resistances among the transistors and the
resulting asymmetry results in the implication operation. NOT is performed by directly changing the
orientation of the MTJ. The proposed structure allows implementation of basic Boolean logic functions in
a standard STT-RAM array without modification or extra peripheral circuitry. Their experiments confirm
the efficacy of their technique.
Jaiswal et al. [53] propose a PIM platform using VCMA driven MTJ array. VCMA employs voltage to
switch an MTJ instead of spin-polarized currents like an STT-MTJ. Operations are effected such that the
result is stored in one of the operand bit-cells. The computations of the array are based on implication
and NOT logic. In the case of implication logic, bit-cells containing operands are connected through a
bit-line and the mid-point of the bit-line acts as a voltage divider. The asymmetry in voltages is exploited
to obtain the result. NOT operation is directly performed by toggling the state of MTJ. Remaining Boolean
functions are implemented as a combination of implication and NOT. Their technique does not rely on SAs
and reference voltages for configuring the operations. The in-situ nature reduces the number of bit-cells
required for operations, thereby decreasing both area and logic complexity.
Wang et al. [54] propose a VCMA MTJ capable of implementing stateful Boolean functions such as
AND, OR and XNOR. The device consists of five layers as shown in Figure 5(a). The bias voltage Vb and
out-of-plane magnetic field Hex are leveraged as logic inputs. Based on critical points obtained from the R-H curve (resistance vs. magnetic field curve), Hex is encoded to represent logic 0 and 1 in the form of logic input q. Similarly, Vb is encoded to give logic 0 and 1 as input p.
Fig. 5. (a) VCMA MTJ proposed by Wang et al. [54], a five-layer stack comprising Si/Si-oxide substrate, electrode, Ta, CoFeB, MgO and Co/Pd layers (b) Table for configuring operations: the initial state Ri selects among AND, OR and XNOR
If Ri represents the current state of the MTJ, then the next state of the MTJ, represented by Ri+1, is given by Ri+1 = p̄·Ri + p·q. By setting the value of Ri as shown in Figure 5(b), it is possible to implement AND, OR and XNOR operations. The result of the operation is stored in the MTJ. Their technique allows performing logic operations in a manner similar to memory read and write operations. Also, their proposed VCMA-MTJ has write latency and energy consumption on the order of nanoseconds and femtojoules per bit, respectively.
Comments: The techniques of Jaiswal et al. [53] and Wang et al. [54] both employ VCMA-MTJs. The difference between them lies in the device used: the former makes use of a generic VCMA-MTJ, while the latter uses a specific five-layer VCMA-MTJ. Also, the former performs Boolean computations based on implication and NOT logic, while the latter performs AND, OR and XNOR based operations.
Fan et al. [45] present two dual-mode PIM accelerators. The first accelerator is based on an SOT-RAM
array. Memory access operations are performed by activating the suitable “write wordline” (WWL). For
computations, two wordlines containing the operands are activated using a row-decoder. The operation
to be performed (AND or OR) is determined by the value of the reference voltage on the SA. This is similar in operation to an STT-RAM PIM platform that uses reference voltages to configure operations.
The second accelerator is based on racetrack memory and can implement parallel XOR operation. It
consists of perpendicularly coupled DWM racetracks made up of ferromagnetic nanowires, as shown in
Figure 6. The nanowire mesh is equipped with spin polarizers and sensing MTJs that act as write and read
heads, respectively. The bits present in the intersection region on the perpendicularly coupled nanowires
labeled A and B are taken as inputs while the resistance of the intersection MTJ gives the result of the
XOR operation which is stored in the intersection MTJ itself. This allows very fast and parallel in-memory
XOR computations making it suitable for data encryption. They implement AES data encryption on the
proposed PIM accelerators and show that their implementation consumes lower energy and area when
compared to CPU, CMOL and ASIC designs.
Read heads
Intersection
MTJ
A XOR B
Input A
Input B
Write heads
Fig. 6. Perpendicularly-coupled racetracks design proposed by Fan et al. [45] which can perform XOR operations
Parveen et al. [29] propose an STT-RAM based PIM architecture which can perform two-input logic
operations, viz., AND, OR, XOR, NAND, NOR, XNOR, between operands in a memory array irrespective
of their position. Figure 7(a) shows their PIM accelerator. The proposed design can work as both an NVM and a PIM accelerator. Traditional STT-RAM arrays perform the memory read/write operations, whereas the computation mode is implemented through an extension to the SAs using a 5-terminal DWM device, shown in Figure 7(b), and a differential latch. For Boolean operations, first the domain wall is set
to its initial position and operands are read using SAs. Next, a sensing current is injected through the
extension circuit. The current can flow between any two terminals out of R+, R1- and R2- depending on
configuration of the extension circuit. The direction of this current and reference value of the differential
latch determine which operation is performed.
Fig. 7. (a) PIM accelerator proposed by Parveen et al. [29] (b) Five-terminal DWM device used in their accelerator [29]
Their implementation consumes lower energy than racetrack and other MTJ based PIM implementations. However, it is slower due to the increased latency of individual Boolean computations. Compared to a CMOS-ASIC implementation, their proposed platform provides higher performance
and better energy efficiency for bulk-bitwise operations. Also, their accelerator consumes less energy for
AES data-encryption than CMOS-ASIC and CMOL implementations.
Comments: The STT-RAM and SOT-RAM based architectures proposed by most works [24–27, 45]
require the operands to be located in a common row or column. By comparison, the design of Parveen et
al. [29] does not have this limitation since it reads the operands in two different cycles.
Kang et al. [25] present an STT-RAM based PIM accelerator that performs bulk bitwise operations.
Their design has a complementary STT-RAM array structure and exploits the peripheral circuitry of the
memory with minor modifications. No extra processing units are required. The two operands are stored
in two different wordlines while data from a third wordline is used to configure the logic operation.
Therefore, programming is equivalent to writing to an MTJ. Using this design, AND and OR operations
are performed. It can be extended to incorporate NOT, NAND and NOR by adding a MUX after each
SA. This, however, requires more wordlines to configure the operation. Results show that the latency of performing logic operations on the proposed PIM platform is nearly the same as that of reading from a bit-cell. Their design makes it possible to perform logic operations in a manner similar to memory readout, without additional hardware.
Comments: The techniques of Kang et al. [25] and Jain et al. [26] use binary data for configuring
the array to perform the required operation. The latter uses a special binary input which is provided
dynamically in a continuous manner during operation. In the former case, the configuring inputs are
stored in bit-cells on the STT-RAM array itself.
Zhang et al. [46] present a PIM accelerator based on a voltage-gated SOT-RAM array. The voltage-controlled "spin Hall effect" switching of the MTJ is exploited to perform in-situ logic operations. In the case
of voltage gated MTJs, two inputs are needed for changing the state. As shown in Figure 8(a), one input is
the switching current and the other input is the bias voltage. The critical switching current is modulated
by the "bias voltage" across the MTJ. A single operating MTJ can evaluate the function Bi+1 = A·Z + Ā·Bi, where Bi is the original MTJ state and Bi+1 is the output, which is stored as the new state of B. A is the bias voltage, such that a positive bias represents logic high and zero bias represents logic low. Z represents the polarity of the switching current; by changing the value of Z, two-input AND, OR and XOR functions are implemented, as shown in Figure 8(b).
Since the output is also stored in the same MTJ, the operations performed by their accelerator are in-situ in nature. These MTJs, arranged in a cross-point fashion, form a PIM accelerator suited for bulk-bitwise operations. The MTJs along a row are accessed concurrently using bit-lines, similar to a conventional SOT-RAM. Exploiting this feature allows bitwise operations to be performed with a high degree of parallelism.
Fig. 8. (a) Voltage-gated SOT-MTJ, with bias voltage (A) and switching current (Z) as inputs (b) Using Z as a control signal to implement AND, OR and XOR operations in the work of Zhang et al. [46]: Z = 0 gives Bi+1 = Ā·Bi, Z = 1 gives Bi+1 = A + Bi, and Z = B̄i gives Bi+1 = A XOR Bi
Compared to CMOS logic gates, their proposed design has higher latency due to longer switching time of
MTJs. However, the static power consumption is greatly reduced due to their non-volatile nature.
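A behavioral Python sketch (our illustration of the next-state function, not the authors' circuit model) makes the role of Z explicit:

    # Voltage-gated SOT-MTJ next-state function: B_{i+1} = A*Z + (NOT A)*B_i,
    # where A is the bias voltage and Z the switching-current polarity.
    def next_state(A: int, Z: int, Bi: int) -> int:
        return (A & Z) | ((1 - A) & Bi)

    for A in (0, 1):
        for Bi in (0, 1):
            assert next_state(A, Z=1, Bi=Bi) == (A | Bi)        # Z = 1: OR
            assert next_state(A, Z=0, Bi=Bi) == ((1 - A) & Bi)  # Z = 0: AND-type (complemented bias)
            assert next_state(A, Z=1 - Bi, Bi=Bi) == (A ^ Bi)   # Z = NOT B_i: XOR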
Chang et al. [49] propose a PIM architecture through the integration of SOT-RAM based memory and reconfigurable logic. It consists of a standard SOT-RAM array for memory, SOT reconfigurable logic similar
to the SOT based PIM array [45], interconnections and a controller. The controller handles programming
of SOT logic, instructions and address distribution in the SOT-RAM array. Interconnections facilitate data
transfer between memory and logic. The proposed design makes use of identical memory and storage
elements which avoids the issue of technological incompatibility between DRAM and SOT-RAM. Use of
SOT-MTJ overcomes high latency of STT-MTJ and allows off-line programming and high speed operations.
The proposed design provides higher performance than DRAM and STT-RAM based PIM accelerators.
The performance advantage is even higher for iterative computations that require writing to memory
frequently.
Mahmoudi et al. [22] analyze the two main approaches to in-memory bitwise operations, viz.,
reconfigurable logic and implication logic. Reconfigurable logic refers to techniques that implement
Boolean functions as combinations of spintronic AND/OR, NAND/NOR, XOR/XNOR and NOT gates.
These gates are themselves implemented by applying appropriate reference voltages on SAs. On the other
hand, implication logic refers to techniques in which Boolean functions are implemented as a combination
of implication and NOT operations such as that used in [27]. In the case of implication implementations,
multiple logic fan-outs are handled with the help of a combination of implication and NOT operations
such that intermediate writing and sensing is eliminated. This results in higher reliability and lower power
consumption for implication based systems than reconfigurable implementations. However, the number
of logic steps needed for implementing complex functions is higher in case of implication logic [22] than
in the case of reconfigurable logic.
Fig. 9. Coupled STT-RAM array structure for PIM as proposed by Mahmoudi et al. [22]
They present two approaches to reduce the number of steps. The first approach combines implication
and reconfigurable logic. Such a combination makes it possible to use AND, NAND, implication and NOT operations, which significantly lowers the number of steps needed to implement "complex logic functions".
This approach provides higher performance and energy efficiency than implication implementations, but
also suffers from higher error probabilities. The second approach is based on parallelization of STT-RAM
arrays so that multiple operations are performed simultaneously. It makes use of coupled STT-RAM arrays
as shown in Figure 9. This approach does not reduce the number of steps, but provides faster execution
due to parallelization.
Wang et al. [72] present a spintronic memory which brings together the advantages of STT-RAM and SOT-RAM while eliminating their disadvantages. In the case of STT-RAM and SOT-RAM, the critical current required for transitioning from the parallel to the anti-parallel state is higher than the critical current required for the transition from the anti-parallel to the parallel state. Also, the two critical currents are opposite in direction. These factors lead to source degradation. To compensate for this effect, sufficiently large access transistors need to be used, sized for the worst case (parallel to anti-parallel) of write operations. This leads to high current and reduced reliability for the other case (AP to P). Also, the STT-MTJ has high switching latency. On the other hand, SOT-RAM requires two access transistors, which leads to low packing density. Also, SOT-RAM does not address the problem of source degradation and requires a higher current density than STT-RAM for write operations.
Fig. 10. Structure of a NAND-like block proposed by Wang et al. [72]
Their technique combines STT-RAM and SOT-RAM into a NAND-based Flash memory like structure, as shown in Figure 10. Each string or block comprises MTJs whose free layers are attached to the heavy metal layer. Each MTJ is accompanied by one access transistor and a pair of pMOS and nMOS select transistors. The write operation is carried out in two steps. First, a current Ierase is passed through the heavy metal layer, which sets all the elements of the block to AP. Second, the access transistors for the MTJs to be switched and the pMOS select transistor are turned on, the nMOS transistor is turned off and the bit-lines are grounded. The write current Iwrite induces switching through the STT effect. For a read operation, the access transistor is turned on, the pMOS transistor is turned off and the nMOS transistor is turned on. The current Iread through the MTJ is passed through a sense amplifier to obtain the bit value.
Their technique successfully addresses the problem of source degradation since both Iwrite and Ierase are
unidirectional. It has lower power consumption compared to STT-RAM and occupies less area compared
to SOT-RAM.
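A toy behavioral model (our own sketch; the class, its names and the AP = '1' encoding are illustrative assumptions) helps clarify why the two-step protocol keeps both currents unidirectional:

    # NAND-like block: an SOT "erase" sets every MTJ to AP (encoded as '1'
    # here), then an STT "write" selectively switches chosen cells to P ('0').
    class NandLikeBlock:
        def __init__(self, size: int):
            self.cells = [1] * size

        def erase(self):
            # I_erase through the shared heavy-metal layer: all cells -> AP.
            self.cells = [1] * len(self.cells)

        def write(self, positions):
            # I_write through the enabled access transistors: cells -> P.
            # Erase and write currents each flow in only one direction,
            # which avoids the source-degradation problem.
            for p in positions:
                self.cells[p] = 0

        def read(self, position: int) -> int:
            # I_read sensed through the SA returns the stored bit.
            return self.cells[position]

    blk = NandLikeBlock(4)
    blk.erase()          # step 1 of a write
    blk.write([0, 2])    # step 2 of a write
    assert [blk.read(i) for i in range(4)] == [0, 1, 0, 1]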
3.2 Programmable switch and logic element
Hanyu et al. [36] present spintronic components for FPGAs. They present a programmable switch and a
programmable logic element. The switch, shown in Figure 11(a), is made up of MTJs to hold programmed
data, write circuits and an SA. The switch is programmed by writing to the MTJs. The SA reads the
programmed data ‘M’, which is maintained at location ‘Q’. The value at ‘Q’ is used to toggle the
NMOS transistor connected to the routing track. Through this method, the NMOS transistor behaves
as a programmable switch that can be used in an FPGA. Once programmed, the switch remains so even
when not actively powered, thereby reducing static power consumption.
Their programmable logic element is illustrated in Figure 11(b). The operation to be performed is
configured in this element by programming the configuration cells which are made up of three-terminal
DWM devices. Operands are provided through the select lines of the MUX and the result is sensed through
the sensing circuit. The design can be extended to N -bits through addition of more configuration cells and
MUXes to accommodate more select lines. The area of both the proposed devices is smaller than that of
their CMOS counterparts.
Fig. 11. (a) Programmable switch and (b) programmable logic element proposed by Hanyu et al. [36]
Hanyu et al. [37] present a spintronic LUT for use in FPGAs. The proposed LUT circuit is shown in
Figure 12. It comprises a CMOS logic tree made up of combinational circuits, a reference tree, an SA
and MTJs for holding the programmed data. This design has a number of CMOS transistors (in the logic
tree) and MTJs connected in series. To remove the impact of variation in their characteristics, the proposed
design employs redundant MTJs along each series path to control the operating point of the LUT. The
proposed LUT is non-volatile and thus, has no standby power consumption. Despite using redundant
MTJs, their LUT consumes lower area than the CMOS-only LUT since the MTJs share a single write
circuit.
Fig. 12. LUT design proposed by Hanyu et al. [37]
3.3 Multiplexer and encoder
Kumar et al. [39] present a 2×1 MUX and a 1×2 DEMUX which are designed using STT-MTJs. Figure 13(a)
shows the design of MUX. It is made up of two MTJs whose free layers are connected by an NMC. The
two inputs to the MUX are the currents I0 and I1. The directions of these currents denote the logic states which are stored in the MTJs. Once the orientation of the MTJs is set, the select-current Is is passed through the
select-line which determines the output represented as the new state of MTJ A. If a select-current Is flows
from S to ground, the value of MTJ B is transported to MTJ A due to communication between MTJs via
NMC. On flow of current in opposite direction, the state of MTJ A remains unchanged.
Fig. 13. (a) 2×1 MUX [39], where I0 and I1 are the inputs and S is the select line (b) 1×2 DEMUX, where S is the select line and I is the input, which is routed to either of the outputs D0 or D1; '0' represents the current input corresponding to logic '0'
Figure 13(b) shows the design of the DEMUX. The output is obtained as the logic state of either D0 or D1, depending on the select current Is, input current I and reference current Iref. The direction of Iref is kept constant and equal for both MTJs. The proposed design is an MTJ-only design and also demonstrates logic communication between MTJs through NMCs, making it possible to implement more complex devices. It is superior to a CMOS MUX in terms of area and energy-delay efficiency.
Deb et al. [63] propose two racetrack memory based encoder/decoder designs, one of which follows a
dynamically reconfigurable encoding scheme and the other follows a fixed encoding scheme. Both designs
require N racetracks to implement an N-bit design. In the first design, a control signal toggles the device
between read and write modes. The encoding scheme is stored in the form of binary data on the racetrack.
By changing this scheme, the encoding scheme can be changed dynamically. Data is written using the
read/write MTJ, and then, the DW is shifted so that the next domain is available for writing. In read
mode, the encoded output is obtained at a SA by sensing the read/write MTJ on the racetrack.
Their second design implements a fixed encoding scheme and lacks the write-circuit present in the
first design and thus, the encoding scheme is manually written into the racetracks and cannot be changed
dynamically. This design trades reconfigurability for lower area and better energy efficiency. The proposed
devices are meant for use in interconnects and buses, where suitable encoding schemes can reduce power
consumption. The results show that at each bit-width, the reconfigurable design consumes higher leakage
power than the CMOS-only implementation. Larger bit-width designs are slower due to increased size
of SA whereas the smaller designs have greater operating speeds as compared to CMOS-only designs.
However, the non-reconfigurable design has both lower energy consumption and higher operating speeds
as compared to CMOS-only designs for all bit-widths.
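Returning to the first, reconfigurable design, a small behavioral model (ours; the class and its layout are illustrative assumptions, not the authors' implementation) shows how the code table itself can live on the racetracks:

    # Reconfigurable encoder: the encoding scheme is stored as bits on the
    # racetracks (one track per output bit), so rewriting those bits
    # re-programs the encoder dynamically without any hardware change.
    class RacetrackEncoder:
        def __init__(self, n_bits: int):
            self.tracks = [[0] * (1 << n_bits) for _ in range(n_bits)]

        def program(self, code_table):
            # Write mode: shift each code word, bit by bit, onto the tracks.
            for idx, word in enumerate(code_table):
                for t in range(len(self.tracks)):
                    self.tracks[t][idx] = (word >> t) & 1

        def encode(self, value: int):
            # Read mode: the SA senses the domain aligned under the read head.
            return [t[value] for t in self.tracks]

    enc = RacetrackEncoder(2)
    enc.program([0b00, 0b11, 0b01, 0b10])  # an arbitrary 2-bit remapping
    assert enc.encode(1) == [1, 1]         # input 1 -> code word 0b11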
Huang et al. [57] present a racetrack memory based PIM accelerator to implement Boolean logic
functions and basic devices like adders. The basic design consists of three racetrack strips, two of which
hold operands while the third ‘reference racetrack’ holds ‘reference data’. Operations to be performed are
configured by programming the reference cells while the ‘reserved cells’ provide an extra cell so that data
is not lost while shifting. The use of binary data for configuration makes the proposed design easier to
program than the accelerators that use explicit voltage values since programming is equivalent to writing
into racetracks. Their design can implement AND, OR, NAND and NOR operations. Figure 14 shows
the circuit of a two input AND/OR gate. While circuit for both operations is the same, they differ in the
contents of the reference cells. When the reference data is '10', the circuit behaves as an AND gate, and when the data is '01', the circuit behaves as an OR gate. If the reference cells are removed, the circuit so formed acts as a buffer.
Their design incorporates a "shift-only" approach such that the data is written into the racetrack only once, while the rest of the operations are implemented using only bit-shifts. It also employs a "verify before shift" approach that stops shifting if the stored and input signals are of the same logic state. Both approaches reduce the number of write operations significantly.
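A minimal behavioral sketch (our simplification of the write-reduction idea, not the authors' circuit) shows how "verify before shift" skips redundant writes:

    # "Verify before shift": before shifting a new input bit under the write
    # head, compare it with the bit already stored there; if they match, the
    # shift/write cycle is skipped, saving write energy.
    def write_stream(bits):
        track, writes = [], 0
        for bit in bits:
            if track and track[-1] == bit:
                continue         # verify: head already holds this value
            track.append(bit)    # shift and write the new bit
            writes += 1
        return track, writes

    track, writes = write_stream([1, 1, 1, 0, 0, 1])
    assert writes == 3           # only value transitions cost a write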
Fig. 14. Configurations of the two-input gate proposed by Huang et al. [57] for achieving (a) AND and (b) OR operations; the reference cells hold '10' for AND and '01' for OR
Their design is optimized for reconfigurability and allows three methods to program the operations.
The first is initial configuration, similar to that of an FPGA, where all operations are configured before the
execution begins. Although simple to implement, this technique has limited flexibility. The second method
involves using shifting to change the values of the reference cells. This method allows higher flexibility. The
third method employs shifting the operands instead of the reference values. By virtue of using “shift-only”
and “verify before shifting” strategies, their technique incurs low latency and energy consumption. Of the
above-mentioned three methods of reconfiguration, the second method of dynamically reconfiguring reference-cell values is the fastest. The proposed accelerator offers a low-power, reconfigurable and high-speed implementation of logic circuits within memory.
3.4 Random number generator
Random number generators exploit the stochastic properties of spintronic circuits. Table 5 shows the
specific MTJ parameter which is stochastic in nature and is leveraged for designing stochastic circuits.
Table 5 also shows the parameters which are used as knobs to control stochastic nature of MTJs. We now
review works that propose spintronic random-number generators.
TABLE 5
Classification of stochastic computing architectures

The MTJ parameter which is stochastic
- Switching delay: [35]
- Switching probability: [31, 34]

Parameters varied to exploit stochastic properties of MTJs
- Write current: [31, 35]
- Operating current: [34]
- Variations in tunneling and free layer thickness: [34]
Naviner et al. [34] propose a random number generator using an STT-MTJ. The probability of switching
of an MTJ depends on the switching time, operating current and critical current. Deterministic switching
occurs when operating current is greater than the “critical current”, whereas probabilistic behavior is
obtained by keeping the operating current lower than the critical current. The variations in thickness of
oxide layer and free layer of MTJ result in non-uniform tunneling magnetoresistance ratio because of
which STT switching is intrinsically stochastic. Both of the above-mentioned phenomena are leveraged to
achieve stochastic behavior in an MTJ. The stochastic behavior is used to implement a 1-bit random
number generator. They show that the implementation of a polynomial function with their number
generator consumes much less area than that using binary signal. This demonstrates the possibility of
area optimization in stochastic logic circuits as compared to their binary counterparts.
Wang et al. [35] propose a “true random number generator” based on the STT-MTJ. The proposed
design exploits the fact that due to thermal fluctuations and magnetizations, the switching delay of an
MTJ is stochastic in nature. Figure 15 shows the architecture of their proposed random number generator.
The random-write circuit consists of two MTJs, one for the reference value and the other for number
generation. The SAs are equipped with generating circuitry that produces a write current according to the
random number generation probability.
Fig. 15. Block diagram of the random-number generator presented by Wang et al. [35]
To achieve a bitstream with ideal randomness, i.e., 50% probability of ones and zeros, a correction
circuit is used, which is composed of counters and comparators. The circuit works in three phases: (1) in the "reset phase", both MTJs are set to initial low-resistance values; (2) in the "writing phase", the current generated by the generating circuitry is used to write to the MTJ; (3) in the "sensing phase", the
random bit is sensed at the output of the SA. The generated bit-stream is passed to the correction circuit
which produces a control signal to tune the write current for the next cycle which helps in achieving ideal
probability of ones and zeros. The proposed design uses an intrinsic phenomenon instead of physical imperfections as the source of entropy. This reduces the amount of post-processing required to ensure
high reliability. Hence, it achieves high performance and tolerance to variability without additional area
overhead.
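The correction loop can be pictured with a small Python simulation (our toy model; the current-to-probability mapping and the gain are invented purely for illustration):

    # Feedback correction for a stochastic bit source: a counter tracks the
    # bias of the stream and tunes the write current so P(1) approaches 50%.
    import random

    random.seed(0)

    def corrected_trng(n_bits: int, current: float = 1.0, gain: float = 0.01):
        bits, ones = [], 0
        for i in range(1, n_bits + 1):
            p_one = min(max(0.3 + 0.4 * current, 0.0), 1.0)  # toy device model
            bit = 1 if random.random() < p_one else 0
            bits.append(bit)
            ones += bit
            # Comparator output: excess of ones over the ideal i/2 so far;
            # the feedback nudges the write current for the next cycle.
            current -= gain * (ones - i / 2)
        return bits

    stream = corrected_trng(10000)
    assert 0.45 < sum(stream) / len(stream) < 0.55   # near-ideal balance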
4 SPINTRONIC ARITHMETIC UNITS
In this section, we discuss various arithmetic units such as (precise) adder (Section 4.1), approximate adder
(Section 4.2), multiplier (Section 4.3), majority gate-based designs (Section 4.4) and LUT designs (Section
4.5). Table 6 classifies these works on several important parameters. We now review these works.
TABLE 6
A classification of arithmetic units

- Adder: [17, 21, 43, 44, 52, 55–57, 66]
- Multiplier: [56]
- Arithmetic logic unit: [38, 40–42]
- LUT: for multiplication [58, 65, 67], transcendental functions [26, 59], Boolean functions [58], used in FPGA [37]

Approximate computing approaches
- Approximate adder: ignoring carry-in (Ci) [33], inexact writing of one input [33], taking carry-out (Co) as the sum [66]
- Achieving transcendental functions with LUT: [26, 59]
- Fixed-point instead of floating-point: [60]
4.1 Adder designs
Roohi et al. [52] present a spintronic adder based on the SOT-MTJ. A 1-bit full adder is implemented through the formation of MJGs. It consists of three SOT-MTJs, an SA and write circuits. Two of the MTJs form 3-input MJGs while the third MTJ forms a 5-input MJG. The SOT-MTJ has lower latency and energy requirements
as compared to the STT-MTJ. The adder can be extended to N -bits and is intended for use in spintronic
ALUs. They show that the proposed 1-bit adder has lower static and dynamic power consumption and
smaller area than a CMOS-only adder. However, it is slower than the CMOS-only adder since the switching
latency of SOT-MTJs is higher than that of CMOS transistors.
Roohi et al. [17] present a full adder based on 3-terminal domain wall devices. The proposed design
uses MJGs to formulate the sum and output carry functions of the adder. For a 1-bit full adder, it makes
use of one 3-input MJG and a 5-input MJG. It has two SAs to read the outputs, one for sum and the
other for carry. The adder can work in two modes. If a low current is used, it functions with low power
consumption but also lower speeds. On the other hand, using a current of higher magnitude results in
higher operating speeds but also higher power consumption. Their proposed adder has lower area and
design complexity than the CMOS-only designs.
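Since both of these designs rest on the same majority-gate decomposition of the full adder, a quick exhaustive check (standard MJG synthesis, shown here as our own illustration) confirms it:

    # Full adder from majority gates: carry-out from a 3-input majority,
    # sum from a 5-input majority fed with the complemented carry twice.
    def maj(*inputs):
        return 1 if sum(inputs) > len(inputs) // 2 else 0

    for a in (0, 1):
        for b in (0, 1):
            for ci in (0, 1):
                co = maj(a, b, ci)                    # 3-input MJG
                s = maj(a, b, ci, 1 - co, 1 - co)     # 5-input MJG
                assert a + b + ci == 2 * co + s       # correct binary addition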
Huang et al. [18] present an adder based on racetrack memory. One racetrack is used per input or
output signal and a DEMUX is used for sharing the inputs for both “sum” and “carry” operations. The
1-bit adder is made of two circuits, one for computing sum as shown in Figure 16(a) and one for output
carry as shown in Figure 16(b). This design is extended to multiple-bits by replacing the MTJs used for
storing inputs with racetracks. The circuits for computing both sum and output carry are the same; the
only difference is in the values of reference voltage Rref used by them. The corresponding values of Rref
for the two circuits are given in Figure 16. By virtue of sharing of racetracks and demultiplexing strategies,
their adder consumes low area and energy.
Trinh et al. [55] propose a racetrack memory based "multi-bit adder". The building block of their multi-bit adder is a one-bit "full adder", which is shown in Figure 17(a). The carry is evaluated as a majority
function and implemented by connecting in series all the inputs in one branch and their complements in
another branch. The proposed adder is different from spintronic adders such as [17, 28, 44, 52] since all
the operands are stored in MTJs and no logic-tree like circuitry is involved.
Fig. 16. 1-bit full-adder proposed by Huang et al. [18]. (a) Implementation of output carry (b) Implementation of summation. The two circuits differ only in the reference resistance Rref, which the figure expresses as combinations of RP and RAP conditioned on the value of Co.
Fig. 17. (a) 1-bit full-adder proposed by Trinh et al. [55]. Sum and Co are the summation and output carry, respectively, obtained from the addition of A, B and Ci (b) Multi-bit racetrack adder [55]
They further extend their single-bit adder to a multi-bit adder by replacing individual MTJs with
racetracks, as shown in Figure 17(b). Multi-bit inputs and their complements are stored in separate
racetracks. At the positive edge of a synchronising clock pulse, the output carry is calculated and used as
an input for the next operation. At the negative edge, all racetracks are shifted by one bit to bring a new set
of inputs under the read and write heads. The racetracks allow storing multiple bits of data on the same
racetrack, thereby making it easier to perform multi-bit operations. The multi-bit inputs are written onto
racetrack once and shifted, thereby reducing the number of write operations. Their proposed multi-bit
adder consumes lower area and energy than a CMOS-only adder.
Comments: Unlike the adder proposed by Trinh et al. [55], the adder of Huang et al. [18] does not
store the complements of the inputs on racetracks. Secondly, it shares the inputs between carry and sum
circuits through demultiplexing. Due to these strategies, the adder proposed by Huang et al. provides
higher performance and energy efficiency than that proposed by Trinh et al.
An et al. [21] present a full adder based on all-spin logic. Their design utilizes graphene based LSVs
to form majority logic gates which, in turn, implement the addition operation. The sum and carry are
generated using conventional majority gate synthesis. The proposed adder can be extended to N bits through simple cascading, similar to the strategy used in a ripple-carry adder. However, this causes an increase
in the length of the NMC resulting in higher operational delays. This limitation can be mitigated through
the use of a carry look-ahead adder design. Compared to a CMOS adder, their proposed adder has higher
dynamic energy consumption but lower area and near-zero standby power consumption.
Matsunaga et al. [28] present an STT-MTJ based full adder. Its general architecture is illustrated in
Figure 18. It comprises an SA, a "dynamic current source" (DCS), a "logic tree" and two MTJs. The DCS cuts off the flow of steady current to reduce power dissipation. The "logic tree" consists of a CMOS circuit
that determines which operation is to be performed. By changing the logic tree, different operations such
as the Boolean functions AND and OR are implemented. Of the three operands required for addition, only
one is stored in the MTJ and the other two are provided dynamically during execution. Results show that
the proposed circuit consumes lower area and energy than a CMOS-only implementation. This is because
of the reduced number of current paths and reduced static power dissipation.
Fig. 18. Full adder proposed by Matsunaga et al. [28]
Deng et al. [44] propose a SOT-MTJ based full adder. The proposed adder circuit is based on a hybrid
CMOS-MTJ model consisting of two MTJs to hold complementary inputs, write circuits to write into the
MTJs, a logic tree circuit that determines the operation to be performed, and a sensing circuit to read out
the output. This structure bears similarity with that of the STT-MTJ based adder proposed by Matsunaga
et al. [28]. However, their SOT-MTJ based design has lower write latency than the STT-MTJ based design.
The proposed adder is capable of achieving sub-nanosecond switching with low write energy. Their design
has lower latency and energy than the conventional STT-MTJ based adders.
Comments: In the designs of Deng et al. [44] and Matsunaga et al. [28], the STT-MTJs or SOT-MTJs are
used only for storing operands, while most of the logic is implemented through the CMOS logic tree.
Lokesh et al. [38] present a spintronic ALU based on a full adder. Bitwise operations such as AND, OR, XOR and XNOR are implemented by modifying the full adder. Figure 19 shows the truth table of a full adder and subtractor. From this, it is observed that for the first four combinations, when Z=0, (1) both the sum and difference are equal to A XOR B, (2) the carry is equal to A AND B and (3) the borrow is equal to Ā AND B. Similarly, for the last four combinations, when Z=1, (1) both the sum and difference are equal to A XNOR B, (2) the carry is equal to A OR B and (3) the borrow is equal to Ā OR B. Hence, by modifying a full adder and subtractor circuit, two-input AND, OR, XOR and XNOR functions are obtained. Input Z can be used as (1) a control signal to implement AND, OR and XOR/XNOR, (2) the input carry in case of addition and (3) the input borrow in case of subtraction. Their design offers the possibility of developing completely non-volatile computing systems with zero start-up time.
Fig. 19. Table describing the use of full adder to perform AND, OR, XOR and XNOR operations in the technique of Lokesh et al.
[38]
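A minimal behavioral sketch (in Python, not the spintronic circuit) of this reuse of the adder datapath, assuming the standard full-adder equations; Z selects between the XOR/AND and XNOR/OR behaviors described above:

```python
def adder_as_logic(a, b, z):
    """One full-adder evaluation, reinterpreted as Boolean gates.
    With Z=0: Sum = A XOR B, Carry = A AND B.
    With Z=1: Sum = A XNOR B, Carry = A OR B."""
    s = a ^ b ^ z
    carry = (a & b) | (z & (a ^ b))
    return s, carry

for a in (0, 1):
    for b in (0, 1):
        s0, c0 = adder_as_logic(a, b, 0)
        assert s0 == a ^ b and c0 == a & b          # XOR / AND
        s1, c1 = adder_as_logic(a, b, 1)
        assert s1 == 1 - (a ^ b) and c1 == a | b    # XNOR / OR
```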
Patil et al. [41] present a technique for efficiently combining spintronic logic units into larger blocks.
Using this technique they propose a spintronic ALU based on the design of a “full adder”. The output of a
spintronic logic circuit is determined by the state of every element in series. As a result, spintronic circuits
cannot be combined in the same manner as CMOS based logic circuits. For example, on designing a
1-bit adder-subtractor by directly combining a spintronic 1-bit subtractor and a 1-bit adder, two problems
are observed. The first is the need for control inputs to distinguish between addition and subtraction. The
second is the necessity of a switching circuit to select carry or borrow while extending the design to N-bits.
To mitigate these problems, they propose a “neutralization” technique which involves switching-off
a certain part of the combined circuit so that the desired result is obtained. The neutralization technique
allows building complex spintronic logic circuits. Three methods are proposed to achieve neutralization
and are illustrated in Figure 20: (1) using control inputs on the MTJ. This strategy is valid only for MTJs
having a single input. Two control signals X and Y are used to control the state of the MTJ. For example, if
X and Y are equal, then the state of R1 is the same as X; otherwise, R1 takes the state of input A. (2) Using an
NMC to change the state of an MTJ so that both terminals of an SA have the same resistance. For example,
the state of MTJ R1 is made equal to the state of MTJ R2 by transferring the state from R3 using the NMC.
Fig. 20. Illustration of different techniques of ‘neutralization’ proposed by Patil et al. [41]. (a) Neutralization using control signals (b)
Neutralization using STT (c) Neutralization using logic
(3) The third method is based on specific observations which apply only to certain operations. For
example, for achieving the XOR operation, it can be noted that if the MTJs R1 and R2 attached to the
terminals of an SA have different states, the output is a ‘1’, but if they have the same states, the output is ‘0’.
Their proposed ALU uses these three methods to perform addition, subtraction and Boolean operations.
The ALU can be extended to N-bits. It consumes lower area and energy than a CMOS-only ALU and
achieves comparable speed; however, it requires more control inputs.
Ren et al. [43] present the energy analysis of a 1-bit STT-MTJ based adder circuit. They compare the
MTJ adder with static and dynamic CMOS designs. The MTJ based adder is similar in design to the logic
tree-based adders presented by Matsunaga et al. [28]. One of the inputs to the adder is stored in the MTJ.
It is noteworthy that the static CMOS adder requires the least number of transistors while the MTJ-based
adder requires the highest number of transistors. Simulations show that the dynamic-CMOS and MTJ-based
adders have higher energy efficiency than the static-CMOS adder. However, the dynamic-CMOS
adder provides a superior “energy-delay tradeoff” than the MTJ-based adder since MTJ switching consumes
high energy.
4.2 Approximate adder designs
Angizi et al. [66] present a spintronic adder circuit designed with 3-terminal DWM devices. The
proposed adder can perform both approximate and accurate computations. The DWM devices are used
to implement MJGs as shown in Figure 21, which in turn form the adder circuit. While the accurate adder
is implemented with conventional MJG synthesis, the approximate adder computes Co = Majority(A, B, Ci)
and approximates Sum as the complement of Co. Table 7 shows the truth table of their adder. Clearly,
while Sum is wrong for two out of eight cases (shown by ✗), Co is correct for all cases.
TABLE 7
Truth table of the adder proposed by Angizi et al. [66].

| A | B | Ci | Accurate Co | Accurate Sum | Approximate Co | Approximate Sum |
| 0 | 0 | 0 | 0 | 0 | 0 | 1 ✗ |
| 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 0 | 1 | 1 | 1 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 | 0 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 0 ✗ |
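A short Python check of Table 7 (a behavioral sketch, not the MJG circuit): the approximate carry is taken directly from the majority gate and the approximate sum is its complement, which fails only on the all-zeros and all-ones input rows:

```python
from itertools import product

def maj(a, b, c):
    """3-input majority gate."""
    return (a & b) | (b & c) | (a & c)

errors = 0
for a, b, ci in product((0, 1), repeat=3):
    co_exact, s_exact = maj(a, b, ci), a ^ b ^ ci
    co_apx = maj(a, b, ci)          # carry taken directly from the MJG
    s_apx = 1 - co_apx              # sum approximated as the complement of Co
    assert co_apx == co_exact       # Co is exact for all 8 input combinations
    errors += (s_apx != s_exact)
print(errors)  # -> 2 (the 000 and 111 rows marked with a cross in Table 7)
```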
Fig. 21. 3-input MJG proposed by Angizi et al. [66]
Both the approximate and accurate computation modes have nearly equal delays, but this delay is
significantly greater than that of a CMOS-only adder. By using pipelining, the delay can be slightly
reduced. The approximate mode consumes less power than the accurate mode, while in both modes, the
proposed adder has lower energy consumption than its CMOS-only counterpart. They illustrate the use
of their adder for evaluating the discrete cosine transform on images. The LSBs of pixel-values are
processed in approximate mode while the MSBs are processed in accurate mode. Changing the number of
LSBs and MSBs that use approximate and accurate processing provides a varying degree of approximation
[73]. They show that for all levels of approximation, an implementation of the discrete cosine transform on
their proposed platform consumes lower energy than that on CPU and CMOS-only platforms.
Comments: The adders proposed by Angizi et al. [66] and Roohi et al. [17] both use 3-terminal DW
motion devices and rely on MJGs for functioning. However, the former design is able to perform both
approximate and accurate computations which makes it suitable for error-tolerant applications such as
image processing.
Cai et al. [33] present two approximate full adders based on STT-MTJs. Their first design is implemented
using reduced logic complexity and is shown in Figure 22. Let A, B and Ci be the inputs and
Sum and Co be the sum and output carry, respectively. A is provided during computation while the input B
is written to the MTJ. Their first adder ignores the input carry Ci while calculating the sum; thus, the sum
is computed as Sum = A XOR B. The output carry Co is computed accurately without ignoring Ci.
Their second design, which is also shown in Figure 22, can compute both accurate and approximate
outputs. It operates on inexact writing of input B by providing an insufficient writing current that is less
than the “critical current” of the MTJ. Unlike their first design, this design does not ignore Ci while
evaluating the sum. To facilitate comparison, a parameter termed “error distance” is used which provides
bit-by-bit comparison between the approximate output (x) and the accurate output (y) for all possible
combinations of adder inputs. The error distance is computed as $ED(x, y) = \left| \sum_p x[p]\,2^p - \sum_q y[q]\,2^q \right|$,
where p and q are indices of the bits of x and y, respectively. They show that the error distances of the first
and second adders are 4 and 6, respectively, and thus, the first adder is more accurate. Further, both the
approximate adders consume lower dynamic and leakage powers than CMOS approximate adders. Among the two,
Fig. 22. Design of the second adder proposed by Cai et al. [33] implemented using low write current. The first adder, which is based
on reduced logic complexity, is implemented by excluding the circuitry present inside the dotted lines.
the second approximate adder, which operates on inexact writing, consumes lower energy but has much
higher delay than the adder operating on reduced logic.
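The error-distance metric can be illustrated with a short Python sketch, assuming the per-combination distances are accumulated over all input combinations; the first adder (which ignores Ci in the sum) indeed yields the reported value of 4:

```python
from itertools import product

def error_distance(approx_adder, exact_adder, n_inputs=3):
    """ED(x, y) = | sum_p x[p]*2^p - sum_q y[q]*2^q |, accumulated over all
    input combinations; x = approximate output bits, y = exact output bits."""
    ed = 0
    for bits in product((0, 1), repeat=n_inputs):
        x = approx_adder(*bits)   # output as a (Sum, Co) tuple, LSB first
        y = exact_adder(*bits)
        ed += abs(sum(xb << p for p, xb in enumerate(x)) -
                  sum(yb << q for q, yb in enumerate(y)))
    return ed

exact = lambda a, b, ci: (a ^ b ^ ci, (a & b) | (ci & (a ^ b)))
first = lambda a, b, ci: (a ^ b, (a & b) | (ci & (a ^ b)))  # Sum ignores Ci
print(error_distance(first, exact))  # -> 4, matching the reported value
```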
4.3 Multiplier designs
Luo et al. [56] present a DWM-based multiplier. The proposed design is based on radix-4 Booth
multiplication since it provides highly efficient binary multiplication. Booth multiplication works by
calculating partial products in parallel and then summing them [74]. For this purpose, the multiplier bits
are divided into groups of three bits such that adjacent groups overlap by one bit. The partial product is
generated based on the radix-4 encoding scheme given in Table 8 and an illustration is given in Figure 23.
The multiplicand is stored on a single racetrack strip while each bit of the multiplier is stored on a
different strip so that the bits can be accessed concurrently to provide a high degree of parallelism.
Fig. 23. (a) Illustration of Booth multiplication and (b) pipelined addition used in the adder design by Luo et al. [56]
The proposed design uses a pipelined approach for addition where the partial products are stored
in strips with multiple access ports. This is shown in Figure 23(b). Each pair of strips is associated with
three adders, two of which are located at the ends of the racetrack and one in the center. The adders
on the left and right each take two operands and their results are summed by the adder in the middle.
This implements addition of partial products with a minimal number of racetrack strips. By virtue of using
the Booth multiplication algorithm, parallel fetching of the multiplier and subsequent pipelined addition,
their design performs high-speed in-memory multiplication with minimal hardware.
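A Python sketch of radix-4 Booth multiplication using the encoding of Table 8 (a functional model of the algorithm, not of the racetrack datapath); it reproduces the −73 × 90 example of Figure 23:

```python
def booth_radix4_multiply(multiplicand, multiplier, width=8):
    """Radix-4 Booth multiplication following the encoding of Table 8: the
    multiplier is scanned in overlapping 3-bit groups (X, Y, Z) and each
    group selects a multiplication factor in {0, +/-1, +/-2}."""
    factor = {(0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
              (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0}
    m = multiplier & ((1 << width) - 1)          # two's-complement bit pattern
    bits = [(m >> i) & 1 for i in range(width)]
    result = 0
    for i in range(0, width, 2):                 # one group per two bits
        x, y, z = bits[i + 1], bits[i], (bits[i - 1] if i > 0 else 0)
        result += (factor[(x, y, z)] * multiplicand) << i   # partial product
    return result

print(booth_radix4_multiply(-73, 90))   # -> -6570, the example of Figure 23
```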
TABLE 8
Encoding scheme for calculating the partial product [56]. X, Y and Z are the bits of the multiplier, divided into overlapping groups of three bits. The partial product is obtained by multiplying the “multiplication factor” with the multiplicand.

| X | Y | Z | Multiplication factor |
| 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 1 |
| 0 | 1 | 0 | 1 |
| 0 | 1 | 1 | 2 |
| 1 | 0 | 0 | -2 |
| 1 | 0 | 1 | -1 |
| 1 | 1 | 0 | -1 |
| 1 | 1 | 1 | 0 |

4.4 Majority gate based designs
Butzen et al. [30] propose an STT-MTJ based spintronic “majority voting” circuit which is used in “triple
modular redundant” architectures for achieving fault-tolerance. It consists of three MTJs to store the
operands, an SA and writing circuits. Initially, the input voltages are compared with the reference
voltage, which results in a current that writes the inputs to the MTJs. Next, the stored values are read and the
majority function is evaluated by the SA through comparison with an empirically determined reference
value. The proposed design has low power consumption since the MTJs have near-zero standby
power dissipation. Due to the low read latency of the MTJs, their design has high performance. A fault in
writing to one of the MTJs is tolerated since the other two inputs provide the correct output while evaluating
the majority function. This, combined with the process variation tolerance of the devices, makes the proposed
circuit highly reliable.
An et al. [42] present two all-spin logic based ALU designs. Both designs use 5-input MJGs as
fundamental blocks, and both can perform the same set of add, subtract, increment, decrement
and Boolean operations. The first design, shown in Figure 24(a), is constructed by all-spin logic based
circuit design. It uses three MJGs, two control signals and ten select lines for the MUX. The second design is
constructed by realizing the basic functions as a combination of a full adder and a multiplexer. As shown
in Figure 24(b), this design uses 14 MJGs, three control signals and only two select lines. Results show
that of the two designs, the former is superior in terms of energy efficiency, area and operational
speed since it requires far fewer MJGs. However, configuring the first design is a challenging and
tedious task since it has ten select lines on the MUX. Also, constructing a control unit to integrate the first
design into a computing system leads to high design complexity.
Yao et al. [40] propose a spintronic ALU that can perform addition, subtraction and basic Boolean logic
operations. It is built on a three-input MTJ element as shown in Figure 25(a). All three input currents A, B
and Z have magnitudes greater than the “critical switching current”. The direction of each current denotes
the high ‘1’ or low ‘0’ logic state. The sum of the three currents is responsible for switching the MTJ. The
MTJ state (M) is the output of the Boolean function M = A · B + (A + B) · Z. By using Z as a control signal
and grounding the top electrodes, the MTJ can be used to implement AND (Z=0) and OR (Z=1). Three
such MTJs are combined to form a “fundamental logic unit” of the ALU as illustrated in Figure 25(b). The
three MTJs connect through two NMCs that act as media for transferring logic states between MTJs. By
activating control signals M1 and M2, the input is communicated to the output via the NMCs.
By combining the fundamental units, their proposed ALU can perform addition, subtraction and
Boolean operations. The ALU operates in three steps: (1) the inputs are programmed, (2) the control signals are
activated for the required operation and (3) the output is read by the SA.
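The current-summing behavior of this three-input element can be checked with a small behavioral model (a sketch; encoding each current direction as ±1 is an abstraction of the physical currents):

```python
def mtj_state(a, b, z):
    """Model the three input currents as +1 (logic '1') or -1 (logic '0');
    the MTJ settles according to the sign of their sum."""
    total = sum(2 * v - 1 for v in (a, b, z))
    return 1 if total > 0 else 0

for a in (0, 1):
    for b in (0, 1):
        for z in (0, 1):
            expected = (a & b) | ((a | b) & z)   # M = A.B + (A+B).Z
            assert mtj_state(a, b, z) == expected
        assert mtj_state(a, b, 0) == (a & b)     # Z = 0 -> AND
        assert mtj_state(a, b, 1) == (a | b)     # Z = 1 -> OR
```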
Comment: The all-spin ALUs, such as the one proposed by An et al. [42], rely only on spin-currents and
not on charge-based currents. The ALUs proposed by Patil et al. [41], Yao et al. [40] and Lokesh et al. [38] are
hybrid spintronic designs since they make use of both spin-based and charge-based currents. Also, the design of An
et al. [42] is capable of performing both 3-input and 2-input AND/NAND, OR/NOR and XOR/XNOR
while the other ALUs [38, 40, 41] are restricted to 2-input operations.
Fig. 24. (a) ALU constructed using all-spin logic circuit design method (b) ALU constructed using majority gate synthesis as
proposed by An et al. [42]. Ipos , Ineg and Izero denote spin currents with positive, negative and zero spins respectively
Fig. 25. (a) MTJ proposed by Yao et al. [40] that performs Output = A · B + (A + B) · C (b) Fundamental logic unit of ALU [40]
4.5 LUT designs
Yu et al. [59] propose a DW nanowire based NN accelerator. They map an “extreme learning machine based
super-resolution” (ELMSR) algorithm to their accelerator. The computations performed most
frequently by this algorithm are weighted summation and the sigmoid function. To facilitate these two
computations, the accelerator is equipped with two types of PIM units, XOR units and LUTs, both of which
are designed with DWM nanowires [58]. Weighted summation is implemented by adders and multipliers
composed of the XOR units, while the sigmoid function is implemented with the help of a nanowire LUT
(see the sketch below). The overall PIM architecture has an H-tree structure similar to that used by Angizi
et al. [64]. PIM logic elements are distributed and integrated with memory units so as to reduce
communication with the external processor and provide thread-level parallelism. Their NN architecture
achieves lower energy consumption and better throughput than processing on a CPU.
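The idea of evaluating sigmoid through a lookup table can be sketched in Python as follows (a functional model; the entry count and input range are illustrative choices, not parameters from [59]):

```python
import math

def build_sigmoid_lut(n_entries=64, x_min=-8.0, x_max=8.0):
    """Precompute sigmoid samples, as the programmed data segment of a
    nanowire LUT would store them."""
    step = (x_max - x_min) / (n_entries - 1)
    table = [1.0 / (1.0 + math.exp(-(x_min + i * step)))
             for i in range(n_entries)]
    return table, x_min, step

def sigmoid_lookup(table, x_min, step, x):
    """Quantize the input to the nearest stored entry and 'sense' it."""
    idx = max(0, min(len(table) - 1, round((x - x_min) / step)))
    return table[idx]

table, x0, dx = build_sigmoid_lut()
print(sigmoid_lookup(table, x0, dx, 1.3))   # LUT approximation of sigmoid(1.3)
print(1.0 / (1.0 + math.exp(-1.3)))         # exact value, for comparison
```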
Wang et al. [58] present a DW nanowire based PIM accelerator that performs multiplication for big-data
applications. The proposed architecture makes use of LUTs to implement Boolean logic functions,
while DW shifting is used to directly implement XOR and bit-shift operations. The structure of the XOR
logic unit is shown in Figure 26(a). The operands are stored in separate nanowires. The operation is
performed on a special read-only cell that has a structure similar to an STT-MTJ, but one in which both
ferromagnetic layers are free. Each of the two operands (A and B) is connected to one of the ferromagnetic
layers of the read-only cell and is shifted into it. The resulting relative orientation of the free layers gives
A XOR B. An LUT is implemented on a single nanowire by dividing it into two segments as shown in
Figure 26(b): a “data segment” stores the programming of the LUT while a “reserved segment” provides
extra space so that the data is not lost while shifting. The read/write head acts as the sensing port.
Fig. 26. (a) XOR logic unit (b) LUT design proposed by Wang et al. [58]
An array of LUTs along with row and column decoders works as a PIM platform. This is shown
in Figure 27. Multiplication for big-data applications is implemented using the “MapReduce” technique in
which a single multiplication of large vectors is broken down into multiplications of smaller vectors,
and the intermediate results are combined to give the final result. For this purpose, the LUTs in the array
are divided into three groups: one group of LUTs is configured for multiplication, another group for
Boolean logic operations, while a third group is configured to act as a controller.
Fig. 27. LUT array proposed by Wang et al. [58]
The mapping of multiplication is illustrated in Figure 28. The multiplication workload is compiled
into a list of tasks and saved in memory to facilitate concurrent operations. The matrix M is broken into
units comprising rows only, such that every task requires only a “dot-product” of vectors. The controllers
fetch tasks and the corresponding data from the queue and dispatch them to the mappers. This is an
iterative process that continues until the queue is empty. Each result of a mapper is examined and combined
with related results by the reducer until no further combination is possible. The final result is written back
to memory. Their accelerator has higher latency but greater throughput compared to an implementation
on a multicore platform. Also, it achieves higher density and lower power consumption by virtue of the
non-volatile nature of the nanowires.
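A minimal Python sketch of this MapReduce-style decomposition (a functional model of the task flow, not of the LUT hardware): the matrix is compiled into row-level tasks, each mapper computes one dot product, and the reducer combines the partial results:

```python
from collections import deque

def mapreduce_matvec(M, v):
    """Decompose a matrix-vector multiplication into row tasks (map) and
    combine the per-row partial results into the final vector (reduce)."""
    task_queue = deque((i, row) for i, row in enumerate(M))  # compiled task list
    partials = []
    while task_queue:                       # controller dispatches until empty
        i, row = task_queue.popleft()
        partials.append((i, sum(a * b for a, b in zip(row, v))))  # mapper
    result = [0] * len(M)
    for i, val in partials:                 # reducer combines related results
        result[i] = val
    return result

print(mapreduce_matvec([[1, 2], [3, 4]], [5, 6]))  # -> [17, 39]
```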
Fig. 28. Mapping of multiplication to the PIM accelerator proposed by Wang et al. [58]

5 Spintronic Accelerators for Various Applications

In this section, we review spintronic architectures in terms of their application domains, such as
neuromorphic computing (Section 5.1), image processing (Section 5.2), data encryption (Section 5.3) and
associative computing (Section 5.4).
5.1 Neuromorphic computing
In neuromorphic architectures, synapses work as the memory element and store weights of various inputs.
The neurons process synaptic inputs to generate the output. Table 9 classifies the works based on their
proposed NN architectures, neuron and synapse models and strategies for performing thresholding. Table
9 also shows the benchmarks used for evaluation by different works.
TABLE 9
A classification of neuromorphic computing architectures

NN architecture:
- Spiking neural network: [2, 31]
- ANN/CNN: [2, 3, 32, 47, 48, 60–62, 68]
- Binary CNN: [47, 48]
- Extreme-learning machine: [59]
Neuron model:
- MTJ neuron: [2]
- LSV neuron: [2, 3]
- Domain wall motion neuron: [2, 32, 61]
Synapse model:
- MTJ synapse: [31]
- Domain wall motion synapse: [2, 3]
- Domain wall motion neuron with MCA synapse: [32, 61]
Thresholding performed by:
- Bennett clocking: [3]
- Spintronic comparator based on LSV and LUT: [68]
- Spin torque switches: [61]
Benchmarks used for evaluating NN accelerators:
- Character/digit recognition: [3, 32, 47, 62, 68]
- Object detection/classification: [31, 48, 60, 68]
- Face/edge detection: [61, 68]
- Motion detection: [61]
Sharad et al. [3] propose a spintronic ANN accelerator that is based on a nano-magnetic neuron
and a domain wall motion neural synapse. The neuron model utilizes “lateral spin valves” to form
“majority logic” gates. The input currents flowing into the MJGs get spin-polarized by their respective
magnets. These currents have three “spin components” (one each along x, y, and z-axes) and one “charge
component”. The charge component flows to the ground while the spin-components induce STT effect
and their resultant effect is responsible for switching of the output magnet.
A 3-terminal DW device is used to model the synapse. When a current flows vertically into the channel
via the DW magnet, its degree of spin-polarization changes in proportion to the displacement of the DW
from the center of the magnet. These variations in spin polarization are used to implement programmable
weights. The extreme positions of a small DWM device are used as binary weights while longer nanowires
are used for non-binary weights. In order to reduce the injection current for synapses, “Bennett clocking”
is used [75] whereby the magnet is switched to a meta-stable state and from this, the magnet can be
transitioned to either of the stable states with minimal current.
The structure of the neuron with DW synapse is shown in Figure 29(a). A preset current forces the firing
magnet (free layer of neuron MTJ) into its meta-stable configuration. Once this current is removed, the
magnet orients itself based on the spin components of the input currents, and thus, the firing-MTJ acquires
parallel or anti-parallel state. Weighted sum is computed as summation of “spin-polarized currents” in
the metallic channel while thresholding is achieved through “Bennett clocking”. The number of possible
input synapses is limited since the spin-polarizing strength of current decays with the length of the NMC.
Fig. 29. (a) LSV neuron with DW synapse proposed by Sharad et al. [3]. (b) Charge based signaling [3]
The limitation of the proposed spin neuron-synapse units is that they cannot be networked through
spin-signaling because nano-magnetic channels have very low spin-diffusion lengths. Hence, their proposed ANN uses CMOS-based charge-signaling, as shown in Figure 29(b). One end of a differential latch
is connected to a reference while the other is connected to the firing-MTJ of a neuron. The output through
the transistors provides input-currents to other fan-out neurons. For “character recognition” benchmark,
their proposed ANN accelerator consumes lower power than both analog and digital CMOS ANN
accelerators. The area of their proposed design is lower than that of digital CMOS ANN and comparable
to that of an analog CMOS ANN.
Sengupta et al. [2] present implementation of artificial neurons and neural synapses with spintronic
devices. They model three different types of neurons: step, non-step and spiking neurons. For step neurons,
they present three implementations. The first is an STT-MTJ based implementation. The step functioning
of the neuron is directly mapped to switching of the MTJ. For such a neuron, higher operating voltages
are needed. This, combined with the large critical current, leads to high energy consumption.
The second implementation is based on LSV, as shown in Figure 30(a). Here, magnets m2-m4 are
input magnets. The “excitatory” and “inhibitory” currents through m2 and m3 get “spin-polarized”
corresponding to the polarity of the magnets. The two spin-polarized currents exert opposing STT-effect
on output magnet m1 whose final state is determined by the difference in magnitude between the two
currents. The preset current reduces critical current for switching. The third implementation is based on a
SOT-MTJ. The step operation is performed in two phases. First, a current is sent via the heavy metal to orient
the “free layer” along the “hard axis”. Then, an input “synaptic current” is passed through the pinned
layer, which leads to switching of the MTJ.
The “non-step neuron” is based on a 3-terminal SOT-driven DWM device which is shown in Figure
30(b). During the write operation, a “synaptic current” across T2 and T3 leads to displacement of the DW
in proportion to the current magnitude. During the read operation, T1 and T3 are enabled and an “axon
circuit” is used to provide an output current. They further present an “integrate-fire spiking neuron”
which is shown in Figure 30(c). It is implemented using the same DWM device used for the “non-step”
neuron. In a time-interval, the DW is displaced in proportion to the magnitude of synaptic current. It
continues to accumulate input pulses in the form of DW displacements until it has reached the opposite
Fig. 30. (a) LSV based step neuron proposed by Sengupta et al. [2] (b) 3-terminal DWM device (c) “Integrate-fire spiking” neuron
end. The read circuitry present at the end detects it and utilizes the axon circuit to generate a spike. The
displacement of the domain wall determines the resistance across T1 and T3. This property is exploited to
model a programmable weight or neural synapse using the same device.
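The integrate-fire behavior can be summarized with a small behavioral model (a sketch assuming a linear displacement response; the nanowire length and current values are illustrative, not device parameters from [2]):

```python
class DomainWallIFNeuron:
    """Behavioral sketch of the integrate-fire neuron: synaptic current
    pulses displace the domain wall; a spike is emitted when the wall
    reaches the far end, after which the wall is reset."""
    def __init__(self, length=10.0, displacement_per_unit_current=1.0):
        self.length = length
        self.k = displacement_per_unit_current   # assumed linear response
        self.position = 0.0

    def step(self, synaptic_current):
        self.position += self.k * synaptic_current   # integrate
        if self.position >= self.length:             # wall reached far end
            self.position = 0.0                      # reset
            return 1                                 # fire (axon emits spike)
        return 0

neuron = DomainWallIFNeuron(length=10.0)
spikes = [neuron.step(3.0) for _ in range(10)]
print(spikes)   # fires every few time steps as displacement accumulates
```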
They extend the idea of spintronic neurons to a spintronic neuromorphic processing architecture. It is
based on the 3-terminal DWM and has a crossbar structure. A spiking neural network is implemented
using the integrate-fire spiking neuron and domain wall neural synapse mentioned above. Results show
that the proposed non-step and spiking neurons consume lower area and energy than their CMOS
counterparts. As for the spiking neural network, both the neuron and the synapse are modeled using the same device.
The PIM capability of their model makes it superior to the CMOS implementation.
Fan et al. [32] propose a “soft-limiting non-linear neuron”. It is based on a 4-terminal STT driven DWM
device, which is shown in Figure 31. It has two current paths, lateral and vertical. One port along the
lateral path is maintained at a constant voltage while the other is used as a programming port. The neuron
works in three phases. First, the total synaptic current or weighted summation of inputs is supplied to
the programming port. This results in a change in position of the DW and the displacement varies with
the magnitude, direction and duration of applied programming current. In the second phase, the vertical
current is passed and a “voltage divider circuit” is utilized for sensing the state of MTJ. In the third phase,
the DW is set to its initial position.
Fig. 31. 4-terminal DWM device used by Fan et al. [32] to model “soft-limiting non-linear neuron”
They further present an ANN architecture consisting of an array of the proposed neurons coupled
with an MCA which serves as the synapse. In this architecture, the functions of an “axon” are performed
by transistors. The proposed soft-limiting neurons offer a continuous change in resistance corresponding
to the inputs, resulting in improved accuracy and reduced network complexity when compared to ANNs
implemented using hard-limiting neurons. Compared to hard-limiting neurons, their proposed soft-limiting
neurons also lead to a smaller area for the hidden layers of ANN models. The proposed neuron consumes
significantly lower energy than CMOS-ANN implementations.
Vincent et al. [31] present an STT-MTJ based stochastic neural synapse. Switching of an STT-MTJ
is a stochastic process, and the switching probability depends on the magnitude and duration of the
switching current. By keeping the magnitude of the write current lower than the “critical current”,
the stochastic nature of the STT-MTJ is exploited. MTJs are organized in a crossbar structure, each of them
connecting an input neuron to an output neuron. When an input neuron spikes, currents are set up in
the crossbar array and reach the output neurons. Through the use of SAs, the orientation or logic state of
the synapse is determined. Anti-parallel oriented MTJs act as synapses with ‘zero’ weight while parallel
oriented ones have the weight ‘one’. Firing of an output neuron leads to a “voltage pulse” on the crossbar
array. This voltage pulse results in a switching probability of the MTJ synapse which is determined by
the synapse’s activity in the previous time-interval. This probabilistic switching mechanism is used to
implement synaptic learning through a “spike timing dependent plasticity” model. Results show that the
use of controlled write current results in low energy consumption and their proposed design is robust to
device variations.
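The probabilistic learning rule can be sketched as follows (a behavioral model; the probabilities p_pot and p_dep are illustrative placeholders, not values from [31]):

```python
import random

def stdp_update(weight, pre_active, p_pot=0.1, p_dep=0.05):
    """Probabilistic update of one binary MTJ synapse when the post-neuron
    fires: a pulse below the critical current switches the junction only
    with some probability; synapses whose input was recently active tend
    to be potentiated, the others depressed."""
    if pre_active:                                        # input spiked recently
        return 1 if random.random() < p_pot else weight   # maybe potentiate
    return 0 if random.random() < p_dep else weight       # maybe depress

random.seed(0)
w, history = 0, []
for t in range(20):
    w = stdp_update(w, pre_active=(t % 3 == 0))   # post-neuron firing events
    history.append(w)
print(history)   # weight drifts toward the input's correlation pattern
```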
Angizi et al. [47] propose a SOT-RAM based accelerator for low bit-width CNNs. The design performs
convolution in a bit-wise manner on binary inputs and corresponding weights using the PIM approach.
Figure 32(a) shows the architecture of their accelerator. It consists of an “image bank”, a “kernel bank”,
a “convolution engine” and a “digital processing unit” (DPU). The input vectors are mapped onto the
“image bank” and the weights are mapped onto the “kernel bank”. These vectors are quantized by the
DPU. The dot product is evaluated as a combination of bit-count and AND operations. The AND operation
is performed in-memory, in the SOT-RAM sub-array, using a reference-voltage based approach similar to
that used by Fan et al. [45]. The bit-counter counts the number of ones in the resultant vectors of the AND
operation and passes it to the bit-shifter, which left-shifts the vectors. The result so obtained is the partial
sum of the corresponding sub-array. The partial sums of all sub-arrays involved are combined to obtain
the final result. This result is passed to the DPU for batch normalization and evaluation of the activation
function.
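The AND/bit-count/bit-shift evaluation of a dot product can be sketched in Python (a functional model assuming inputs packed bitwise into integers, with one bit-plane per quantization bit):

```python
def binary_dot(x_bits, w_bits):
    """Bitwise dot product of binarized vectors packed into integers:
    an AND followed by a bit-count, as in the convolution engine."""
    return bin(x_bits & w_bits).count("1")

def quantized_dot(x_planes, w_bits):
    """Low bit-width inputs are processed one bit-plane at a time; the
    bit-counts are left-shifted by the plane index and accumulated."""
    return sum(binary_dot(plane, w_bits) << p
               for p, plane in enumerate(x_planes))

x_lsb, x_msb = 0b0011, 0b0101    # bit-planes of 2-bit inputs [3, 1, 2, 0]
w = 0b1011                       # binary weights [1, 1, 0, 1]
print(quantized_dot([x_lsb, x_msb], w))   # -> 4 = 3*1 + 1*1 + 2*0 + 0*1
```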
Fig. 32. Design of the accelerators proposed by (a) Angizi et al. [47] and (b) Fan et al. [48]
The AND, bit-count and bit-shift operations are performed in-memory in a fast and parallelized manner,
thereby accelerating the MAC operations of the CNN. Running a binary-weight AlexNet on the proposed
architecture shows that it consumes lower energy than an RRAM based implementation.
Fan et al. [48] propose an accelerator for a binary CNN based on the SOT-RAM PIM platform [45, 50].
The design of the accelerator is depicted in Figure 32(b). Their accelerator is modeled after the XNOR-NET
[76] architecture, which is a binarized AlexNet network. The input image and weights are mapped to the
image bank and kernel bank, respectively. The weight tensors are converted from {-1, +1} to {0, 1}.
Thus, convolution is achieved through bitwise AND and bit-count operations [77]. Bitwise
AND is performed in the memory itself using the SOT-RAM based “convolution engine”. This is followed
by the “bit-count” operation. The DPU accompanying each block performs other computations such as
batch-normalization, scaling and pooling. The proposed accelerator achieves acceleration due to its ability
to perform convolution within memory itself. Movement of data in and out of memory is greatly reduced.
Results show that the accelerator consumes lower area and energy than an RRAM based accelerator.
Comments: The accelerator proposed by Angizi et al. [47] is similar in design and working to the
accelerator presented by Fan et al. [48]. However, the former design utilizes the DPU to quantize inputs,
while the latter accepts binarized inputs. Also, the in-memory bit-shift operation used by Angizi et al. is not
used in the accelerator proposed by Fan et al., and the DPUs in the two accelerators are equipped to perform
different functions.
Chung et al. [60] present a racetrack based accelerator for the convolution layer of CNNs. In CNNs,
a large fraction of computations are performed in the convolution layers and the primary kernel of
convolution layers is matrix multiplication. Thus, by accelerating matrix multiplication, the performance
of CNNs can be greatly increased. The accelerator proposed by Chung et al. [60] leverages PIM approach
to compute dot-products. It consists of nanowire input registers, a racetrack array, two accumulators and
adders. The weights are stored in the nanowires while the inputs are provided dynamically through the
transistors. The racetrack sub-array performs the dot-product while the final result of multiplication is
obtained after passing the partial dot-product through the ADC sub-array. Figure 33 shows a comparison
between MCA, SRAM and DWM based dot-product engines. In Figures 33(b) and 33(c), the ADCs are
included in the “dot-product” blocks. Compared to an MCA implementation, their proposed design
provides comparable throughput while consuming lower energy. Their proposed PIM dot-product engine
is especially useful as a CNN accelerator.
Fig. 33. Sixty-four “dot products” with sixty-four four-bit weights using dot-product engines implemented using (a) MCA (b) SRAM
(c) DWM Racetracks [60]
Comments: By virtue of storing multiple bits in the nanowire racetracks, the technique of Chung et al.
[60] achieves longer bit-width processing compared to those proposed by Sharad et al. [3] and Sengupta
et al. [2].
Sengupta et al. [62] propose a spintronic ANN accelerator. Both neurons and synapses are
designed with a 3-terminal SOT-driven DWM device similar to the one used in [2]. The functions of the
neuron and synapses are mapped to separate DWM devices. A “spintronic axon circuit” is used to enable
networking of neurons. They implement a “feed-forward ANN” having a hidden layer which is fully
connected to the output layer; this design is illustrated in Figure 34. The hidden layer and the output
layer are mapped to crossbar arrays and connected through “axons”. Input voltages $V_i$ proportional to
the image pixels are applied along the rows while the position of the DW at every cross-point represents the
“synaptic weight”. If $G_{ij}$ is the conductance of the synapse between the $i$th input and the $j$th neuron, and $R_j$ is the
resistance of the neuron’s path, then the synaptic current flowing into the neuron is given by
$$I_j = \frac{\sum_i G_{ij} \cdot V_i}{1 + \gamma} \quad (1)$$
where $\gamma = R_j \sum_i G_{ij}$. When $R_j$ is very small compared to $\frac{1}{\sum_i G_{ij}}$, i.e., $\gamma \ll 1$, the voltage drop across the
neurons can be neglected. In such a case, $I_j$ gives the weighted sum of the inputs, thereby
providing the functionality of a neuron. Since the proposed spintronic neurons can be operated at much
lower voltages than the crossbar array, their accelerator consumes much lower power than its analog
counterpart.
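Equation (1) can be evaluated directly; the sketch below (with illustrative conductance and resistance values, not device parameters from [62]) shows how the neuron current approaches the ideal weighted sum when γ ≪ 1:

```python
def crossbar_neuron_currents(G, V, R):
    """Evaluate Eq. (1): I_j = (sum_i G_ij * V_i) / (1 + gamma_j), with
    gamma_j = R_j * sum_i G_ij. When gamma_j << 1 the current reduces to
    the ideal weighted sum of the inputs."""
    n_in, n_out = len(G), len(G[0])
    currents = []
    for j in range(n_out):
        weighted = sum(G[i][j] * V[i] for i in range(n_in))
        gamma = R[j] * sum(G[i][j] for i in range(n_in))
        currents.append(weighted / (1.0 + gamma))
    return currents

G = [[1e-4, 2e-4], [3e-4, 1e-4]]   # synaptic conductances (illustrative)
V = [0.5, 0.2]                     # input voltages
print(crossbar_neuron_currents(G, V, R=[1.0, 1.0]))    # gamma << 1: ideal sum
print(crossbar_neuron_currents(G, V, R=[1e4, 1e4]))    # voltage drop matters
```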
Ramasubramanian et al. [68] present a deep neural network accelerator which relies on DWM-based
implementation of neurons [2, 3]. These spin neurons are coupled with MCA synapses to form a network of
neurons similar to the ones presented by Roy et al. [61] and Fan et al. [32]. The proposed architecture uses
an array composed of 3-terminal DWM devices as the memory. Their design has a three-tier hierarchical
structure as depicted in Figure 35.
In the lowest tier, “spin neuron arrays” (SNA) are formed by combining the spin neuron network with
peripheral circuitry. SNAs are used for performing thresholding and convolution operations. In the next
tier, “spin neuromorphic cores” (SNC) are formed by combining several SNAs with dispatch units and
local memory. In the highest tier, “SNC clusters” are formed by combining several SNCs via a local bus.
The three-level hierarchy allows the proposed architecture to match the nested parallelism of deep neural
networks. The number of SNAs in each SNC and number of SNCs in each SNC cluster can be changed
Fig. 34. ANN accelerator design proposed by Sengupta et al. [62]
Fig. 35. Three-tier NN accelerator design proposed by Ramasubramanian et al. [68]
to obtain different points of the energy-speed tradeoff. Since spintronic crossbar array can operate at
much lower voltage than their CMOS counterparts, their accelerator consumes much lower energy than
an analog-CMOS based implementation.
5.2 Image processing
He et al. [24] propose an STT-RAM array that works as an NVM and a reconfigurable PIM platform. It
is based on a standard STT-RAM array and uses revised column and row decoders for enabling either a
single line for memory read/write, or two lines for PIM. It has a modified sensing circuit consisting of
two SAs, and a reference generator circuit which provides reference values to the two SAs. The two SAs
evaluate NAND, AND, NOR and OR simultaneously. From these functions, XOR and XNOR functions are
generated using a CMOL combinational circuit. A MUX selects the desired output from the six possible
outputs.
They further propose a novel edge detection algorithm that makes use of the proposed PIM array. In
the case of binary images, the entire image is stored in the memory array. Four neighboring bit-cells are
simultaneously selected and the reference values of the SAs are set such that the edge-detection algorithm
is intrinsically implemented. This is equivalent to a sliding window in conventional image processing.
The algorithm is extended to N -bit grayscale images by dividing the image into N bit-planes from MSB
to LSB and applying the algorithm to each bit-plane separately. The plane-wise results thus obtained
are combined through in-memory pixel-wise OR operation to obtain the final result. Their technique
consumes lower energy than the CMOS implementation of conventional edge detection algorithms. Their
design allows executing complex algorithms on PIM platforms that are otherwise considered suitable only
for basic Boolean functions.
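The bit-plane decomposition and OR-combination can be illustrated with a rough functional sketch in Python; note that the actual in-memory windowing is realized through the SA reference levels, so the explicit neighbor comparison below is only a stand-in for that analog operation:

```python
def bitplanes(image, n_bits):
    """Split an N-bit grayscale image into binary bit-planes, MSB to LSB."""
    return [[[(px >> b) & 1 for px in row] for row in image]
            for b in range(n_bits - 1, -1, -1)]

def binary_edges(plane):
    """Mark a pixel as an edge when it differs from its right or lower
    neighbor -- a simple stand-in for the in-memory sliding window."""
    h, w = len(plane), len(plane[0])
    return [[1 if (x + 1 < w and plane[y][x] != plane[y][x + 1]) or
                  (y + 1 < h and plane[y][x] != plane[y + 1][x]) else 0
             for x in range(w)] for y in range(h)]

def grayscale_edges(image, n_bits=4):
    """Apply the binary algorithm per plane, then OR the plane-wise results."""
    result = None
    for plane in bitplanes(image, n_bits):
        e = binary_edges(plane)
        result = e if result is None else [[a | b for a, b in zip(r1, r2)]
                                           for r1, r2 in zip(result, e)]
    return result

print(grayscale_edges([[0, 0, 15], [0, 0, 15], [7, 7, 15]]))
```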
Roy et al. [61] propose a DWM based architecture for non-Boolean computing. Three-terminal DWM
devices are used to model thresholding neurons. These neurons are networked through a CMOS latch
based signaling system similar to the one used by Sharad et al. [3]. The latches control the transistors which,
in turn, supply synapse currents to the fan-out neurons. A transistor corresponding to a negative weight
acts as a drain, while that corresponding to a positive weight acts as a current source. The proposed NN
architecture combines the three-terminal DWM devices and an MCA. The DW devices serve as neurons,
while the MCA acts as neural synapse and the CMOS signaling system enables efficient connectivity
between them. Simulations show that the thresholding operation is more efficient in spin-based neurons
than in their CMOS counterparts. They evaluate their design using image processing algorithms such as
edge-detection, motion-detection and digitization. Due to the low-operating voltage of the neurons, their
proposed architecture consumes lower energy than the advanced mixed-signal CMOS implementations.
Natsui et al. [78] propose an automated design environment for MTJ-based large-scale integration. The
proposed design environment consists of a combination of standard EDA tools and newly developed
customized tools and libraries. The flow diagram of the design procedure, along with a comparison of
conventional and newly developed techniques, is shown in Figure 36(a). The “Nanolib” tool generates a
technology file for a given circuit netlist. The output comprises functional, structural, environmental and
timing information. “ns-spice mtj” is a SPICE simulation model for MTJs. The custom-developed circuit
simulator combined with “ns-spice mtj” allows generation of MTJ instances using a single line of code,
just like that of transistors. Specialized libraries are created using the circuit simulator and “Nanolib” for
MOS/MTJ hybrid cells. Also included is an HDL preprocessor “Vericonv” which converts an HDL netlist
into a Verilog netlist.
Fig. 36. Architecture of the motion-vector prediction unit proposed by Natsui et al. [78]
Next, they design and fabricate a motion-vector prediction unit using the proposed design environment.
The architecture of the unit, having an 8×8 candidate window along with a 4×4 search window for 8-bit
images, is shown in Figure 36. It consists of 25 processing elements (PEs), each of which contains 16 8-bit
non-volatile adders. Power consumption of the PEs is reduced by precisely controlling the power supply
over every operation cycle. This method of power gating is favourable to implement since the non-volatile
nature of the MTJs allows the power supply to be controlled without worrying about data retention.
This approach reduces the static power dissipation significantly. Power efficiency is also enhanced by
increasing the granularity of power gating. Their motion-vector prediction unit has lower leakage and static
power consumption than CMOS-only designs. Also, the higher packing density and the ability to embed MTJs
over CMOS structures to form 3-D circuits lead to reduced area requirements.
5.3 Data encryption
The PIM capability of spintronic memories offers a unique advantage for data encryption. Performing
encryption directly in memory obviates the need to load the data into volatile memory, encrypt it using
a logic unit and then store it back in memory. Thus, the energy/bandwidth overheads and security risks of
data-movement [79] are entirely avoided. This also allows reaching the level of throughput and energy
efficiency required for big-data applications. Figure 37(a) shows the flow diagram of AES data encryption,
and Figures 37(b) and 37(c) illustrate the ShiftRows and MixColumns operations, respectively. We now
review several works that propose spintronic accelerators for data encryption.
Fig. 37. (a) Flow diagram of AES data encryption for N iterations [50]; (b), (c) illustration of the ShiftRows and MixColumns operations
Angizi et al. [64] propose a PIM architecture for implementing data encryption based on a four-terminal
DWM device driven by spin-Hall effect. The structure of the logic blocks is shown in Figure 38. Each
H-tree shaped sub-array is split into two blocks. Every block consists of four memory cells and four
PIM logic units. The logic units are “threshold logic gates” (TLG) and XOR gates. The TLG is used
to implement majority, AND/NAND and OR/NOR functions. Since XOR accounts for the bulk of the
operations performed in data encryption, Angizi et al. use specialized XOR gates even though XOR can
be implemented with TLGs. In the computing mode, the TLG and XOR units are used for PIM operations
and in the memory mode, the TLG serves as another memory cell.
Fig. 38. Logic blocks with memory (Mem), XOR and TLG units organised into H-tree structures [64]
They further illustrate data encryption by implementing AES on the proposed architecture. The flow
diagram of AES and an illustration of the ShiftRows and MixColumns transformations are shown in Figure
37. They propose three levels of parallelization for mapping the transforms. The first level uses 16 rows
of a memory unit to store a 4×4, 16-byte state matrix, whereas the second level uses two memory units
simultaneously to hold two different state matrices. In a similar manner, this is extended to higher levels
of parallelization. Results show that with increasing degree of parallelization, the speed increases at the
cost of increased area and energy consumption. At the highest degree of parallelization, their architecture
has a lower “energy-delay product” than the CPU, ASIC, CMOS and baseline DWM implementations [80];
however, it has a larger area than the CMOL and baseline DWM implementations.
He et al. [67] propose mapping of AES data encryption to the DWM nanowire based PIM platform
presented by Fan et al. [45]. The perpendicularly coupled nanowire crossbar array is equipped with row
and column decoders to facilitate accessing individual cells. The crossbar array performs in-memory XOR
[45] and bit-shift through DW shifting. Each crossbar stores a single row of the 4×4 state matrix. In
the “AddRoundKey” step, the state matrix is loaded into 4 nanowires such that each nanowire holds
one row. The key matrix is similarly loaded into the perpendicular nanowires and the XOR of the matrices
is retrieved from the intersections. For the “SubBytes” step, the state matrix undergoes an LUT based
transformation. The implementation of the LUT on nanowires is similar to the technique used by Wang et al.
[58]. In the “ShiftRows” step, each row undergoes shifting, which is easily implemented through DW
shifting on the nanowire. For the “MixColumns” step, the addition and the multiplications by 2 and 3 can be
implemented either as a combination of XOR and bit-shifts or with an LUT. Results show that the proposed
implementation consumes lower energy than CPU, ASIC and CMOL implementations of AES.
Wang et al. [65] present a DWM nanowire based PIM architecture and map AES data encryption to
this architecture. This method allows integration of AES ciphers and data encryption within memory. The
16-byte 4 × 4 state matrices are split into eight 4 × 4 state-arrays from MSB to LSB. Each row of such an array is
stored in a nanowire along with a few reserved bits to facilitate shifting. The proposed design makes use of
the nanowire based XOR and LUT implementations proposed by Wang et al. [58]. For the “AddRoundKey”
step, the state-array is bit-wise XORed with a “key-array” through in-memory XOR operations. The
resultant state-array is subjected to a non-linear transformation using an LUT in the “SubBytes” step. In
the “ShiftRows” step, the rows of the state-array are shifted cyclically. The reserved bits in the nanowire
are used to form a “virtual circle” in the nanowire. Each row is shifted by a different amount, i.e., the ith
row is shifted by i − 1 bits. For the last transformation, viz. “MixColumns”, three operations are necessary:
multiplication by 2, multiplication by 3 and bitwise XOR. These can be implemented as a combination of
left shifts and bit-wise XOR (as sketched below), or directly using an LUT. All four transformations of the AES
algorithm are implemented without moving data out of memory. Their proposed design has higher throughput and
energy efficiency and lower area than CPU, ASIC and memristive CMOL implementations; however, its
latency is larger than that of the memristive CMOL and ASIC implementations since DW based XOR and LUT
operations need multiple cycles due to shift operations.
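The shift-and-XOR structure of MixColumns can be made concrete in Python; this is the standard AES arithmetic (checked against the FIPS-197 test column), shown here only to illustrate why DW shifting plus in-memory XOR suffices:

```python
def xtime(b):
    """Multiply by 2 in GF(2^8): a left shift followed by a conditional XOR
    with the AES reduction polynomial 0x1B -- exactly the shift-and-XOR
    combination the in-memory implementations exploit."""
    b <<= 1
    return (b ^ 0x1B) & 0xFF if b & 0x100 else b

def mul3(b):
    """Multiply by 3 = (multiply by 2) XOR (the byte itself)."""
    return xtime(b) ^ b

def mix_single_column(col):
    """One MixColumns column: each output byte is 02*a XOR 03*b XOR c XOR d
    (rotating through the column), i.e. only xtime, mul3 and XOR."""
    a0, a1, a2, a3 = col
    return [xtime(a0) ^ mul3(a1) ^ a2 ^ a3,
            a0 ^ xtime(a1) ^ mul3(a2) ^ a3,
            a0 ^ a1 ^ xtime(a2) ^ mul3(a3),
            mul3(a0) ^ a1 ^ a2 ^ xtime(a3)]

# FIPS-197 test column: [db, 13, 53, 45] -> [8e, 4d, a1, bc]
print([hex(v) for v in mix_single_column([0xDB, 0x13, 0x53, 0x45])])
```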
Comments: The techniques of He et al. [67] and Wang et al. [65] are very similar in terms of mapping
AES to the proposed platforms. The main difference lies in the devices used, i.e., the former uses a simple
crossbar array while the latter uses specialized constructs based on nanowire racetracks. Due to this
difference, the design of He et al. is simpler in terms of logic complexity but requires extensive
programming and control units. The design of Wang et al., on the other hand, has higher logic complexity
but requires a simpler control unit.
Fan et al. [50] propose a PIM platform based on three-terminal DWM devices. The array consists
of memory (‘Mem’) cells and memory/function (‘Mem/Function’) cells, as shown in Figure 39. A ‘Mem’
cell comprises two transistors and one DWM device. The ‘Mem/Function’ cells have an extra access
transistor that is controlled by the “mode activation row decoder”. The ‘Mem’ cells present in a row are
used to store operands and are controlled by row decoders, whereas the ‘Mem/Function’ cells store the
output. PIM is achieved through implementation of the “majority logic” gate since it forms a complete logic
set. All other Boolean functions are implemented as a combination of majority logic.
Fig. 39. (a) The ‘Mem’ and ‘Mem/function’ cells [50]. (b) The PIM platform proposed by Fan et al. [50]
During computations, a current flows from the ‘Mem’ cells to the ‘Mem/Function’ cell which represents
weighted summation of data stored in the ‘Mem’ cells. If the summation current is higher than the “critical
current” of DWM device, the DW moves to the opposite end, otherwise it remains at its initial position.
The result of the operation remains stored in the ‘Mem/Function’ cell in the form of DW displacement.
Their design can be used for implementing any 2-input logic gate as a combination of majority logic.
Implementation of AES on their accelerator achieves higher energy efficiency than that on CPU, ASIC and
CMOL.
5.4 Associative computing
Associative computing is based on the concept of associative search wherein data is accessed using the
content rather than the address. The searched data is processed in a massively parallel manner which
eliminates the need for accessing memory in each and every operation.
Guo et al. [23] present an STT-RAM based associative computation architecture. Their design leverages
PIM to reduce the address computation overhead associated with accessing data. The TCAM array is
composed of 2T-1MTJ cells, as shown in Figure 40(a). Each cell of the array is capable of performing write,
read and search operations. Search is performed by XORing the stored bit with the search-bit. The array
configured for such a search scheme is shown in Figure 40(b). Here, D or D’ is biased and the search-bit is
applied through the search-line.
Fig. 40. (a) 2T-1MTJ bit cell [23] (b) Array structure with XOR gates for search [23]
Their technique employs bit-serial search, i.e., it follows an iterative approach by searching the array
column-after-column. Each row is equipped with additional logic circuitry such as SAs, flip-flops, and
multi-match circuits. Organization of the TCAM array is shown in Figure 41. After a search operation,
on-chip microcontrollers perform summation, match-count and indexing operations. The result of these
operations is stored in memory and subsequently retrieved by the processor. By virtue of using TCAM and
PIM to implement search operation, their architecture provides higher performance and energy efficiency
than a DRAM-based system. The limitation of their technique is that with increasing search-width, its
delay and energy consumption also increase.
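The column-serial search can be sketched functionally in Python (a behavioral model of the bit-serial scheme, not the 2T-1MTJ circuit); the outer loop over columns also makes visible why delay grows with search-width:

```python
def bit_serial_search(rows, key, width):
    """Sketch of the column-serial TCAM search: all rows are compared
    against the key one bit-column at a time; a per-row match flip-flop
    is cleared as soon as the stored bit XORs to 1 with the search bit."""
    match = [1] * len(rows)
    for col in range(width):                    # iterate column-after-column
        kbit = (key >> col) & 1
        for r, word in enumerate(rows):
            if match[r] and (((word >> col) & 1) ^ kbit):
                match[r] = 0                    # mismatch on this column
    return [r for r, m in enumerate(match) if m]

rows = [0b1010, 0b0110, 0b1010, 0b0001]
print(bit_serial_search(rows, 0b1010, width=4))   # -> [0, 2]
```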
Fig. 41. Array organization of the PIM architecture proposed by Guo et al. [23]
Comments: The associative computing architecture proposed by Guo et al. [23] and the PIM accelerator
proposed by Jain et al. [26] both make use of 2T-1MTJ cells. However, the structures of these cells are
different. The bit-cell structure used by Guo et al. has four terminals while the bit-cell used by Jain et al.
has five terminals. The extra terminal used by Jain et al. is the “write line mode” (WLM) terminal which
is used to toggle between memory and PIM modes.
6 Conclusion and Future Outlook
Memory latency and bandwidth constraints have now become the key bottleneck in scaling the
performance of modern processors. Although traditional techniques such as prefetching [81] and data
compression [82] can mitigate these overheads partially, approaches that provide much higher efficiency
are required for architecting next-generation processors. In this paper, we presented a survey of
spintronic architectures for enabling “processing-in-memory” and designing accelerators for “neural
networks”. We conclude this paper with a discussion of future challenges.
Apart from performance, area and energy-efficiency, other metrics such as high reliability and high
yield at small feature-sizes, security and cost-effectiveness will also determine whether spintronic memories
see wide-scale integration in production systems. Most research works, however, do not evaluate the
proposed architectures on all these metrics. A comprehensive evaluation of spintronic architectures and
management techniques is required to establish their effectiveness soundly.
While conventional memories and compute-centric architectures fall far short of meeting the grand
challenges of AI, the spintronic memories of today are also unable to meet these targets. Evidently, there is
a need for concerted efforts across the entire computing stack to address these issues. From the device
perspective, continuing feature-scaling while reducing the fault-rate will allow improving integration
density. At the microarchitecture level, hiding the large latency of these memories will require the use of
techniques such as pipelining, prefetching and write-coalescing. As novel machine learning algorithms are
proposed and deployed in various applications, designing spintronic accelerators customized for different
algorithms and applications is required to ensure high efficiency.
With decreasing feature size, the error rate in processor components increases [83]. Due to this, the
functional units may become slow and/or faulty. In such cases, the PIM approach becomes even more
important. While previous works have studied the PIM capability of spintronic architectures primarily for
energy and performance benefits, exploring the benefits of PIM for tolerating errors presents a
promising research avenue in the near future.
R EFERENCES
[1] S. Mittal and S. Nag, "A survey of encoding techniques for reducing data-movement energy," Journal of Systems Architecture, 2018.
[2] A. Sengupta and K. Roy, "A vision for all-spin neural networks: A device to system perspective," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 63, no. 12, pp. 2267–2277, 2016.
[3] M. Sharad, C. Augustine, G. Panagopoulos, and K. Roy, "Spin-based neuron model with domain-wall magnets as synapse," IEEE Transactions on Nanotechnology, vol. 11, no. 4, pp. 843–853, 2012.
[4] S. Mittal, "A Survey of Techniques for Architecting Processor Components using Domain Wall Memory," ACM Journal on Emerging Technologies in Computing Systems, 2016.
[5] S. Mittal, J. S. Vetter, and D. Li, "A Survey Of Architectural Approaches for Managing Embedded DRAM and Non-volatile On-chip Caches," IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 26, no. 6, pp. 1524–1537, 2015.
[6] X. Chen, E. H.-M. Sha, Q. Zhuge, W. Jiang, J. Chen, J. Chen, and J. Xu, "A unified framework for designing high performance in-memory and hybrid memory file systems," Journal of Systems Architecture, vol. 68, pp. 51–64, 2016.
[7] S. Mittal and J. S. Vetter, "A survey of software techniques for using non-volatile memories for storage and main memory systems," IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 27, no. 5, pp. 1537–1550, 2016.
[8] S. Peng, Y. Zhang, M. Wang, Y. Zhang, and W. Zhao, "Magnetic tunnel junctions for spintronics: principles and applications," Wiley Encyclopedia of Electrical and Electronics Engineering, pp. 1–16, 2014.
[9] M. Wang, W. Cai, K. Cao, J. Zhou, J. Wrona, S. Peng, H. Yang, J. Wei, W. Kang, Y. Zhang et al., "Current-induced magnetization switching in atom-thick tungsten engineered perpendicular magnetic tunnel junctions with large tunnel magnetoresistance," Nature Communications, vol. 9, no. 1, p. 671, 2018.
[10] I. Ahmed, Z. Zhao, M. G. Mankalale, S. S. Sapatnekar, J.-P. Wang, and C. H. Kim, "A comparative study between spin-transfer-torque and spin-Hall-effect switching mechanisms in PMTJ using SPICE," IEEE Journal on Exploratory Solid-State Computational Devices and Circuits, vol. 3, pp. 74–82, 2017.
[11] S. Mittal, R. Wang, and J. Vetter, "DESTINY: A Comprehensive Tool with 3D and Multi-level Cell Memory Modeling Capability," Journal of Low Power Electronics and Applications, vol. 7, no. 3, p. 23, 2017.
[12] W. Kang, Y. Cheng, Y. Zhang, D. Ravelosona, and W. Zhao, "Readability challenges in deeply scaled STT-MRAM," in Non-Volatile Memory Technology Symposium (NVMTS), 2014 14th Annual. IEEE, 2014, pp. 1–4.
[13] S. Mittal, "A survey of soft-error mitigation techniques for non-volatile memories," Computers, vol. 6, no. 8, 2017.
[14] S. Mittal, J. Vetter, and L. Jiang, "Addressing Read-disturbance Issue in STT-RAM by Data Compression and Selective Duplication," IEEE Computer Architecture Letters, vol. 16, no. 2, pp. 94–98, 2017.
[15] W. Kang, Z. Wang, H. Zhang, S. Li, Y. Zhang, and W. Zhao, "Advanced low power spintronic memories beyond STT-MRAM," in Proceedings of the Great Lakes Symposium on VLSI 2017. ACM, 2017, pp. 299–304.
[16] J. G. Alzate, P. K. Amiri, P. Upadhyaya, S. Cherepov, J. Zhu, M. Lewis, R. Dorrance, J. Katine, J. Langer, K. Galatsis et al., "Voltage-induced switching of nanoscale magnetic tunnel junctions," in Electron Devices Meeting (IEDM), 2012 IEEE International. IEEE, 2012, pp. 29–5.
[17] A. Roohi, R. Zand, and R. F. DeMara, "A tunable majority gate-based full adder using current-induced domain wall nanomagnets," IEEE Transactions on Magnetics, vol. 52, no. 8, pp. 1–7, 2016.
[18] K. Huang, R. Zhao, and Y. Lian, “A low power and high sensing margin non-volatile full adder using racetrack memory,”
IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 62, no. 4, pp. 1109–1116, 2015.
[19] W. Kang, Y. Huang, X. Zhang, Y. Zhou, and W. Zhao, "Skyrmion-electronics: An overview and outlook," Proceedings of the
IEEE, vol. 104, no. 10, pp. 2040–2061, 2016.
[20] X. Zhang, M. Ezawa, and Y. Zhou, “Magnetic skyrmion logic gates: conversion, duplication and merging of skyrmions,”
Scientific reports, vol. 5, p. 9400, 2015.
[21] Q. An, L. Su, J.-O. Klein, S. Le Beux, I. O’Connor, and W. Zhao, “Full-adder circuit design based on all-spin logic device,” in
Nanoscale Architectures (NANOARCH), 2015 IEEE/ACM International Symposium on. IEEE, 2015, pp. 163–168.
[22] H. Mahmoudi, T. Windbacher, V. Sverdlov, and S. Selberherr, “High performance MRAM-based stateful logic,” in Ultimate
Integration on Silicon (ULIS), 2014 15th International Conference on. IEEE, 2014, pp. 117–120.
[23] Q. Guo, X. Guo, R. Patel, E. Ipek, and E. G. Friedman, “AC-DIMM: associative computing with STT-MRAM,” ACM SIGARCH
Computer Architecture News, vol. 41, no. 3, pp. 189–200, 2013.
[24] Z. He, S. Angizi, and D. Fan, “Exploring STT-MRAM Based In-Memory Computing Paradigm with Application of Image
Edge Extraction,” in Computer Design (ICCD), 2017 IEEE International Conference on. IEEE, 2017, pp. 439–446.
[25] W. Kang, H. Wang, Z. Wang, Y. Zhang, and W. Zhao, “In-Memory Processing Paradigm for Bitwise Logic Operations in
STT–MRAM,” IEEE Transactions on Magnetics, vol. 53, no. 11, pp. 1–4, 2017.
[26] S. Jain, S. Sapatnekar, J.-P. Wang, K. Roy, and A. Raghunathan, “Computing-in-Memory with Spintronics,” DATE, pp. 1640–
1645, 2018.
[27] H. Mahmoudi, T. Windbacher, V. Sverdlov, and S. Selberherr, "MRAM-based logic array for large-scale non-volatile logic-in-memory applications," in Nanoscale Architectures (NANOARCH), 2013 IEEE/ACM International Symposium on. IEEE, 2013, pp. 26–27.
[28] S. Matsunaga, J. Hayakawa, S. Ikeda, K. Miura, T. Endoh, H. Ohno, and T. Hanyu, “MTJ-based nonvolatile logic-in-memory
circuit, future prospects and issues,” in Proceedings of the Conference on Design, Automation and Test in Europe. European
Design and Automation Association, 2009, pp. 433–435.
[29] F. Parveen, Z. He, S. Angizi, and D. Fan, “HielM: Highly flexible in-memory computing using STT MRAM,” in Design
Automation Conference (ASP-DAC), 2018 23rd Asia and South Pacific. IEEE, 2018, pp. 361–366.
[30] P. Butzen, M. Slimani, Y. Wang, H. Cai et al., “Reliable majority voter based on spin transfer torque magnetic tunnel junction
device,” Electronics Letters, vol. 52, no. 1, pp. 47–49, 2015.
[31] A. F. Vincent, J. Larroque, N. Locatelli, N. B. Romdhane, O. Bichler, C. Gamrat, W. S. Zhao, J.-O. Klein, S. Galdin-Retailleau,
and D. Querlioz, “Spin-transfer torque magnetic memory as a stochastic memristive synapse for neuromorphic systems,”
IEEE transactions on biomedical circuits and systems, vol. 9, no. 2, pp. 166–174, 2015.
[32] D. Fan, Y. Shim, A. Raghunathan, and K. Roy, “STT-SNN: A spin-transfer-torque based soft-limiting non-linear neuron for
low-power artificial neural networks,” IEEE Transactions on Nanotechnology, vol. 14, no. 6, pp. 1013–1023, 2015.
[33] H. Cai, Y. Wang, L. A. Naviner, Z. Wang, and W. Zhao, “Approximate computing in MOS/spintronic non-volatile full-adder,”
in Nanoscale Architectures (NANOARCH), 2016 IEEE/ACM International Symposium on. IEEE, 2016, pp. 203–208.
[34] L. A. de Barros Naviner, H. Cai, Y. Wang, W. Zhao, and A. B. Dhia, “Stochastic computation with spin torque transfer
magnetic tunnel junction,” in New Circuits and Systems Conference (NEWCAS), 2015 IEEE 13th International. IEEE, 2015, pp.
1–4.
[35] Y. Wang, H. Cai, L. A. Naviner, J.-O. Klein, J. Yang, and W. Zhao, “A novel circuit design of true random number generator
using magnetic tunnel junction,” in Nanoscale Architectures (NANOARCH), 2016 IEEE/ACM International Symposium on. IEEE,
2016, pp. 123–128.
[36] T. Hanyu, D. Suzuki, N. Onizawa, S. Matsunaga, M. Natsui, and A. Mochizuki, "Spintronics-based nonvolatile logic-in-memory architecture towards an ultra-low-power and highly reliable VLSI computing paradigm," in Proceedings of the 2015
Design, Automation & Test in Europe Conference & Exhibition. EDA Consortium, 2015, pp. 1006–1011.
[37] T. Hanyu, D. Suzuki, A. Mochizuki, M. Natsui, N. Onizawa, T. Sugibayashi, S. Ikeda, T. Endoh, and H. Ohno, “Challenge
of MOS/MTJ-hybrid nonvolatile logic-in-memory architecture in dark-silicon era,” in Electron Devices Meeting (IEDM), 2014
IEEE International. IEEE, 2014, pp. 28–2.
[38] B. Lokesh and M. Malathi, “Full adder based reconfigurable spintronic ALU using STT-MTJ,” in India Conference (INDICON),
2013 Annual IEEE. IEEE, 2013, pp. 1–5.
[39] D. Kumar, M. Saw, and A. Islam, "Design of 2:1 multiplexer and 1:2 demultiplexer using magnetic tunnel junction elements,"
in Emerging Trends in VLSI, Embedded System, Nano Electronics and Telecommunication System (ICEVENT), 2013 International
Conference on. IEEE, 2013, pp. 1–5.
[40] X. Yao, J. Harms, A. Lyle, F. Ebrahimi, Y. Zhang, and J.-P. Wang, “Magnetic tunnel junction-based spintronic logic units
operated by spin transfer torque,” IEEE Transactions on Nanotechnology, vol. 11, no. 1, pp. 120–126, 2012.
[41] S. R. Patil, X. Yao, H. Meng, J.-P. Wang, and D. J. Lilja, “Design of a spintronic arithmetic and logic unit using magnetic
tunnel junctions,” in Proceedings of the 5th conference on Computing frontiers. ACM, 2008, pp. 171–178.
[42] Q. An, S. Le Beux, I. O’Connor, J. O. Klein, and W. Zhao, “Arithmetic Logic Unit based on all-spin logic devices,” in New
Circuits and Systems Conference (NEWCAS), 2017 15th IEEE International. IEEE, 2017, pp. 317–320.
[43] F. Ren and D. Markovic, “True energy-performance analysis of the MTJ-based logic-in-memory architecture (1-bit full
adder),” IEEE Transactions on Electron Devices, vol. 57, no. 5, pp. 1023–1028, 2010.
[44] E. Deng, Z. Wang, J.-O. Klein, G. Prenat, B. Dieny, and W. Zhao, “High-frequency low-power magnetic full-adder based on
magnetic tunnel junction with spin-hall assistance,” IEEE Transactions on Magnetics, vol. 51, no. 11, pp. 1–4, 2015.
[45] D. Fan, Z. He, and S. Angizi, “Leveraging Spintronic Devices for Ultra-Low Power In-Memory Computing: Logic and Neural
Network,” pp. 1109–1112, 2017.
[46] H. Zhang, W. Kang, L. Wang, K. L. Wang, and W. Zhao, “Stateful Reconfigurable Logic via a Single-Voltage-Gated Spin
Hall-Effect Driven Magnetic Tunnel Junction in a Spintronic Memory,” IEEE Transactions on Electron Devices, vol. 64, no. 10,
pp. 4295–4301, 2017.
[47] S. Angizi, Z. He, F. Parveen, and D. Fan, “IMCE: energy-efficient bit-wise in-memory convolution engine for deep neural
network,” in Proceedings of the 23rd Asia and South Pacific Design Automation Conference. IEEE Press, 2018, pp. 111–116.
[48] D. Fan and S. Angizi, “Energy Efficient In-Memory Binary Deep Neural Network Accelerator with Dual-Mode SOT-MRAM,”
in 2017 IEEE 35th International Conference on Computer Design (ICCD). IEEE, 2017, pp. 609–612.
[49] L. Chang, Z. Wang, Y. Zhang, and W. Zhao, “Reconfigurable processing in memory architecture based on spin orbit torque,”
in 2017 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH). IEEE, 2017, pp. 95–96.
[50] D. Fan, S. Angizi, and Z. He, “In-Memory Computing with Spintronic Devices,” in VLSI (ISVLSI), 2017 IEEE Computer Society
Annual Symposium on. IEEE, 2017, pp. 683–688.
[51] F. Parveen, S. Angizi, Z. He, and D. Fan, “Low power in-memory computing based on dual-mode SOT-MRAM,” in Low
Power Electronics and Design (ISLPED), 2017 IEEE/ACM International Symposium on. IEEE, 2017, pp. 1–6.
[52] A. Roohi, R. Zand, D. Fan, and R. F. DeMara, “Voltage-based concatenatable full adder using spin Hall effect switching,”
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 12, pp. 2134–2138, 2017.
[53] A. Jaiswal, A. Agrawal, and K. Roy, “In-situ, In-Memory Stateful Vector Logic Operations based on Voltage Controlled
Magnetic Anisotropy,” Scientific reports, vol. 8, no. 1, p. 5738, 2018.
[54] L. Wang, W. Kang, F. Ebrahimi, X. Li, Y. Huang, C. Zhao, K. L. Wang, and W. Zhao, “Voltage-controlled magnetic tunnel
junctions for processing-in-memory implementation,” IEEE Electron Device Letters, vol. 39, no. 3, pp. 440–443, 2018.
[55] H.-P. Trinh, W. Zhao, J.-O. Klein, Y. Zhang, D. Ravelsona, and C. Chappert, “Domain wall motion based magnetic adder,”
Electronics letters, vol. 48, no. 17, pp. 1049–1051, 2012.
[56] T. Luo, W. Zhang, B. He, and D. Maskell, “A racetrack memory based in-memory booth multiplier for cryptography
application,” in Design Automation Conference (ASP-DAC), 2016 21st Asia and South Pacific. IEEE, 2016, pp. 286–291.
[57] K. Huang and R. Zhao, “Magnetic domain-wall racetrack memory-based nonvolatile logic for low-power computing and
fast run-time-reconfiguration,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 9, pp. 2861–2872,
2016.
[58] Y. Wang, P. Kong, and H. Yu, “Logic-in-memory based big-data computing by nonvolatile domain-wall nanowire devices,”
in Non-Volatile Memory Technology Symposium (NVMTS), 2013 13th. IEEE, 2013, pp. 1–6.
[59] H. Yu, Y. Wang, S. Chen, W. Fei, C. Weng, J. Zhao, and Z. Wei, “Energy efficient in-memory machine learning for data
intensive image-processing by non-volatile domain-wall memory,” in Design Automation Conference (ASP-DAC), 2014 19th
Asia and South Pacific. IEEE, 2014, pp. 191–196.
[60] J. Chung, J. Park, and S. Ghosh, “Domain wall memory based convolutional neural networks for bit-width extendability and
energy-efficiency,” in ISLPED. ACM, 2016, pp. 332–337.
[61] K. Roy, M. Sharad, D. Fan, and K. Yogendra, “Exploring Boolean and non-Boolean computing with spin torque devices,” in
Computer-Aided Design (ICCAD), 2013 IEEE/ACM International Conference on. IEEE, 2013, pp. 576–580.
[62] A. Sengupta, Y. Shim, and K. Roy, “Proposal for an all-spin artificial neural network: Emulating neural and synaptic
functionalities through domain wall motion in ferromagnets,” IEEE TBioCAS, vol. 10, no. 6, pp. 1152–1160, 2016.
[63] S. Deb, L. Ni, H. Yu, and A. Chattopadhyay, “Racetrack memory-based encoder/decoder for low-power interconnect
architectures,” in SAMOS. IEEE, 2016, pp. 281–287.
[64] S. Angizi, Z. He, N. Bagherzadeh, and D. Fan, “Design and Evaluation of a Spintronic In-Memory Processing Platform for
Non-Volatile Data Encryption,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2017.
[65] Y. Wang, H. Yu, D. Sylvester, and P. Kong, "Energy efficient in-memory AES encryption based on nonvolatile domain-wall nanowire," in Proceedings of the conference on Design, Automation & Test in Europe. European Design and Automation Association, 2014, p. 183.
[66] S. Angizi, Z. He, R. F. DeMara, and D. Fan, “Composite spintronic accuracy-configurable adder for low power digital signal
processing,” in Quality Electronic Design (ISQED), 2017 18th International Symposium on. IEEE, 2017, pp. 391–396.
[67] Z. He, S. Angizi, F. Parveen, and D. Fan, “Leveraging Dual-Mode Magnetic Crossbar for Ultra-low Energy In-Memory Data
Encryption,” in Proceedings of the on Great Lakes Symposium on VLSI 2017. ACM, 2017, pp. 83–88.
[68] S. G. Ramasubramanian, R. Venkatesan, M. Sharad, K. Roy, and A. Raghunathan, “SPINDLE: SPINtronic deep learning
engine for large-scale neuromorphic computing,” in Proceedings of the 2014 international symposium on Low power electronics
and design. ACM, 2014, pp. 15–20.
[69] X. Chen, W. Kang, D. Zhu, X. Zhang, N. Lei, Y. Zhang, Y. Zhou, and W. Zhao, “A compact skyrmionic leaky–integrate–fire
spiking neuron device,” Nanoscale, vol. 10, no. 13, pp. 6139–6146, 2018.
[70] S. Li, W. Kang, Y. Huang, X. Zhang, Y. Zhou, and W. Zhao, “Magnetic skyrmion-based artificial neuron device,”
Nanotechnology, vol. 28, no. 31, p. 31LT01, 2017.
[71] Y. Huang, W. Kang, X. Zhang, Y. Zhou, and W. Zhao, “Magnetic skyrmion-based synaptic devices,” Nanotechnology, vol. 28,
no. 8, p. 08LT02, 2017.
[72] Z. Wang, L. Zhang, M. Wang, Z. Wang, D. Zhu, Y. Zhang, and W. Zhao, “High-density nand-like spin transfer torque memory
with spin orbit torque erase operation,” IEEE Electron Device Letters, vol. 39, no. 3, pp. 343–346, 2018.
[73] S. Mittal, “A Survey Of Techniques for Approximate Computing,” ACM Computing Surveys, 2016.
[74] A. D. Booth, “A signed binary multiplication technique,” The Quarterly Journal of Mechanics and Applied Mathematics, vol. 4,
no. 2, pp. 236–240, 1951.
[75] C. S. Lent, M. Liu, and Y. Lu, “Bennett clocking of quantum-dot cellular automata and the limits to binary logic scaling,”
Nanotechnology, vol. 17, no. 16, p. 4240, 2006.
[76] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural
networks,” in European Conference on Computer Vision. Springer, 2016, pp. 525–542.
[77] S. Mittal, “A Survey of ReRAM-based Architectures for Processing-in-memory and Neural Networks,” Machine learning and
knowledge extraction, vol. 1, p. 5, 2018.
[78] M. Natsui, D. Suzuki, N. Sakimura, R. Nebashi, Y. Tsuji, A. Morioka, T. Sugibayashi, S. Miura, H. Honjo, K. Kinoshita et al.,
“Nonvolatile logic-in-memory lsi using cycle-based power gating and its application to motion-vector prediction,” IEEE
Journal of Solid-State Circuits, vol. 50, no. 2, pp. 476–489, 2015.
[79] S. Mittal and A. Alsalibi, “A survey of techniques for improving security of non-volatile memories,” Journal of Hardware and
Systems Security, 2018.
[80] Y. Wang, L. Ni, C.-H. Chang, and H. Yu, "DW-AES: A domain-wall nanowire-based AES for high throughput and energy-efficient data encryption in non-volatile memory," IEEE Transactions on Information Forensics and Security, vol. 11, no. 11, pp. 2426–2440, 2016.
[81] S. Mittal, “A Survey of Recent Prefetching Techniques for Processor Caches,” ACM Computing Surveys, 2016.
[82] S. Mittal and J. Vetter, “A Survey Of Architectural Approaches for Data Compression in Cache and Main Memory Systems,”
IEEE TPDS, vol. 27, no. 5, pp. 1524–1536, 2016.
[83] S. Mittal and M. S. Inukonda, “A Survey of Techniques for Improving Error-Resilience of DRAM,” Journal of Systems
Architecture, vol. 91, pp. 11–40, 2018.