diff --git a/doc/crypto.dox b/doc/crypto.dox
index 1ab97c14..a42d6a9e 100644
--- a/doc/crypto.dox
+++ b/doc/crypto.dox
@@ -136,14 +136,21 @@ Ardunino Mega 2560 running at 16 MHz are similar:
P521::sign() | 60514ms | Digital signature generation |
P521::verify() | 109078ms | Digital signature verification |
P521::derivePublicKey() | 46290ms | Derive a public key from a private key |
+NewHope::keygen(), Ref | 639ms | Generate key pair for Alice, Ref version |
+NewHope::sharedb(), Ref | 1237ms | Generate shared secret and public key for Bob, Ref version |
+NewHope::shareda(), Ref | 496ms | Generate shared secret for Alice, Ref version |
+NewHope::keygen(), Torref | 777ms | Generate key pair for Alice, Torref version |
+NewHope::sharedb(), Torref | 1376ms | Generate shared secret and public key for Bob, Torref version |
+NewHope::shareda(), Torref | 496ms | Generate shared secret for Alice, Torref version |
Where a cipher supports more than one key size (such as ChaCha), the values
are typically almost identical for 128-bit and 256-bit keys so only the
maximum is shown above.
-Due to the memory requirements, NewHope is not yet possible on AVR-based
-Arduino systems.
+Due to their memory requirements, P521 and NewHope performance was measured on
+an Arduino Mega 2560 running at 16 MHz. Both are too large to fit in the
+RAM of the Uno.
\subsection crypto_performance_arm Performance on ARM
@@ -213,7 +220,7 @@ All figures are for the Arduino Due running at 84 MHz:
P521::verify() | 3423ms | Digital signature verification |
P521::derivePublicKey() | 1503ms | Derive a public key from a private key |
NewHope::keygen(), Ref | 29ms | Generate key pair for Alice, Ref version |
-NewHope::sharedb(), Ref | 40ms | Generate shared secret and public key for Bob, Ref version |
+NewHope::sharedb(), Ref | 41ms | Generate shared secret and public key for Bob, Ref version |
NewHope::shareda(), Ref | 9ms | Generate shared secret for Alice, Ref version |
NewHope::keygen(), Torref | 42ms | Generate key pair for Alice, Torref version |
NewHope::sharedb(), Torref | 53ms | Generate shared secret and public key for Bob, Torref version |
diff --git a/doc/newhope-small.dox b/doc/newhope-small.dox
new file mode 100644
index 00000000..dae1a547
--- /dev/null
+++ b/doc/newhope-small.dox
@@ -0,0 +1,321 @@
+/*
+ * Copyright (C) 2016 Southern Storm Software, Pty Ltd.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included
+ * in all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
+ * OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ */
+
+/**
+\file newhope-small.dox
+\page newhope_small Small Memory Footprint New Hope
+
+This page describes the techniques that were used to shrink the memory
+footprint of the post-quantum New Hope key exchange algorithm so that it can
+run on Arduino systems with limited amounts of RAM. It is intended to save
+other implementors of New Hope the time of working out how to reduce the
+memory size of the algorithm.
+
+On systems like AVR and x86 that allow byte-aligned access to 16-bit values,
+this implementation requires around 2K of memory for the function parameters
+and up to 4.5K of temporary stack space for intermediate values. On systems
+like ARM, the sizes are similar but the sharedb() function requires another
+2K of temporary stack space if the input parameters are not aligned on a
+16-bit boundary.
+
+\section newhope_small_keygen keygen()
+
+In pseudo-code, the keygen() function from the reference C implementation of
+New Hope from the algorithm authors performs the following operations
+(the size in bytes of each parameter and local variable is indicated):
+
+\code
+keygen(send[1824], sk[2048]):
+ locals: seed[32], noiseseed[32], a[2048], e[2048], r[2048], pk[2048]
+ seed = sha3(randombytes(32))
+ noiseseed = randombytes(32)
+ a = uniform(seed)
+ sk = ntt(getnoise(noiseseed, 0))
+ e = ntt(getnoise(noiseseed, 1))
+ r = pointwise(sk, a)
+ pk = e + r
+ send = encode_a(pk, seed)
+\endcode
+
+This requires a total of 3872 bytes of parameter space and 8256 bytes of
+stack space. There is also additional stack space for temporary SHA3,
+SHAKE128, and ChaCha20 objects and output buffers. Those objects can
+easily account for another 400 to 500 bytes of stack space.
+
+We note that some of the local variables in the pseudo-code above are only
+live in some parts of the function. For example, pk is not touched until
+the second-last statement and by that time sk and a are no
+longer required. We can rearrange the function to reuse local variables
+that are no longer live as follows:
+
+\code
+keygen(send[1824], sk[2048]):
+ locals: seed[32], noiseseed[32], a[2048], pk[2048]
+ seed = sha3(randombytes(32))
+ noiseseed = randombytes(32)
+ a = uniform(seed)
+ sk = ntt(getnoise(noiseseed, 0))
+ pk = pointwise(sk, a)
+ a = ntt(getnoise(noiseseed, 1))
+ pk = a + pk
+ send = encode_a(pk, seed)
+\endcode
+
+This saves 4096 bytes of stack space. It is possible to save the 64 bytes
+for seed and noiseseed by directly writing them to the
+send buffer:
+
+\code
+keygen(send[1824], sk[2048]):
+ locals: a[2048], pk[2048]
+ send(1792:1823) = sha3(randombytes(32))
+ send(0:31) = randombytes(32)
+ a = uniform(send(1792:1823))
+ sk = ntt(getnoise(send(0:31), 0))
+ pk = pointwise(sk, a)
+ a = ntt(getnoise(send(0:31), 1))
+ pk = a + pk
+ send(0:1791) = tobytes(pk)
+\endcode
+
+Packing temporary values into the caller-supplied parameters is a common
+feature of the optimizations described on this page. Since the caller
+has already supplied a big chunk of free memory to the function, it would
+be a shame not to make use of it.
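+
+The packing idea can be illustrated with a small C++ sketch. It is not the
+library's code: fill_pattern(), SeedViews, place_seeds(), and
+store_packed_pk() are hypothetical names, and fill_pattern() is a
+deterministic stand-in for randombytes() and sha3() that is not random at all.
+The sketch only shows that the two seeds can live inside send and that the
+final packing step leaves the seed at the end of the buffer intact.
+
```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Sizes taken from the pseudo-code above.
constexpr size_t POLY_BYTES = 1792;                     // packed pk
constexpr size_t SEED_BYTES = 32;
constexpr size_t SEND_BYTES = POLY_BYTES + SEED_BYTES;  // 1824

// Hypothetical stand-in for randombytes(); it is NOT a random source.
inline void fill_pattern(uint8_t *out, size_t len, uint8_t first)
{
    for (size_t i = 0; i < len; ++i)
        out[i] = static_cast<uint8_t>(first + i);
}

// Views into the caller's buffer that replace the seed[] and noiseseed[] locals.
struct SeedViews {
    uint8_t *seed;       // send(1792:1823), kept in the final output
    uint8_t *noiseseed;  // send(0:31), overwritten once pk is packed
};

inline SeedViews place_seeds(uint8_t (&send)[SEND_BYTES])
{
    SeedViews views{send + POLY_BYTES, send};
    fill_pattern(views.seed, SEED_BYTES, 0x80);
    fill_pattern(views.noiseseed, SEED_BYTES, 0x10);
    return views;
}

// Final step: send(0:1791) = tobytes(pk). The seed at the end is untouched.
inline void store_packed_pk(uint8_t (&send)[SEND_BYTES],
                            const uint8_t (&packed)[POLY_BYTES])
{
    assert(sizeof(packed) <= sizeof(send) - SEED_BYTES);
    memcpy(send, packed, POLY_BYTES);
}
```
+
+A caller would invoke place_seeds() first, use the two views while computing
+pk, and call store_packed_pk() last, at which point noiseseed is no longer
+needed.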
+
+The Arduino implementation also packs the temporary SHA3, SHAKE128, and
+ChaCha20 objects into the send buffer and unused local variables at
+different points in the function. This considerably reduces the stack
+footprint of sub-functions like uniform(), getnoise(), and helprec().
+
+At this point we are using 3872 bytes of parameter space and 4096 bytes of
+stack space. We can reduce the parameter space even further by noticing
+that the sk value is wholly determined by the 32-byte
+noiseseed value. The shareda() function could regenerate
+sk itself from the 32-byte noiseseed, trading off time
+for memory:
+
+\code
+keygen(send[1824], noiseseed[32]):
+ locals: a[2048], pk[2048]
+ send(1792:1823) = sha3(randombytes(32))
+ noiseseed = randombytes(32)
+ a = uniform(send(1792:1823))
+ pk = ntt(getnoise(noiseseed, 0))
+ pk = pointwise(pk, a)
+ a = ntt(getnoise(noiseseed, 1))
+ pk = a + pk
+ send(0:1791) = tobytes(pk)
+\endcode
+
+Now we have 1856 bytes of parameter space and 4096 bytes of stack space,
+plus a few hundred bytes of stack frame overhead for sub-functions
+(the Arduino version of SHA3/SHAKE128 requires 200 bytes of stack space
+for temporary values; other sub-functions are similar). The Arduino
+version of New Hope uses up to 400 bytes of stack space overhead in
+the worst case.
+
+The uniform() function has two variants for the "ref" and "torref" versions
+of the New Hope algorithm. The "torref" variant requires 2688 bytes to
+represent the a value before sorting reduces it to 2048 bytes. This
+isn't actually a problem because we can lay out the stack space with a union:
+
+\code
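+union {
+    struct {
+        uint16_t a[PARAM_N];      // first 2048 bytes of a_ext after sorting
+        uint16_t pk[PARAM_N];     // starts inside the discarded tail of a_ext
+    };
+    uint16_t a_ext[84 * 16];      // 2688 bytes of unsorted "torref" data
+} state;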
+
+The uniform data derived from the seed is generated into a_ext,
+sorted, and then the trailing 640 bytes of a_ext are discarded.
+The trailing space is then used to store pk later in the function.
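+
+The overlap described above can be checked at compile time. The sketch below
+is an assumption-laden illustration rather than the library's definition: it
+assumes a_ext shares storage with a followed by pk, it takes PARAM_N to be
+1024 (2048 bytes of 16-bit coefficients), and it uses the hypothetical names
+RefLayout and KeygenState with a named inner struct so that offsetof() is
+portable.
+
```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

constexpr size_t PARAM_N = 1024;         // 2048 bytes / 2 bytes per coefficient
constexpr size_t A_EXT_WORDS = 84 * 16;  // size of the unsorted "torref" data

// The two polynomials that must be live at the same time.
struct RefLayout {
    uint16_t a[PARAM_N];
    uint16_t pk[PARAM_N];
};

// a_ext overlays a and the first part of pk.
union KeygenState {
    RefLayout ref;
    uint16_t a_ext[A_EXT_WORDS];
};

constexpr size_t A_EXT_BYTES = sizeof(uint16_t) * A_EXT_WORDS;
constexpr size_t PK_OFFSET = offsetof(RefLayout, pk);
constexpr size_t OVERLAP_BYTES = A_EXT_BYTES - PK_OFFSET;

static_assert(A_EXT_BYTES == 2688, "a_ext should occupy 2688 bytes");
static_assert(PK_OFFSET == 2048, "pk should start right after a");
static_assert(OVERLAP_BYTES == 640, "the discarded tail of a_ext is 640 bytes");
static_assert(sizeof(KeygenState) == 4096, "no extra stack beyond a and pk");
```
+
+Because the union is no larger than a and pk together, the "torref" variant
+costs no additional stack space under these assumptions.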
+
+\section newhope_small_shareda shareda()
+
+Before tackling the more difficult sharedb(), we will move on to the final
+New Hope step: generating the shared secret for Alice. In pseudo-code,
+the original reference C implementation is as follows:
+
+\code
+shareda(shared[32], sk[2048], received[2048]):
+ locals: v[2048], bp[2048], c[2048]
+ (bp, c) = decode_b(received)
+ v = invntt(pointwise(sk, bp))
+ shared = sha3(rec(v, c))
+\endcode
+
+We can eliminate c by splitting the decode_b() step:
+
+\code
+shareda(shared[32], sk[2048], received[2048]):
+ locals: v[2048], bp[2048]
+ bp = decode_b_1st_half(received(0:1791))
+ v = invntt(pointwise(sk, bp))
+ bp = decode_b_2nd_half(received(1792:2047))
+ shared = sha3(rec(v, bp))
+\endcode
+
+We now have 4128 bytes of parameter space and 4096 bytes of stack space.
+The shared buffer can overlap with either sk or received
+in the caller to save another 32 bytes of parameter space.
+
+Earlier we replaced sk with the 32-byte noiseseed. We can
+regenerate sk within shareda() as follows:
+
+\code
+shareda(shared[32], noiseseed[32], received[2048]):
+ locals: v[2048], bp[2048]
+ v = ntt(getnoise(noiseseed, 0))
+ bp = decode_b_1st_half(received(0:1791))
+ v = invntt(pointwise(v, bp))
+ bp = decode_b_2nd_half(received(1792:2047))
+ shared = sha3(rec(v, bp))
+\endcode
+
+This results in 2112 bytes of parameter space (2080 if shared
+overlaps with noiseseed or received) and 4096 bytes
+of direct stack space, plus up to 400 bytes of stack overhead for
+sub-functions as before.
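+
+The time-for-memory trade of keeping only noiseseed can be sketched in C++
+as follows. The function expand_seed() is a hypothetical stand-in for
+ntt(getnoise(noiseseed, 0)): it uses a toy linear congruential generator that
+is neither cryptographically secure nor the real noise sampler. The only
+property it is meant to show is that the expansion is deterministic, so a
+32-byte seed is enough to recreate the 2048-byte polynomial on demand.
+
```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

constexpr size_t PARAM_N = 1024;    // 2048 bytes of 16-bit coefficients
constexpr size_t SEED_BYTES = 32;
constexpr uint32_t PARAM_Q = 12289; // New Hope modulus

// Hypothetical stand-in for ntt(getnoise(noiseseed, nonce)).
// Toy LCG only: NOT secure and NOT the real sampler.
inline void expand_seed(uint16_t (&poly)[PARAM_N],
                        const uint8_t (&seed)[SEED_BYTES], uint8_t nonce)
{
    uint32_t state = 0x9E3779B9u ^ nonce;
    // Absorb the seed bytes into the generator state.
    for (size_t i = 0; i < SEED_BYTES; ++i)
        state = state * 1664525u + 1013904223u + seed[i];
    // Produce one coefficient per step, reduced below the modulus.
    for (size_t i = 0; i < PARAM_N; ++i) {
        state = state * 1664525u + 1013904223u;
        poly[i] = static_cast<uint16_t>((state >> 16) % PARAM_Q);
    }
}
```
+
+In this sketch keygen() would hand the caller only the seed, and shareda()
+would call expand_seed() into its own v buffer before the pointwise
+multiplication, paying for the expansion a second time.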
+
+\section newhope_small_sharedb sharedb()
+
+As before we start with the pseudo-code for the reference C implementation
+of sharedb():
+
+\code
+sharedb(shared[32], send[2048], received[1824]):
+ locals: sp[2048], ep[2048], v[2048], a[2048], pka[2048],
+ c[2048], epp[2048], bp[2048], seed[32], noiseseed[32]
+ noiseseed = randombytes(32)
+ (pka, seed) = decode_a(received)
+ a = uniform(seed)
+ sp = ntt(getnoise(noiseseed, 0))
+ ep = ntt(getnoise(noiseseed, 1))
+ bp = pointwise(a, sp)
+ bp = bp + ep
+ v = invntt(pointwise(pka, sp))
+ epp = getnoise(noiseseed, 2)
+ v = v + epp
+ c = helprec(v, noiseseed, 3)
+ send = encode_b(bp, c)
+ shared = sha3(rec(v, c))
+\endcode
+
+This requires a massive 3904 bytes of parameter space and 16448 bytes
+of stack space! We start by doing liveness analysis on the local
+variables and hiding seed and noiseseed inside parameters:
+
+\code
+sharedb(shared[32], send[2048], received[1824]):
+ locals: a[2048], v[2048], bp[2048]
+ send(1824:1855) = randombytes(32)
+ a = uniform(received(1792:1823))
+ v = ntt(getnoise(send(1824:1855), 0))
+ bp = pointwise(a, v)
+ a = ntt(getnoise(send(1824:1855), 1))
+ bp = bp + a
+ a = frombytes(received(0:1791))
+ v = invntt(pointwise(a, v))
+ a = getnoise(send(1824:1855), 2)
+ v = v + a
+ a = helprec(v, send(1824:1855), 3)
+ send = encode_b(bp, a)
+ shared = sha3(rec(v, a))
+\endcode
+
+Now we are down to 3904 bytes of parameter space and 6144 bytes of
+stack space. We can save 1824 bytes of parameter space by combining
+the send and received buffers into one 2048-byte buffer.
+On entry, this combined buffer contains Alice's public key and on exit
+it contains Bob's public key. This reduces the parameter space to 2080 bytes.
+
+Note above that noiseseed was placed into bytes 1824-1855 of
+send. This was to ensure that it did not overwrite the
+received value if the buffers were shared.
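+
+The byte ranges involved can be checked with a few constants. This is a
+minimal sketch: ranges_overlap() and the constant names are hypothetical and
+the offsets are copied from the pseudo-code above. It confirms that
+noiseseed at bytes 1824-1855 lies inside the 2048-byte buffer but outside
+the 1824 bytes of Alice's public key.
+
```cpp
#include <assert.h>
#include <stddef.h>

constexpr size_t SHARED_BYTES = 32;        // shared secret
constexpr size_t RECEIVED_BYTES = 1824;    // Alice's public key
constexpr size_t SEND_BYTES = 2048;        // Bob's public key
constexpr size_t SEED_BYTES = 32;
constexpr size_t NOISESEED_OFFSET = 1824;  // send(1824:1855)

// True when the half-open ranges [a, a + alen) and [b, b + blen) overlap.
constexpr bool ranges_overlap(size_t a, size_t alen, size_t b, size_t blen)
{
    return a < b + blen && b < a + alen;
}

static_assert(!ranges_overlap(0, RECEIVED_BYTES, NOISESEED_OFFSET, SEED_BYTES),
              "noiseseed must not clobber Alice's public key");
static_assert(NOISESEED_OFFSET + SEED_BYTES <= SEND_BYTES,
              "noiseseed must fit inside the send buffer");

// Parameter space with separate buffers and with one combined buffer.
constexpr size_t SEPARATE_PARAM_BYTES = SHARED_BYTES + SEND_BYTES + RECEIVED_BYTES;
constexpr size_t COMBINED_PARAM_BYTES = SHARED_BYTES + SEND_BYTES;
```
+
+The difference between the two totals is the 1824 bytes saved by sharing the
+buffer.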
+
+This is the best we can do on systems that require 16-bit values to be
+aligned on 16-bit address boundaries. If, however, we are operating on
+an 8-bit system like the AVR, we can do even better. The send
+buffer is the same size as bp: 2048 bytes. As long as we are
+careful to move the incoming values in received out of the way
+beforehand, we can use the send buffer as a temporary poly object:
+
+\code
+sharedb(shared[32], send[2048], received[1824]):
+ locals: a[2048], v[2048], seed[32], noiseseed[32]
+ noiseseed = randombytes(32)
+ (a, seed) = decode_a(received)
+ send = ntt(getnoise(noiseseed, 0))
+ v = invntt(pointwise(a, send))
+ send = getnoise(noiseseed, 2)
+ v = v + send
+ a = helprec(v, noiseseed, 3)
+ send(1792:2047) = encode_b_2nd_half(a)
+ shared = sha3(rec(v, a))
+ a = uniform(seed)
+ v = ntt(getnoise(noiseseed, 0))
+ a = pointwise(a, v)
+ v = ntt(getnoise(noiseseed, 1))
+ a = a + v
+ send(0:1791) = encode_b_1st_half(a)
+\endcode
+
+This requires 3904 bytes of parameter space and 4160 bytes of stack space,
+plus up to 400 bytes of stack overhead for sub-functions as before.
+The parameter space can be further reduced to 2080 bytes if send
+and received occupy the same buffer.
+
+Note that "ntt(getnoise(noiseseed, 0))" is evaluated twice. This frees up
+a local variable earlier in the function, at the cost of some speed.
+
+\section newhope_small_summary Summary
+
+In summary, the three primitives of New Hope require the following amounts
+of memory on systems with byte alignment and buffer sharing:
+
+
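| Primitive | Parameter Space (bytes) | Direct Stack Space (bytes) | Stack with 400 Bytes of Overhead (bytes) | Parameters + Stack + Overhead (bytes) |
| --------- | ----------------------: | -------------------------: | ---------------------------------------: | ------------------------------------: |
| keygen() | 1856 | 4096 | 4496 | 6352 |
| sharedb() | 2080 | 4160 | 4560 | 6640 |
| shareda() | 2080 | 4096 | 4496 | 6576 |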
+
+
+On 16-bit, 32-bit, or 64-bit systems that do not allow byte-aligned access
+to 16-bit values, with a full 2048-byte private key for Alice and no buffer
+sharing, the maximum memory requirements are:
+
+
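| Primitive | Parameter Space (bytes) | Direct Stack Space (bytes) | Stack with 400 Bytes of Overhead (bytes) | Parameters + Stack + Overhead (bytes) |
| --------- | ----------------------: | -------------------------: | ---------------------------------------: | ------------------------------------: |
| keygen() | 3872 | 4096 | 4496 | 8368 |
| sharedb() | 3904 | 6144 | 6544 | 10448 |
| shareda() | 4128 | 4096 | 4496 | 8624 |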
+
+
+All operations can be performed in around 6.5K of memory on an 8-bit
+AVR Arduino system, and with at most 10.2K of memory on a 32-bit ARM
+Arduino system.
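+
+The totals in the two tables can be reproduced with a few lines of C++.
+This is only a bookkeeping sketch: Footprint, total_bytes(), and the constant
+names are hypothetical, and the figures are copied from the sections above
+with the 400-byte worst-case overhead applied to every primitive.
+
```cpp
#include <assert.h>
#include <stddef.h>

constexpr size_t STACK_OVERHEAD = 400;  // worst-case sub-function overhead

struct Footprint {
    size_t params;        // parameter space in bytes
    size_t direct_stack;  // direct stack space in bytes
};

// Parameters + direct stack + sub-function overhead.
constexpr size_t total_bytes(Footprint f)
{
    return f.params + f.direct_stack + STACK_OVERHEAD;
}

// Byte alignment and buffer sharing (8-bit AVR case).
constexpr Footprint KEYGEN_SHARED = {1856, 4096};
constexpr Footprint SHAREDB_SHARED = {2080, 4160};
constexpr Footprint SHAREDA_SHARED = {2080, 4096};

// Aligned access required, full private key, no buffer sharing.
constexpr Footprint KEYGEN_ALIGNED = {3872, 4096};
constexpr Footprint SHAREDB_ALIGNED = {3904, 6144};
constexpr Footprint SHAREDA_ALIGNED = {4128, 4096};

static_assert(total_bytes(SHAREDB_SHARED) == 6640, "largest shared-buffer total");
static_assert(total_bytes(SHAREDB_ALIGNED) == 10448, "largest aligned total");
```
+
+In both configurations sharedb() is the largest primitive, which is where the
+6.5K and 10.2K figures come from.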
+
+*/
diff --git a/libraries/NewHope/NewHope.cpp b/libraries/NewHope/NewHope.cpp
index 822f64ca..5e89b812 100644
--- a/libraries/NewHope/NewHope.cpp
+++ b/libraries/NewHope/NewHope.cpp
@@ -53,12 +53,12 @@ void *operator new(size_t size, void *ptr)
* New Hope is an ephemeral key exchange algorithm, similar to Diffie-Hellman,
* which is believed to be resistant to quantum computers.
*
- * \note The functions in this class need up to 7k of stack space to
- * store temporary intermediate values in addition to up to 4k of
- * memory in the application to store public and private key parameters.
- * Due to these memory requirements, this class is only suitable for
- * use on high-end ARM-based Arduino variants like the Arduino Due.
- * It won't fit in the available memory on AVR-based Arduino variants.
+ * \note The functions in this class need a substantial amount of memory
+ * for function parameters and stack space. On an 8-bit AVR system
+ * it is possible to operate with around 2K of parameter space and 4.5K of
+ * stack space if the parameters are in shared buffers. More information
+ * on the memory requirements and how they were reduced is on
+ * \ref newhope_small "this page".
*
* Key exchange occurs between two parties, Alice and Bob, and results
* in a 32-byte (256-bit) shared secret. Alice's public key is 1824
@@ -86,6 +86,16 @@ void *operator new(size_t size, void *ptr)
* and can then begin encrypting session traffic with shared_secret
* or some transformed version of it.
*
+ * To reduce the memory requirements, the second and third parameters to
+ * sharedb() can point to the same 2048-byte buffer. On entry, the first
+ * 1824 bytes of the buffer are filled with Alice's public key. On exit,
+ * the buffer is filled with the 2048 bytes of Bob's public key:
+ *
+ * \code
+ * uint8_t shared_secret[NEWHOPE_SHAREDBYTES];
+ * NewHope::sharedb(shared_secret, public_key, public_key);
+ * \endcode
+ *
* When Alice's application receives bob_public, the application
* performs the folllowing final steps to generate her version of the
* shared secret: