diff --git a/doc/crypto.dox b/doc/crypto.dox
index 1ab97c14..a42d6a9e 100644
--- a/doc/crypto.dox
+++ b/doc/crypto.dox
@@ -136,14 +136,21 @@ Arduino Mega 2560 running at 16 MHz are similar:
 <tr><td>P521::sign()</td><td>60514ms</td><td>Digital signature generation</td></tr>
 <tr><td>P521::verify()</td><td>109078ms</td><td>Digital signature verification</td></tr>
 <tr><td>P521::derivePublicKey()</td><td>46290ms</td><td>Derive a public key from a private key</td></tr>
+<tr><td>NewHope::keygen(), Ref</td><td>639ms</td><td>Generate key pair for Alice, Ref version</td></tr>
+<tr><td>NewHope::sharedb(), Ref</td><td>1237ms</td><td>Generate shared secret and public key for Bob, Ref version</td></tr>
+<tr><td>NewHope::shareda(), Ref</td><td>496ms</td><td>Generate shared secret for Alice, Ref version</td></tr>
+<tr><td>NewHope::keygen(), Torref</td><td>777ms</td><td>Generate key pair for Alice, Torref version</td></tr>
+<tr><td>NewHope::sharedb(), Torref</td><td>1376ms</td><td>Generate shared secret and public key for Bob, Torref version</td></tr>
+<tr><td>NewHope::shareda(), Torref</td><td>496ms</td><td>Generate shared secret for Alice, Torref version</td></tr>
 
 Where a cipher supports more than one key size (such as ChaCha), the values
 are typically almost identical for 128-bit and 256-bit keys so only the
 maximum is shown above.
 
-Due to the memory requirements, NewHope is not yet possible on AVR-based
-Arduino systems.
+Due to their memory requirements, P521 and NewHope performance was measured
+on an Arduino Mega 2560 running at 16 MHz. They are too big to fit in the
+RAM of the Uno.
 
 \subsection crypto_performance_arm Performance on ARM
 
@@ -213,7 +220,7 @@ All figures are for the Arduino Due running at 84 MHz:
 <tr><td>P521::verify()</td><td>3423ms</td><td>Digital signature verification</td></tr>
 <tr><td>P521::derivePublicKey()</td><td>1503ms</td><td>Derive a public key from a private key</td></tr>
 <tr><td>NewHope::keygen(), Ref</td><td>29ms</td><td>Generate key pair for Alice, Ref version</td></tr>
-<tr><td>NewHope::sharedb(), Ref</td><td>40ms</td><td>Generate shared secret and public key for Bob, Ref version</td></tr>
+<tr><td>NewHope::sharedb(), Ref</td><td>41ms</td><td>Generate shared secret and public key for Bob, Ref version</td></tr>
 <tr><td>NewHope::shareda(), Ref</td><td>9ms</td><td>Generate shared secret for Alice, Ref version</td></tr>
 <tr><td>NewHope::keygen(), Torref</td><td>42ms</td><td>Generate key pair for Alice, Torref version</td></tr>
 <tr><td>NewHope::sharedb(), Torref</td><td>53ms</td><td>Generate shared secret and public key for Bob, Torref version</td></tr>
diff --git a/doc/newhope-small.dox b/doc/newhope-small.dox
new file mode 100644
index 00000000..dae1a547
--- /dev/null
+++ b/doc/newhope-small.dox
@@ -0,0 +1,321 @@
+/*
+ * Copyright (C) 2016 Southern Storm Software, Pty Ltd.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included
+ * in all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
+ * OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+ * IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ */
+
+/**
+\file newhope-small.dox
+\page newhope_small Small Memory Footprint New Hope
+
+This page describes the techniques that were used to reduce the
+post-quantum New Hope
+key exchange algorithm in size for running on Arduino systems with limited
+amounts of RAM. It is intended to help other implementors of New Hope
+save time in figuring out how to reduce the memory size of the algorithm.
+
+On systems like AVR and x86 that allow byte-aligned access to 16-bit values,
+this implementation requires around 2K of memory for the function parameters
+and up to 4.5K of temporary stack space for intermediate values. On systems
+like ARM, the sizes are similar but the sharedb() function requires another
+2K of temporary stack space if the input parameters are not aligned on a
+16-bit boundary.
+
+\section newhope_small_keygen keygen()
+
+In pseudo-code, the keygen() function from the reference C implementation of
+New Hope from the algorithm authors performs the following operations
+(the sizes in bytes of all parameters and local variables are indicated):
+
+\code
+keygen(send[1824], sk[2048]):
+    locals: seed[32], noiseseed[32], a[2048], e[2048], r[2048], pk[2048]
+    seed = sha3(randombytes(32))
+    noiseseed = randombytes(32)
+    a = uniform(seed)
+    sk = ntt(getnoise(noiseseed, 0))
+    e = ntt(getnoise(noiseseed, 1))
+    r = pointwise(sk, a)
+    pk = e + r
+    send = encode_a(pk, seed)
+\endcode
+
+This requires a total of 3872 bytes of parameter space and 8256 bytes of
+stack space. There is also additional stack space for temporary SHA3,
+SHAKE128, and ChaCha20 objects and output buffers. Those objects can
+easily account for another 400 to 500 bytes of stack space.
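The two totals for the reference keygen() can be checked at compile time. The following is a minimal sketch and not part of the library: the constant and function names are hypothetical, and only the buffer sizes are taken from the pseudo-code.

```cpp
#include <cstddef>

// Buffer sizes in bytes, as given in the reference keygen() pseudo-code.
constexpr std::size_t kSendBytes = 1824;  // send[] parameter
constexpr std::size_t kPolyBytes = 2048;  // one polynomial: sk, a, e, r or pk
constexpr std::size_t kSeedBytes = 32;    // seed or noiseseed

// Parameter space: the send buffer plus the sk polynomial.
constexpr std::size_t keygen_ref_param_bytes()
{
    return kSendBytes + kPolyBytes;
}

// Stack space: seed, noiseseed, and the four local polynomials a, e, r, pk.
constexpr std::size_t keygen_ref_stack_bytes()
{
    return 2 * kSeedBytes + 4 * kPolyBytes;
}

static_assert(keygen_ref_param_bytes() == 3872, "parameter space");
static_assert(keygen_ref_stack_bytes() == 8256, "stack space");
```

The temporary SHA3, SHAKE128, and ChaCha20 objects are deliberately left out of this count, as in the text.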
+
+We note that some of the local variables in the pseudo-code above are only
+live in some parts of the function. For example, pk is not touched until
+the second-last statement and by that time sk and a are no
+longer required. We can rearrange the function to reuse local variables
+that are no longer live as follows:
+
+\code
+keygen(send[1824], sk[2048]):
+    locals: seed[32], noiseseed[32], a[2048], pk[2048]
+    seed = sha3(randombytes(32))
+    noiseseed = randombytes(32)
+    a = uniform(seed)
+    sk = ntt(getnoise(noiseseed, 0))
+    pk = pointwise(sk, a)
+    a = ntt(getnoise(noiseseed, 1))
+    pk = a + pk
+    send = encode_a(pk, seed)
+\endcode
+
+This saves 4096 bytes of stack space. It is possible to save the 64 bytes
+for seed and noiseseed by directly writing them to the
+send buffer:
+
+\code
+keygen(send[1824], sk[2048]):
+    locals: a[2048], pk[2048]
+    send(1792:1823) = sha3(randombytes(32))
+    send(0:31) = randombytes(32)
+    a = uniform(send(1792:1823))
+    sk = ntt(getnoise(send(0:31), 0))
+    pk = pointwise(sk, a)
+    a = ntt(getnoise(send(0:31), 1))
+    pk = a + pk
+    send(0:1791) = tobytes(pk)
+\endcode
+
+Packing temporary values into the caller-supplied parameters is a common
+feature of the optimizations described on this page. Since the caller
+has already supplied a big chunk of free memory to the function, it would
+be a shame not to make use of it.
+
+The Arduino implementation also packs the temporary SHA3, SHAKE128, and
+ChaCha20 objects into the send buffer and unused local variables at
+different points in the function. This considerably reduces the stack
+footprint of sub-functions like uniform(), getnoise(), and helprec().
+
+At this point we are using 3872 bytes of parameter space and 4096 bytes of
+stack space. We can reduce the parameter space even further by noticing
+that the sk value is wholly determined by the 32-byte
+noiseseed value.
+The shareda() function could regenerate
+sk itself from the 32-byte noiseseed, trading off time
+for memory:
+
+\code
+keygen(send[1824], noiseseed[32]):
+    locals: a[2048], pk[2048]
+    send(1792:1823) = sha3(randombytes(32))
+    noiseseed = randombytes(32)
+    a = uniform(send(1792:1823))
+    pk = ntt(getnoise(noiseseed, 0))
+    pk = pointwise(pk, a)
+    a = ntt(getnoise(noiseseed, 1))
+    pk = a + pk
+    send(0:1791) = tobytes(pk)
+\endcode
+
+Now we have 1856 bytes of parameter space and 4096 bytes of stack space,
+plus a few hundred bytes of stack frame overhead for sub-functions
+(the Arduino version of SHA3/SHAKE128 requires 200 bytes of stack space
+for temporary values; other sub-functions are similar). The Arduino
+version of New Hope uses up to 400 bytes of stack space overhead in
+the worst case.
+
+The uniform() function has two variants for the "ref" and "torref" versions
+of the New Hope algorithm. The "torref" variant requires 2688 bytes to
+represent the a value before sorting reduces it to 2048 bytes. This
+isn't actually a problem because we can lay out the stack space with a union:
+
+\code
+union {
+    struct {
+        uint16_t a[PARAM_N];     // first 2048 bytes of a_ext after sorting
+        uint16_t pk[PARAM_N];    // begins in the discarded tail of a_ext
+    };
+    uint16_t a_ext[84 * 16];     // 2688 bytes of unsorted "torref" data
+} state;
+\endcode
+
+The uniform data derived from the seed is generated into a_ext,
+sorted, and then the trailing 640 bytes of a_ext are discarded.
+The trailing space is then used to store pk later in the function.
+
+\section newhope_small_shareda shareda()
+
+Before tackling the more difficult sharedb(), we will move on to the final
+New Hope step for generating the shared secret for Alice.
+In pseudo-code,
+the original reference C implementation is as follows:
+
+\code
+shareda(shared[32], sk[2048], received[2048]):
+    locals: v[2048], bp[2048], c[2048]
+    (bp, c) = decode_b(received)
+    v = invntt(pointwise(sk, bp))
+    shared = sha3(rec(v, c))
+\endcode
+
+We can eliminate c by splitting the decode_b() step:
+
+\code
+shareda(shared[32], sk[2048], received[2048]):
+    locals: v[2048], bp[2048]
+    bp = decode_b_1st_half(received(0:1791))
+    v = invntt(pointwise(sk, bp))
+    bp = decode_b_2nd_half(received(1792:2047))
+    shared = sha3(rec(v, bp))
+\endcode
+
+We now have 4128 bytes of parameter space and 4096 bytes of stack space.
+The shared buffer can overlap with either sk or received
+in the caller to save another 32 bytes of parameter space.
+
+Earlier we replaced sk with the 32-byte noiseseed. We can
+regenerate sk within shareda() as follows:
+
+\code
+shareda(shared[32], noiseseed[32], received[2048]):
+    locals: v[2048], bp[2048]
+    v = ntt(getnoise(noiseseed, 0))
+    bp = decode_b_1st_half(received(0:1791))
+    v = invntt(pointwise(v, bp))
+    bp = decode_b_2nd_half(received(1792:2047))
+    shared = sha3(rec(v, bp))
+\endcode
+
+This results in 2112 bytes of parameter space (2080 if shared
+overlaps with noiseseed or received) and 4096 bytes
+of direct stack space, plus up to 400 bytes of stack overhead for
+sub-functions as before.
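The split of the received buffer used by the two decode_b halves can be pictured as two views into the same 2048 bytes. This is a minimal sketch under stated assumptions: ReceivedView and split_received() are hypothetical names that do not appear in the library, and only the byte ranges come from the pseudo-code above.

```cpp
#include <cstddef>
#include <cstdint>

// Byte ranges taken from the shareda() pseudo-code.
constexpr std::size_t kPolyPackedBytes = 1792;  // received(0:1791)
constexpr std::size_t kRecDataBytes    = 256;   // received(1792:2047)
constexpr std::size_t kReceivedBytes   = kPolyPackedBytes + kRecDataBytes;

// Hypothetical read-only view of the two halves of Bob's message.
struct ReceivedView {
    const std::uint8_t *poly;  // input for decode_b_1st_half()
    const std::uint8_t *rec;   // input for decode_b_2nd_half()
};

// Returns pointers into the caller's buffer; nothing is copied.
inline ReceivedView split_received(const std::uint8_t *received)
{
    return ReceivedView{received, received + kPolyPackedBytes};
}

static_assert(kReceivedBytes == 2048, "Bob's message is 2048 bytes");
```

Because each half is decoded straight from the caller's buffer, bp can be reused for the second half once pointwise() has consumed the first, which is what removes the c local.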
+
+\section newhope_small_sharedb sharedb()
+
+As before we start with the pseudo-code for the reference C implementation
+of sharedb():
+
+\code
+sharedb(shared[32], send[2048], received[1824]):
+    locals: sp[2048], ep[2048], v[2048], a[2048], pka[2048],
+            c[2048], epp[2048], bp[2048], seed[32], noiseseed[32]
+    noiseseed = randombytes(32)
+    (pka, seed) = decode_a(received)
+    a = uniform(seed)
+    sp = ntt(getnoise(noiseseed, 0))
+    ep = ntt(getnoise(noiseseed, 1))
+    bp = pointwise(a, sp)
+    bp = bp + ep
+    v = invntt(pointwise(pka, sp))
+    epp = getnoise(noiseseed, 2)
+    v = v + epp
+    c = helprec(v, noiseseed, 3)
+    send = encode_b(bp, c)
+    shared = sha3(rec(v, c))
+\endcode
+
+This requires a massive 3904 bytes of parameter space and 16448 bytes
+of stack space! We start by doing liveness analysis on the local
+variables and hiding seed and noiseseed inside parameters:
+
+\code
+sharedb(shared[32], send[2048], received[1824]):
+    locals: a[2048], v[2048], bp[2048]
+    send(1824:1855) = randombytes(32)
+    a = uniform(received(1792:1823))
+    v = ntt(getnoise(send(1824:1855), 0))
+    bp = pointwise(a, v)
+    a = ntt(getnoise(send(1824:1855), 1))
+    bp = bp + a
+    a = frombytes(received(0:1791))
+    v = invntt(pointwise(a, v))
+    a = getnoise(send(1824:1855), 2)
+    v = v + a
+    a = helprec(v, send(1824:1855), 3)
+    send = encode_b(bp, a)
+    shared = sha3(rec(v, a))
+\endcode
+
+Now we are down to 3904 bytes of parameter space and 6144 bytes of
+stack space. We can save 1824 bytes of parameter space by combining
+the send and received buffers into one 2048-byte buffer.
+On entry, this combined buffer contains Alice's public key and on exit
+it contains Bob's public key. This reduces the parameter space to 2080 bytes.
+
+Note above that noiseseed was placed into bytes 1824-1855 of
+send. This was to ensure that it did not overwrite the
+received value if the buffers were shared.
+
+This is the best we can do on systems that require that 16-bit values
+are aligned on 16-bit address boundaries.
+If, however, we are operating on
+an 8-bit system like the AVR, we can do even better. The send
+buffer is the same size as bp: 2048 bytes. As long as we are
+careful to move the incoming values in received out of the way
+beforehand, we can use the send buffer as a temporary poly object:
+
+\code
+sharedb(shared[32], send[2048], received[1824]):
+    locals: a[2048], v[2048], seed[32], noiseseed[32]
+    noiseseed = randombytes(32)
+    (a, seed) = decode_a(received)
+    send = ntt(getnoise(noiseseed, 0))
+    v = invntt(pointwise(a, send))
+    send = getnoise(noiseseed, 2)
+    v = v + send
+    a = helprec(v, noiseseed, 3)
+    send(1792:2047) = encode_b_2nd_half(a)
+    shared = sha3(rec(v, a))
+    a = uniform(seed)
+    v = ntt(getnoise(noiseseed, 0))
+    a = pointwise(a, v)
+    v = ntt(getnoise(noiseseed, 1))
+    a = a + v
+    send(0:1791) = encode_b_1st_half(a)
+\endcode
+
+This requires 3904 bytes of parameter space and 4160 bytes of stack space,
+plus up to 400 bytes of stack overhead for sub-functions as before.
+The parameter space can be further reduced to 2080 bytes if send
+and received occupy the same buffer.
+
+Note that "ntt(getnoise(noiseseed, 0))" is evaluated twice. This frees up
+a local variable earlier in the function, at the cost of some speed.
+
+\section newhope_small_summary Summary
+
+In summary, the three primitives of New Hope require the following amounts
+of memory on systems with byte-aligned access and buffer sharing:
+
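+<table>
+<tr><th>Primitive</th><th>Parameter Space</th><th>Direct Stack Space</th><th>Stack with Overhead (400 bytes)</th><th>Parameters + Stack + Overhead</th></tr>
+<tr><td>keygen()</td><td>1856</td><td>4096</td><td>4496</td><td>6352</td></tr>
+<tr><td>sharedb()</td><td>2080</td><td>4160</td><td>4560</td><td>6640</td></tr>
+<tr><td>shareda()</td><td>2080</td><td>4096</td><td>4496</td><td>6576</td></tr>
+</table>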
+
+On 16-bit, 32-bit, or 64-bit systems that lack byte-aligned access,
+with a full 2048-byte private key for Alice, and no buffer sharing,
+the maximum memory requirements are:
+
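+<table>
+<tr><th>Primitive</th><th>Parameter Space</th><th>Direct Stack Space</th><th>Stack with Overhead (400 bytes)</th><th>Parameters + Stack + Overhead</th></tr>
+<tr><td>keygen()</td><td>3872</td><td>4096</td><td>4496</td><td>8368</td></tr>
+<tr><td>sharedb()</td><td>3904</td><td>6144</td><td>6544</td><td>10448</td></tr>
+<tr><td>shareda()</td><td>4128</td><td>4096</td><td>4496</td><td>8624</td></tr>
+</table>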
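The final column of both tables is the parameter space plus the direct stack space plus the 400-byte worst-case overhead. The sketch below re-derives each total at compile time. The Footprint type and total_bytes() helper are hypothetical names used only for illustration; the figures are the ones in the tables.

```cpp
#include <cstddef>

// Worst-case sub-function stack overhead assumed by the tables.
constexpr std::size_t kOverheadBytes = 400;

// Hypothetical record holding the first two numeric columns of a table row.
struct Footprint {
    std::size_t param_bytes;  // caller-supplied parameter space
    std::size_t stack_bytes;  // direct stack space for local variables
};

// Parameters + stack + overhead, i.e. the last column of each table.
constexpr std::size_t total_bytes(Footprint f)
{
    return f.param_bytes + f.stack_bytes + kOverheadBytes;
}

// Byte-aligned access with buffer sharing.
static_assert(total_bytes({1856, 4096}) == 6352, "keygen()");
static_assert(total_bytes({2080, 4160}) == 6640, "sharedb()");
static_assert(total_bytes({2080, 4096}) == 6576, "shareda()");

// No byte-aligned access, full sk, no buffer sharing.
static_assert(total_bytes({3872, 4096}) == 8368, "keygen()");
static_assert(total_bytes({3904, 6144}) == 10448, "sharedb()");
static_assert(total_bytes({4128, 4096}) == 8624, "shareda()");
```

The largest entry in each group, 6640 and 10448 bytes, corresponds to sharedb() in both cases.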
+
+All operations can be performed in around 6.5K of memory on an 8-bit
+AVR Arduino system, and with at most 10.2K of memory on a 32-bit ARM
+Arduino system.
+
+*/
diff --git a/libraries/NewHope/NewHope.cpp b/libraries/NewHope/NewHope.cpp
index 822f64ca..5e89b812 100644
--- a/libraries/NewHope/NewHope.cpp
+++ b/libraries/NewHope/NewHope.cpp
@@ -53,12 +53,12 @@ void *operator new(size_t size, void *ptr)
  * New Hope is an ephemeral key exchange algorithm, similar to Diffie-Hellman,
  * which is believed to be resistant to quantum computers.
  *
- * \note The functions in this class need up to 7k of stack space to
- * store temporary intermediate values in addition to up to 4k of
- * memory in the application to store public and private key parameters.
- * Due to these memory requirements, this class is only suitable for
- * use on high-end ARM-based Arduino variants like the Arduino Due.
- * It won't fit in the available memory on AVR-based Arduino variants.
+ * \note The functions in this class need a substantial amount of memory
+ * for function parameters and stack space. On an 8-bit AVR system
+ * it is possible to operate with around 2K of parameter space and 4.5K of
+ * stack space if the parameters are in shared buffers. More information
+ * on the memory requirements and how they were reduced is on
+ * \ref newhope_small "this page".
  *
  * Key exchange occurs between two parties, Alice and Bob, and results
  * in a 32-byte (256-bit) shared secret. Alice's public key is 1824
@@ -86,6 +86,16 @@ void *operator new(size_t size, void *ptr)
  * and can then begin encrypting session traffic with shared_secret
  * or some transformed version of it.
  *
+ * To reduce the memory requirements, the second and third parameters to
+ * sharedb() can point to the same 2048-byte buffer. On entry, the first
+ * 1824 bytes of the buffer are filled with Alice's public key. On exit,
+ * the buffer is filled with the 2048 bytes of Bob's public key:
+ *
+ * \code
+ * uint8_t shared_secret[NEWHOPE_SHAREDBYTES];
+ * NewHope::sharedb(shared_secret, public_key, public_key);
+ * \endcode
+ *
  * When Alice's application receives bob_public, the application
  * performs the following final steps to generate her version of the
  * shared secret: