tech-kern archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: Patch: cprng_fast performance - please review.



Repeating what I said privately for the public record:

The changes to cprng.h expose a lot of guts of the new cprng_fast
implementation.  I don't see justification for this.  As I understand
it, the goals of rewriting cprng_fast are:

1. Replace the unsafe arc4random by a safer algorithm.
2. Improve scalability of cprng_fast.

And the only performance constraint is that its single-threaded
performance should not be worse than the existing arc4random-based
cprng_fast.

Since the guts of the arc4random-based cprng_fast weren't exposed in
cprng.h, it is not clear to me that, in order to satisfy those goals
subject to the performance constraints, the hc128-based cprng_fast
needs to be exposed there either.

If replacing arc4random by hc128 hurts single-threaded performance,
then that suggests revisiting the algorithm.  If replacing the global
serialized state by per-CPU state hurts single-threaded performance,
then it may be worthwhile to consider the micro-optimizations.  In
both cases, the performance hurt should be demonstrated by measurement
first, however.

I'm inclined to say that replacing the algorithm should be done
separately from replacing the bookkeeping, too, so that it is easier
to review the changes and to isolate problems should they arise.

As I understand it, you chose hc128 because its security exceeds
arc4random's and its performance in naive C code exceeds other modern
stream ciphers.  I'd never heard of hc128, so I'm nervous about it.
Have you consulted any cryptographers about confidence in its
security?

Have you done any side channel analysis of hc128?  A quick glance at
it suggests it uses only addition, bitwise operations, constant
rotation, and data-independent memory references, which is promising,
but I haven't looked closely.  We ought to avoid adopting any new
crypto that has known or predictable side channel attacks.

Some other little comments:

> +void
> +hc128_init(hc128_state_t *state, const uint8_t *key, const uint8_t *iv)
> +{
> +     unsigned int i;
> +     uint32_t w[1280], *p = state->p, *q = state->q;

5 KB on the stack is a lot!  Granted, this is a leaf routine which in
our case will be called only in a softint handler, but still.

> +void
> +hc128_extract(hc128_state_t *state, uint8_t *stream)
> +{
> +     register uint32_t ret;
> +
> +     uint16_t i = state->i;
> +     state->i = (i + 1u) & 1023u;
> +
> +     ret = (i < 512) ? round_expression(state->p, state->q, g1, i) :
> +                       round_expression(state->q, state->p, g2, m512(i));
> +
> +     unpack_littleendian(ret, stream);
> +}

For the purposes of a PRNG, conversion to little-endian is not
necessary.  Would be nice for hc128_extract to just return uint32_t,
too.

> @@ -72,6 +74,13 @@ static void        cprng_strong_rngtest(struct 
>  
>  static rndsink_callback_t    cprng_strong_rndsink_callback;
>  
> +percpu_t *percpu_cprng_fast_ctx;
> +static int cprng_fast_initialized;
> +
> +static void cprng_fast_randrekey(cprng_fast_ctx_t *);
> +
> +void *cprng_fast_rekey_softintr = NULL;
> +

Forward declaration of cprng_fast_randrekey should go among the other
forward declarations, not mixed up among the global state.

>  void
>  cprng_init(void)
>  {

Can we put the creation of kern_cprng and the initialization of
cprng_fast into cprng_init?  If not, there should be a comment
explaining why not.

> @@ -103,10 +112,11 @@ cprng_counter(void)
>               return cpu_counter32();
>  #endif
>       if (__predict_false(cold)) {
> +             static int ctr;
>               /* microtime unsafe if clock not running yet */
> -             return 0;
> +             return ctr++;
>       }
> -     microtime(&tv);
> +     getmicrotime(&tv);
>       return (tv.tv_sec * 1000000 + tv.tv_usec);
>  }

This is an unrelated change, so it should be a separate commit.

> @@ -542,31 +560,145 @@ sysctl_kern_arnd(SYSCTLFN_ARGS)
>       void *v;
>       struct sysctlnode node = *rnode;
>  
> -     if (*oldlenp == 0)
> +     switch (*oldlenp) {
> +         case 0:
>               return 0;
> +         default:
> +             if (*oldlenp > 256) {
> +                     return E2BIG;
> +             }
> +             v = kmem_alloc(*oldlenp, KM_SLEEP);
> +             cprng_fast(v, *oldlenp);
> +             node.sysctl_data = v;
> +             node.sysctl_size = *oldlenp;
> +             error = sysctl_lookup(SYSCTLFN_CALL(&node));
> +             kmem_free(v, *oldlenp);
> +             return error;
> +     }
> +}

I don't see an obvious difference in logic here.  Was there any
substantive change to sysctl_kern_arnd buried in the reformatting?
Either way, reformatting should be a separate commit.

> +static void
> +cprng_fast_randrekey(cprng_fast_ctx_t *ctx)
> +{
> +     uint8_t key[16], iv[16];
> +     hc128_state_t tempstate;
> +     int s;
> +
> +     int have_initial = rnd_initial_entropy;
> +
> +     cprng_strong(kern_cprng, key, sizeof(key), FASYNC);
> +     cprng_strong(kern_cprng, iv, sizeof(iv), FASYNC);

Why generate the IV randomly?  Why not a counter?  Does the IV have
any requirements other than that it never be reused with the same key,
i.e. is it anything other than a nonce?

> +     /* Rekey the hc128 state - expensive, don't do this at splhigh.  */
> +     hc128_init(&ctx->hc128, key, iv);
> +     explicit_memset(key, 0, sizeof(key));
> +     explicit_memset(iv, 0, sizeof(iv));
> +
> +     s = splhigh();
> +     memcpy(&ctx->hc128, &tempstate, sizeof(tempstate));
> +     splx(s);

Are there actually any callers of cprng_fast at IPL_HIGH?  Are there
actually any legitimate random decisions to be made at IPL_HIGH?  I'm
sceptical.

Our libkern arc4random works only at IPL_VM, and our random(3) looks
like it is not actually MP-safe in the kernel, so if we do have any
cases of that then they're broken to begin with.

> +void
> +cprng_fast_init(void)
> +{
> +        percpu_cprng_fast_ctx = percpu_alloc(sizeof(cprng_fast_ctx_t));
> +        percpu_foreach(percpu_cprng_fast_ctx, cprng_fast_init_ctx, NULL);
> +     cprng_fast_initialized++;

Blank line after opening brace.  Tabs/spaces.  No need for
cprng_fast_initialized to be an int; a bool suffices and is clearer.

> +size_t
> +_cprng_fast_exact(void *p, size_t len)
> +{

This should KASSERT(len <= CPRNG_MAX_LEN) or something so nobody is
tempted to generate obscene amounts of data at once (at IPL_HIGH or
IPL_VM or whatever it turns out to be).

> +     ctx->numbytes += len;

Check for arithmetic overflow and saturate.

> +size_t
> +_cprng_fast_inexact(void *p, size_t len)
> +{
> +     uint8_t *pc = p;
> +     uint32_t *pi = p, tmp, *iter;
> +     int s;
> +     size_t initial_len, aligned_len, final_len, main_len;
> +     cprng_fast_ctx_t *ctx = percpu_getref(percpu_cprng_fast_ctx);
> +
> +     KASSERT(cprng_fast_initialized);
> +
> +     initial_len = sizeof(uint32_t) - ((uintptr_t)pc % sizeof(uint32_t));
> +     aligned_len = len - initial_len;
> +     final_len = aligned_len % sizeof(uint32_t);
> +     main_len = aligned_len - final_len;
> +
> +     main_len /= sizeof(uint32_t);

You don't need to subtract final_len from main_len before dividing,
and I'm having a little trouble following the arithmetic.  This should
work:

uint8_t *p8 = p;
uint32_t *p32 = (uint32_t *)roundup2((uintptr_t)p, sizeof(uint32_t));
size_t initial_len = (uint8_t *)p32 - p8;
size_t main_len = (len - initial_len) / sizeof(uint32_t);
size_t final_len = (len - initial_len) % sizeof(uint32_t);

KASSERT(len == (initial_len + (main_len * sizeof(uint32_t)) + final_len));
KASSERT(initial_len < sizeof(uint32_t));
KASSERT(final_len < sizeof(uint32_t));

> +     if (initial_len) {
> +             hc128_extract(&ctx->hc128, (uint8_t *)&tmp);
> +             memcpy(pc, &tmp, initial_len);
> +             pi = (uint32_t *)pc;
> +     }
> +
> +     for (iter = pi; iter < pi + main_len ; iter++) {
> +             hc128_extract(&ctx->hc128, (uint8_t *)iter);
> +     }
> +
> +     if (final_len) {
> +             hc128_extract(&ctx->hc128, (uint8_t *)&tmp);
> +             memcpy(pi + main_len, &tmp, final_len);
> +     }

You never aligned pi or iter.  Given the alternative header I
suggested, this should be:

if (initial_len) {
        hc128_extract(&ctx->hc128, (uint8_t *)&tmp);
        memcpy(p8, &tmp, initial_len);
}
while (main_len--)
        hc128_extract(&ctx->hc128, p32++);
if (final_len) {
        hc128_extract(&ctx->hc128, (uint8_t *)&tmp);
        memcpy(p32, &tmp, final_len);
}

> +     ctx->numbytes += len;

Check for arithmetic overflow and saturate.

> @@ -1994,6 +1993,10 @@ nfs_getxid(void)
>  {
>       u_int32_t newxid;
>  
> +     if (__predict_false(nfs_xid == 0)) {
> +             nfs_xid = cprng_fast32();
> +     }
> +

Urk.  I guess this can't go early because it needs to happen after
entropy is available, but it may be that none is available until root
has been mounted, and if you have root on nfs...  Can you do this in a
separate change first, at least?  We ought to be able to test this
independently from the cprng_fast changes.

> Index: sys/cprng.h
> ===================================================================
> RCS file: /cvsroot/src/sys/sys/cprng.h,v
> retrieving revision 1.9
> diff -u -p -r1.9 cprng.h
> --- sys/cprng.h       17 Jan 2014 02:08:56 -0000      1.9
> +++ sys/cprng.h       17 Apr 2014 03:17:19 -0000
> @@ -41,42 +41,91 @@
>  #include <sys/rnd.h>         /* XXX users bogusly transitively need this */
>  
>  #include <crypto/nist_ctr_drbg/nist_ctr_drbg.h>
> +#include <crypto/hc128/hc128.h>
> +#include <sys/percpu.h>
> +#include <sys/intr.h>
>  
>  /*
>   * NIST SP800-90 says 2^19 bytes per request for the CTR_DRBG.
>   */
>  #define CPRNG_MAX_LEN        524288
>  
> +#define CPRNGF_MAXBYTES           (512 * 1024 * 1024)
> +#define CPRNGF_HARDMAX            (1 * 1024 * 1024 * 1024)

CPRNGF_HARDMAX seems to be unused.  Where does 512 MB come from for
CPRNGF_MAXBYTES?

> +#define CPRNGF_RESEED_SECONDS     600
> +
> +typedef struct  {
> +        hc128_state_t   hc128;
> +        int             numbytes;

Should be unsigned int, not int.

> +        time_t          nextreseed;
> +} cprng_fast_ctx_t;
>...
> -static inline uint32_t
> -cprng_fast32(void)
> +static inline uint32_t cprng_fast32(void)

Break line as before.

> -static inline uint64_t
> -cprng_fast64(void)
> +static inline uint64_t cprng_fast64(void)

Break line as before.

>  {
> -     uint64_t r;
> -     _arc4randbytes(&r, sizeof(r));
> -     return r;
> +     uint64_t ret;
> +     extern percpu_t *percpu_cprng_fast_ctx;
> +     cprng_fast_ctx_t *ctx = percpu_getref(percpu_cprng_fast_ctx);
> +     int s;
> +
> +     _cprng_fast_checkrekey(ctx);
> +
> +     s = splhigh();
> +     hc128_extract(&ctx->hc128, (uint8_t *)&ret);
> +     hc128_extract(&ctx->hc128, (uint8_t *)(((uint32_t *)&ret) + 1));

Avoid pointer-casting tricks here.  It falls afoul of strict aliasing
rules to access the parts of a uint64_t via uint32_t.  Prefer

hc128_extract(&ctx->hc128, (uint8_t *)&ret);
hc128_extract(&ctx->hc128, (uint8_t *)&ret + 4);

or a union with a buffer, or change hc128_extract to return a uint32_t
and assemble it with shift and or.

> +     ctx->numbytes += sizeof(uint64_t);

Check for arithmetic overflow and saturate.


Home | Main Index | Thread Index | Old Index