tech-crypto archive

Re: Changes to make /dev/*random better sooner



Thor Lancelot Simon <tls%panix.com@localhost> wrote:
> On Wed, Apr 09, 2014 at 02:43:23AM +0100, Mindaugas Rasiukevicius wrote:
> >
> > This is very positive work for the cases when the system is used as a
> > server or for other workloads which may involve cryptographic activity.
> > However, you seem to assume that aggressive entropy collection is
> > always preferable even if it has extra costs.
> 
> Can you give an example of this?  How exactly will the changes on this
> branch measurably impact performance?  Can you demonstrate with a
> benchmark?
> 
> > Please make this tunable.  Also, please make this tunable *before* the
> > merge, otherwise this will just never happen.  Like the cprng_fast/strong
> > functions were never optimised to avoid the spin-lock (at least they no
> > longer use IPL_HIGH, which should have been considered simply a bug).
> 
> As I believe my benchmarks at the time amply demonstrated, the
> cprng_strong function is considerably more efficient than the entropy
> extraction code it replaced.  Can you please explain how this is an
> example of my supposed disdain for performance?
> 
> Would you prefer the system to be slower, as it was, when user processes
> ask for entropy, rather than faster, as I demonstrably made it?
> <...>

Perhaps I missed it, but were the benchmark results published?  Also, how
about cprng_fast32/64()?  If the current mechanism is indeed faster, then
I am glad that you have made an improvement.

> > Few fragments which caught my eye while skimming through the diff..
> > 
> > >  #if defined(__HAVE_CPU_COUNTER)
> > > - if (cpu_hascounter())
> > > -         return (cpu_counter32());
> > > + if (cpu_hascounter() && sizeof(cpu_counter() == sizeof(uint64_t))) {
> > > +         return (cpu_counter());
> > > + }
> > >  #endif
> > 
> > ??
> 
> We provide no MI API for obtaining a counter value of any known size
> except 32 bits, unfortunately.  The instrumentation I added <...>

Thanks for the explanation, but I was trying to point out the bug: the
closing parenthesis is misplaced, so sizeof() is applied to the whole
comparison (which is always non-zero) instead of to cpu_counter(), and the
intended 64-bit size check is a no-op.
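
Presumably the intended check was along these lines (an illustrative
correction only, not a tested patch):

  #if defined(__HAVE_CPU_COUNTER)
          /* Only take this path if the counter really is 64 bits wide. */
          if (cpu_hascounter() && sizeof(cpu_counter()) == sizeof(uint64_t)) {
                  return (cpu_counter());
          }
  #endif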

> > > -         rnd_add_uint32(&skewsrc, rnd_counter());
> > > -         callout_schedule(&skew_callout, hz);
> > > +         rnd_add_uint64(&skewsrc, rnd_counter());
> > > +         callout_schedule(&skew_callout, hz / 10);
> > 
> > Do we really need to run 10 extra callouts per second to get an extra
> > bit?
> 
> Are you seriously telling me that you are concerned that adding 9 callouts
> per second will measurably impact system performance?
> 
> I would submit that the change I made in my previous work on rnd, to which
> you now seem to be retroactively objecting, to eliminate the constant use
> of 1-tick callouts by the entropy addition code, almost certainly more
> than makes up for any performance impact caused by adding 9 callouts per
> second here.

It is a systemic problem.  Everybody thinks that 10 extra callouts per
second are peanuts, or that an extra worker kthread does not matter, etc.
But what does it all add up to?  There are better solutions, but the quick
ad hoc ones tend to win.  See below.

> 
> > I did not investigate, but we already have plenty of existing callouts.
> > Is flipflop logic used with them, or could it be?  Or deltas between
> > *all* callouts in the system?
> 
> Do you really think that this kind of invasive change to the internals
> of the callout mechanism will be cheaper?

Absolutely.  Instead of spending cycles on extra callout processing and
software interrupt scheduling (which probably cost more than your skew
function), it would rely on the fact that other subsystems already schedule
callouts, so you could merely perform your skew sampling from those.  Of
course, it would have to be done properly and it would need an efficient
entropy collection mechanism.
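
To make the idea concrete, a rough sketch of what I mean; the hook name
and its placement in the callout dispatch path are hypothetical, not an
existing interface:

  /*
   * Called once per batch of callouts that the system was going to run
   * anyway, instead of scheduling a dedicated hz/10 callout just for
   * entropy.  One cheap counter read, no extra softint traffic.
   */
  static void
  callout_entropy_hook(void)
  {
          rnd_add_uint64(&skewsrc, rnd_counter());
  }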

> > > +static void kprintf_rnd_get(size_t bytes, void *priv)
> > > +{
> > > + if (mutex_tryenter(&kprintf_mtx)) {
> > > +         if (kprnd_added) {
> > > +                 SHA512_Final(kprnd_accum, &kprnd_sha);
> > > +                 rnd_add_data(&rnd_printf_source,
> > > +                              kprnd_accum, sizeof(kprnd_accum), 0);
> > > +                 kprnd_added = 0;
> > > +                 /* This, we must do, since we called _Final. */
> > > +                 SHA512_Init(&kprnd_sha);
> > > +                 /* This is optional but seems useful. */
> > > +                 SHA512_Update(&kprnd_sha, kprnd_accum,
> > > +                               sizeof(kprnd_accum));
> > > +         }
> > > +         mutex_exit(&kprintf_mtx);
> > > + }
> > > +}
> > 
> > Pre-checking kprnd_added first, rather than locking/unlocking the mutex
> > all the time, would be a start... but how about not performing SHA-512
> > while holding an IPL_HIGH mutex?  Not only embedded systems, but perhaps
> > our VAX or m68k users would also prefer not to have micro-freezes every
> > second while printing to the console.
> 
> Glad to move the check to kprnd_added.  Thanks.
> 
> However, can you actually demonstrate that this purported effect is
> perceptible to humans?  We're talking about hashing *bytes of output
> per minute* in the long run on most systems and perhaps a few hundred
> bytes per second in extreme cases.
> 
> SHA512 may be slow, but even on ancient, slow CPUs, it does a lot better
> than a few hundred bytes within the human threshold of perception of
> several milliseconds.
> 
> I cannot perceive the effect of this code run in unaccelerated QEMU
> on an old, slow laptop, FWIW; I tested it in this way just to be sure.
> 
> I'm sure it is perceptible if the system is stuck in a tight loop
> issuing printfs, but the system is *already* unusable in that case
> because it is stuck at IPL_HIGH.  Note that to ensure debugging is
> not impeded by this code it doesn't fire if we are printing from DDB.

You are right that it is not a big deal and that flood printing to the
console already sucks (it can be made more efficient, but that is a
separate issue).  However, it just does not need to be under IPL_HIGH at
all: you can easily manage an append-only array without locks (fill, hash,
reset).  Nor does it really need a callout; why not just uvm_scheduler()?
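
To illustrate the append-only idea, here is a minimal sketch; the names,
the buffer size, and the assumption of a single consumer in thread context
are mine, and the reset/append race on overflow is deliberately glossed
over:

  #define KPRND_BUFSZ     256

  static uint8_t          kprnd_buf[KPRND_BUFSZ];
  static volatile u_int   kprnd_fill;

  /* Producer, called from the printf path: no mutex, no SHA-512. */
  static inline void
  kprnd_append(uint8_t b)
  {
          u_int i = atomic_inc_uint_nv(&kprnd_fill) - 1;

          if (i < KPRND_BUFSZ)
                  kprnd_buf[i] = b;       /* excess samples are dropped */
  }

  /* Single consumer, in thread context: hash what accumulated, then reset. */
  static void
  kprnd_drain(void)
  {
          u_int n = MIN(kprnd_fill, KPRND_BUFSZ);

          if (n == 0)
                  return;
          SHA512_Update(&kprnd_sha, kprnd_buf, n);
          kprnd_fill = 0;
  }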

The point I am trying to make is that these changes are added in various
parts of the kernel without enough thought.  Entropy collection can be
very opportunistic.  Such mechanisms can be very efficient because they do
not have the constraints imposed by the requirements of determinism.
Currently, it is neither efficient nor well written.  I am not blaming
you or anybody else.  As you pointed out, mistakes were made in the past
and your patches just happen to expose some of that.  However, it is a
little bit disappointing to see ad hoc patches and effort put elsewhere,
rather than getting the main interfaces and mechanisms solid first.

I always thought that security software ought to be of above-average
quality.  Having it below average is counterproductive; it just negates
the purpose.  If we really cannot do better than that, then it is just not
worth bothering.  One can close one's eyes and just forget about this.
The system will not break because an attacker performs some sophisticated
mathematical calculations and simulations to work out your random numbers.
It will break because of a silly buffer overflow, a missing boundary
check, a bracket in the wrong place, or something equally simple and
boring, simply because that is the weak link in the chain and takes much
less effort to attack.

-- 
Mindaugas

