The case of the inconsistent SSL: AES-NI broken? Bad CPU?

I have a new modest-power desktop for the office, and once in a while it gives me intermittent encryption errors (e.g. SSL, SSH). Here's an example:

 

$ docker pull ubuntu:18.04
18.04: Pulling from library/ubuntu
Digest: sha256:5f4bdc3467537cbbe563e80db2c3ec95d548a9145d64453b06939c4592d67b6d
Status: Image is up to date for ubuntu:18.04
$ docker pull ubuntu:18.04
Error response from daemon: Get https://registry-1.docker.io/v2/: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "Amazon")

Huh.

Bust out the openssl source, build it, run ‘make tests’. Same intermittent failures (on 20-test_enc_more.t). Compare against my Haswell server and my Sandy Bridge desktop: no problems. Try it on my Kaby Lake system: no problems. It's just the Skylake-HQ.

So let’s try a little experiment (see OPENSSL_ia32cap):

export OPENSSL_ia32cap="~0x200000200000000"

Now the tests pass.
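To see what that mask actually switches off, here is a minimal Python sketch that lists the bits set in it. The bit meanings are my reading of the OPENSSL_ia32cap documentation (bits 32 and up mirror CPUID.1:ECX, so bit 33 is PCLMULQDQ and bit 57 is AES-NI); the `KNOWN_BITS` table and the `decode_mask` helper are names made up for illustration, not part of OpenSSL.

```python
# Assumed bit meanings, taken from the OPENSSL_ia32cap documentation:
# bits 32..63 of the capability vector mirror CPUID.1:ECX.
KNOWN_BITS = {
    33: "PCLMULQDQ (carry-less multiply, used by AES-GCM)",
    57: "AES-NI",
}


def decode_mask(mask):
    """Return a list of (bit, name) pairs for every bit set in mask."""
    found = []
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            found.append((bit, KNOWN_BITS.get(bit, "not in this table")))
        bit += 1
    return found


if __name__ == "__main__":
    # The leading "~" in the environment variable asks OpenSSL to clear
    # these bits, i.e. to behave as if the CPU lacked the features.
    for bit, name in decode_mask(0x200000200000000):
        print("bit %d: %s" % (bit, name))
```

So the setting hides both AES-NI and PCLMULQDQ from OpenSSL. By itself it does not tell you which of the two is the one misbehaving.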

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 94
model name	: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
stepping	: 3
microcode	: 0xc2

The processor is an i7-6700HQ, which has the AES-NI instruction set. Is that processor/rev broken? Are there errata? Is my chip damaged?
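For anyone comparing machines, a small Python sketch along these lines can confirm whether a Linux box advertises the relevant instructions. It assumes the usual /proc/cpuinfo layout with a `flags` line and the Linux flag names `aes` and `pclmulqdq`; `cpu_flags` is a hypothetical helper. Note that it only shows what the CPU claims to support, not whether the instructions compute correct results.

```python
import os


def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line in cpuinfo_text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


if __name__ == "__main__":
    path = "/proc/cpuinfo"  # Linux only; assumed location
    if os.path.exists(path):
        with open(path) as f:
            flags = cpu_flags(f.read())
        for name in ("aes", "pclmulqdq"):
            print(name, "present" if name in flags else "absent")
```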

Inquiring minds.

Anyone got a suggestion? Has my chip been hacked to render its encryption weak and this is a by-product? Is my tinfoil hat aligned right?



Comments

5 Responses to “The case of the inconsistent SSL: AES-NI broken? Bad CPU?”

  1. Dave D

    I once had a PC that seemingly worked well but could not reliably do gzip of large files. Turned out to be flaky memory, even though all mem self-tests passed. Are you over-clocking by any chance?

    1. db

      ram tests pass. this one is not overclocked. it's reliably affecting the crypto only as far as i can tell. i tried heat and it didn't make it worse. so i think it's not electrical, i think it's functional in the (newish) aes-ni instruction set.

  2. db

    How does anyone feel about SKL121 in https://www.intel.co.uk/content/dam/www/public/us/en/documents/specification-updates/desktop-6th-gen-core-family-spec-update.pdf ?

    The ucode is 0xc2, and my read is that this should work around it, but the spec update talks about a BIOS update (as a vehicle for ucode? as another workaround?)

    >> microcode updated early to revision 0xc2, date = 2017-11-16

  3. Zyll

    Currently I’m struggling with a similar problem which started around June 2018. I did a series of upgrades at home on my i7-6800K running OpenSuSE Leap back then and changed my provider (and thus my cable modem) at the same time. Ever since, I have been experiencing flaky "bad MAC" errors during SSL handshakes (seen in Wireshark) and once a kernel panic during boot due to a failed FIPS check on openssl.

    processor : 0
    vendor_id : GenuineIntel
    cpu family : 6
    model : 79
    model name : Intel(R) Core(TM) i7-6800K CPU @ 3.40GHz
    stepping : 1
    microcode : 0xb00001c

    I rolled back the openssl and BIOS firmware and removed the ucode-intel package from my system (to forestall uploads of microcode patches) but nothing seemed to change. A series of tests with openssl (with specific ciphers that use AES-NI) against a server on the same LAN (to eliminate the cable provider factor) shows that with OPENSSL_ia32cap="~0x200000200000000" the errors never occur. This has me suspecting that the AES-NI unit on my CPU is misbehaving.
    I ran the openssl ‘make test’ you suggest. I see it failing elsewhere on my system: 80-test_ssl_new.t and 80-test_ssl_old.t. Setting OPENSSL_ia32cap="~0x200000200000000" allows the tests to pass.

    All other devices I own with Intel CPUs work fine but none of them have AES-NI.

    I have been wondering if my CPU “broke” back in June. Have you replaced yours? Would seem like an expensive experiment but seeing all the time I have spent typing to narrow the problem down…

  4. db

    I sent mine back home to China and replaced it w/ an AMD 2700X.

    Interesting. Sadly that openssl env variable is not a full fix since some languages (e.g. go) use their own implementation, some things use static libraries (mbedtls, …).

    i suppose you could try some ‘freeze spray’… Just spray the cpu w/ a keyboard duster bottle or actual freeze spray while running the test… if there’s a change, for sure it's hw.

    i think it's functional. I wonder whether a 4-yr old version of the library would work, or one compiled with an older compiler. that might point to a pipeline issue or instruction optimisation or…
