The case of the inconsistent SSL: AES-NI broken? Bad CPU?
New modest-power desktop for the office, and once in a while I'm getting intermittent encryption errors (e.g. SSL, SSH). Here's an example:
$ docker pull ubuntu:18.04
18.04: Pulling from library/ubuntu
Digest: sha256:5f4bdc3467537cbbe563e80db2c3ec95d548a9145d64453b06939c4592d67b6d
Status: Image is up to date for ubuntu:18.04
$ docker pull ubuntu:18.04
Error response from daemon: Get https://registry-1.docker.io/v2/: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "Amazon")
Huh.
So I busted out the OpenSSL source, built it, and ran ‘make test’. Same intermittent failures (in 20-test_enc_more.t). I compared against my Haswell server and my Sandy Bridge desktop: no problems. I tried it on my Kaby Lake system: no problems. It's just the Skylake-HQ.
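Because the failures are intermittent, a single pass of the test suite says little; what matters is the failure rate over many runs. A minimal sketch of a repeat-and-count harness, assuming only that a failed run exits non-zero. The script name flaky.py is hypothetical, and extra_env is an illustrative hook for adding environment variables to each run.

```python
import os
import subprocess
import sys


def failure_rate(cmd, runs, extra_env=None):
    """Run cmd `runs` times and return the fraction of runs that exit non-zero."""
    # Inherit the current environment, optionally overriding some variables.
    env = {**os.environ, **(extra_env or {})}
    failures = 0
    for _ in range(runs):
        result = subprocess.run(cmd, env=env,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        if result.returncode != 0:
            failures += 1
    return failures / runs


if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage (hypothetical script name): python flaky.py RUNS CMD [ARGS...]
    runs = int(sys.argv[1])
    print(f"{failure_rate(sys.argv[2:], runs):.0%} of {runs} runs failed")
```

For example, python flaky.py 50 make test TESTS=test_enc_more from the OpenSSL build directory. The TESTS= selector is an assumption about OpenSSL's test harness, so check the test README for your version. Running the same command on the Haswell, Sandy Bridge and Kaby Lake machines gives a like-for-like comparison.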
So let’s try a little experiment (see OPENSSL_ia32cap):
export OPENSSL_ia32cap="~0x200000200000000"
Now the tests pass.
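To see exactly what that mask switches off, here is a small decoder. It assumes the layout described in the OPENSSL_ia32cap manual page: the low 32 bits mirror CPUID.1:EDX and the high 32 bits mirror CPUID.1:ECX, and a leading "~" clears the listed bits. The feature table is a partial, illustrative subset.

```python
# Partial, illustrative table of CPUID.1:ECX feature bits (not exhaustive).
ECX_FEATURES = {
    1: "PCLMULQDQ",
    9: "SSSE3",
    19: "SSE4.1",
    20: "SSE4.2",
    25: "AES-NI",
    28: "AVX",
    30: "RDRAND",
}


def masked_features(env_value: str) -> list:
    """List the capability bits a '~0x...' OPENSSL_ia32cap value clears."""
    if not env_value.startswith("~"):
        raise ValueError("expected a '~'-prefixed mask")
    mask = int(env_value[1:], 16)  # int() accepts the 0x prefix with base 16
    names = []
    for bit in range(64):
        if (mask >> bit) & 1:
            if bit >= 32:  # high word mirrors CPUID.1:ECX
                names.append(ECX_FEATURES.get(bit - 32, f"ECX bit {bit - 32}"))
            else:  # low word mirrors CPUID.1:EDX
                names.append(f"EDX bit {bit}")
    return names


print(masked_features("~0x200000200000000"))  # ['PCLMULQDQ', 'AES-NI']
```

So the experiment disables both AES-NI and PCLMULQDQ (the carry-less multiply that GCM's GHASH uses), and passing tests implicate one or both. By the same layout, ~0x200000000000000 would clear only the AES-NI bit, which could help tell the two apart.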
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 94
model name : Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
stepping : 3
microcode : 0xc2
The processor is an i7-6700HQ, which has the AES-NI instruction set. Is that processor/revision broken? Are there errata? Is my chip damaged?
Inquiring minds.
Anyone got a suggestion? Has my chip been hacked to render its encryption weak and this is a by-product? Is my tinfoil hat aligned right?
I once had a PC that seemingly worked well but could not reliably gzip large files. It turned out to be flaky memory, even though all the memory self-tests passed. Are you overclocking by any chance?
RAM tests pass. This one is not overclocked. As far as I can tell, it reliably affects only the crypto. I tried heat and it didn't make it worse, so I don't think it's electrical; I think it's functional, in the (newish) AES-NI instruction set.
How does anyone feel about SKL121 in https://www.intel.co.uk/content/dam/www/public/us/en/documents/specification-updates/desktop-6th-gen-core-family-spec-update.pdf ? The ucode is 0xc2, and my read is that this should work around it, but the spec update talks about a BIOS update (as a vehicle for the ucode? as another workaround?).
>> microcode updated early to revision 0xc2, date = 2017-11-16
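The dmesg line shows the revision loaded at boot. /proc/cpuinfo shows the revision actually running and whether the kernel advertises the aes and pclmulqdq flags. A minimal parsing sketch follows. It only reports these values and says nothing about whether revision 0xc2 addresses SKL121; cpu_summary is a hypothetical helper name.

```python
import os


def cpu_summary(cpuinfo_text: str) -> dict:
    """Summarise the first processor block of /proc/cpuinfo-style text."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first processor block
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    flags = fields.get("flags", "").split()
    microcode = fields.get("microcode")
    return {
        "model name": fields.get("model name"),
        # The kernel prints the running revision in hex, e.g. "0xc2".
        "microcode": int(microcode, 16) if microcode else None,
        "aes": "aes" in flags,
        "pclmulqdq": "pclmulqdq" in flags,
    }


if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as f:
        print(cpu_summary(f.read()))
```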
I'm currently struggling with a similar problem that started around June 2018. Back then I did a series of upgrades at home on my i7-6800K running openSUSE Leap, and changed my provider (and thus my cable modem) at the same time. Ever since, I have been seeing intermittent "bad record MAC" errors during SSL handshakes (captured in Wireshark), and once a kernel panic during boot due to a failed FIPS check in OpenSSL.
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Core(TM) i7-6800K CPU @ 3.40GHz
stepping : 1
microcode : 0xb00001c
I rolled back OpenSSL and the BIOS firmware and removed the ucode-intel package from my system (to prevent microcode patches from being loaded), but nothing seemed to change. A series of tests with openssl (using specific ciphers that use AES-NI) against a server on the same LAN (to eliminate the cable provider as a factor) shows that with OPENSSL_ia32cap="~0x200000200000000" the error never occurs. This has me suspecting that the AES-NI unit on my CPU is misbehaving.
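A complementary check that needs no server is a differential test. AES-CBC with a fixed key and IV is deterministic, so the default (AES-NI) path and the software path selected by the mask must produce byte-identical ciphertext, and any difference points at the hardware path. A minimal sketch follows. It assumes the openssl CLI is on PATH and that enc with -K/-iv writes raw ciphertext without a salt header. The key and IV are arbitrary hypothetical values, and only one cipher mode is exercised, unlike the make test suites.

```python
import os
import shutil
import subprocess

# Mask from the experiments above; it clears AES-NI and PCLMULQDQ.
DISABLE_AESNI = {"OPENSSL_ia32cap": "~0x200000200000000"}


def run_once(cmd, data, extra_env):
    """Run cmd with data on stdin and return its stdout bytes."""
    env = dict(os.environ)
    env.pop("OPENSSL_ia32cap", None)  # start from OpenSSL's default detection
    env.update(extra_env)
    return subprocess.run(cmd, input=data, stdout=subprocess.PIPE,
                          env=env, check=True).stdout


def count_mismatches(cmd, data, runs):
    """Compare `runs` default-path outputs against one software-path reference."""
    reference = run_once(cmd, data, DISABLE_AESNI)
    return sum(run_once(cmd, data, {}) != reference for _ in range(runs))


if __name__ == "__main__" and shutil.which("openssl"):
    # Hypothetical fixed key/IV; with a deterministic mode every run must match.
    cmd = ["openssl", "enc", "-aes-128-cbc",
           "-K", "000102030405060708090a0b0c0d0e0f",
           "-iv", "00000000000000000000000000000000"]
    payload = os.urandom(1 << 20)  # 1 MiB of random plaintext
    print("mismatches:", count_mismatches(cmd, payload, runs=20))
```

A non-zero count on the i7-6800K but zero on a known-good machine would be fairly direct evidence against the AES-NI path, independent of the network.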
I ran the openssl ‘make test’ you suggested. On my system it fails elsewhere: 80-test_ssl_new.t and 80-test_ssl_old.t. Setting OPENSSL_ia32cap="~0x200000200000000" allows those tests to pass.
All the other devices I own with Intel CPUs work fine, but none of them have AES-NI.
I have been wondering if my CPU “broke” back in June. Have you replaced yours? It would seem like an expensive experiment, but considering all the time I have spent trying to narrow the problem down…
I sent mine back home to China and replaced it with an AMD 2700X.
Interesting. Sadly, that OpenSSL environment variable is not a full fix, since some languages (e.g. Go) use their own implementations and some programs link static libraries (mbedTLS, …).
I suppose you could try some ‘freeze spray’: just spray the CPU with a keyboard duster bottle or actual freeze spray while running the test. If there's a change, it's definitely hardware.
I think it's functional. I wonder whether a 4-year-old version of the library would work, or a build with an older compiler. That might point to a pipeline issue, or an instruction optimisation, or…