Spear-Phishing defence, the US census, punycode, and certstream
Want to try something fun? Head here and click on 'open the firehose'. This is certstream, a real-time stream fed from the Certificate Transparency logs. You can watch SSL certificates scroll by as they are issued and logged. OK, I'll let you ooh and aah over that for a second.
Now, let's talk about spear-phishing. In essence, I entice you to go to a domain in some fashion, usually by tricking you into thinking it is something else. For example, if your email is with 'gmail', I might register a domain called 'gma1l' (swapping the i for a 1 because they are visually similar), and then send you an email with a link to it. It's surprisingly effective. People's risk-avoidance turns off fairly quickly when they are presented with a few facts that would be 'hard to know' (e.g. your home address, your mother's maiden name, etc).
Now let's talk about the US Census (these are related... trust me). In the early censuses, spelling was far from consistent, so the same family name could be written down several different ways. And thus you ended up not knowing who was related to whom. The fix was an algorithm called Soundex, a method that reduces a name to a short code so that names which sound similar get the same code. Wikipedia has more info.
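If you want to see what that looks like in practice, here is a minimal sketch of the commonly described American Soundex rules in Python. It is my own illustration, not the census bureau's code, and the function name soundex is just a name I picked: keep the first letter, turn the remaining consonants into digits, drop vowels and repeats, and pad to four characters.

```python
# Digit assigned to each consonant group in American Soundex.
CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(word):
    """Return the 4-character Soundex code for an ASCII word ("" if no letters)."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    result = first.upper()
    prev = CODES.get(first, "")  # the first letter's code still suppresses repeats
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate two letters with the same code
        code = CODES.get(c, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it counts again
    return (result + "000")[:4]


if __name__ == "__main__":
    for name in ("Smith", "Smyth", "Robert", "Rupert"):
        print(name, soundex(name))
```

Smith and Smyth land on the same code, as do Robert and Rupert, which is exactly the property you want when matching badly spelled names.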
So how would you protect against spear-phishing? Let's say you had a list of domains at risk (your own domain, your CRM, a few others). You might watch DNS requests, and watch certstream, for names whose Soundex code matches one on your list. You would pat yourself on the back, and head home happy because you solved this awful security risk. Not so fast. You achieved only a bit, and you need to stay after school. You see, in my example above, I *visually* tricked you. <G><M><A><ONE><L> doesn't sound at all like <G><M><A><I><L>. But it looks the same. So Soundex didn't really help.
Now enter another dimension. Punycode. You see, a few years ago, the powers that be started to bemoan the US-ASCII nature of domain names. Why couldn't they have accents? Other characters? So domain names were allowed to contain Unicode, with Punycode as the ASCII encoding (the 'xn--' names) that DNS actually carries. And this has introduced a whole new area of risk. You see, it's quite possible to find a letter in another alphabet that looks the same. Wired did a great article on this problem, called an Ecce Homograph. We can create an alias for this site (donbowman.ca) as (ԁоɴЬоѡⅿаɴ.ϲа), and register that second domain name, and trick you, yes you, gentle reader. And Soundex is no match, and neither is the mighty regex. Perhaps we should just despair and hand over our info?
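To make that concrete, here is a small sketch using only Python's standard library. It encodes a lookalike of 'gmail.com' (with a Cyrillic 'а', U+0430, in place of the Latin 'a') into its ASCII form, and applies a crude mixed-script check. The helper names (to_ascii, scripts_used, is_mixed_script) are my own, and guessing the script from the first word of the Unicode character name is a rough heuristic I am assuming for illustration, not an official confusables test.

```python
import unicodedata


def to_ascii(domain):
    """Encode an internationalised domain into the ASCII form that DNS carries."""
    return domain.encode("idna").decode("ascii")


def scripts_used(label):
    """Rough script guess: the first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }


def is_mixed_script(domain):
    """True if any single label mixes letters from more than one script."""
    return any(len(scripts_used(label)) > 1 for label in domain.split("."))


if __name__ == "__main__":
    lookalike = "gm\u0430il.com"  # U+0430 is CYRILLIC SMALL LETTER A
    print(to_ascii(lookalike))         # an "xn--" name, nothing like "gmail"
    print(scripts_used("gm\u0430il"))  # both LATIN and CYRILLIC show up
    print(is_mixed_script(lookalike), is_mixed_script("gmail.com"))
```

A check like this catches a name that mixes alphabets inside one label, but a name built entirely from one foreign alphabet sails straight through, so it is at best a partial defence.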
Well, what if we could come up with a technique? If we turn to the domain of Deep Learning (machine learning), we find that they have been working on image recognition for quite some time. It even got a fantastic on-screen treatment in the mainstream, on HBO. You see, using an algorithm like a Convolutional Neural Network (CNN), we can ask "is this like, or not like, what we trained on?" In the HBO series 'Silicon Valley' this at first looked very poor (Shazam for food), but turned out to be very valuable (a dick-pic detector for Snap). And similarly here, it's valuable for us. The question "my domain or not my domain" is what we want answered. So, perhaps, the way to solve this problem is to watch the certstream, take each resulting domain, render it as a picture, run it through the deep learning algorithm, and see if it matches.
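Here is a rough sketch of the plumbing around that idea, with the interesting part (render as a picture, ask the CNN) deliberately left as a pluggable function. Everything named here is an assumption for illustration: the message shape (message_type, data.leaf_cert.all_domains) is what I understand certstream to emit, and domains_in, suspicious and string_similarity are hypothetical helpers. The string_similarity stand-in uses difflib from the standard library, which is plain text similarity and NOT the image-based model described above.

```python
from difflib import SequenceMatcher


def domains_in(message):
    """Pull the domain names out of one certstream-style message (shape assumed)."""
    if message.get("message_type") != "certificate_update":
        return []
    return message.get("data", {}).get("leaf_cert", {}).get("all_domains", [])


def suspicious(message, protected, similarity, threshold=0.8):
    """Return (seen, ours) pairs where a newly seen domain resembles a protected one.

    `similarity` is any callable returning a score in [0, 1]. In the real
    pipeline it would render both names as images and ask a trained model.
    """
    hits = []
    for seen in domains_in(message):
        for ours in protected:
            if seen != ours and similarity(seen, ours) >= threshold:
                hits.append((seen, ours))
    return hits


def string_similarity(a, b):
    """Placeholder scorer: character-level similarity, not visual similarity."""
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    sample = {
        "message_type": "certificate_update",
        "data": {"leaf_cert": {"all_domains": ["gma1l.com", "example.org"]}},
    }
    print(suspicious(sample, ["gmail.com"], string_similarity))
```

Swapping string_similarity for a function that rasterises both names and scores them with a trained classifier is where the real work lies, and it is the piece that would actually catch the homograph case.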