Earlier I talked about crypto-malware spread via popads, an advertising network that does popunder advertisements and uses a domain-generation algorithm as a means of avoiding blocking by e.g. adblock. In that post I postulated that one could use a neural network to detect the difference between ‘wwiqinsra.bid’ and ‘foobar.com’.

Well, I decided to give it a try. Head on over to my github repo and have a look. It’s far from perfect, so forks and pull requests are welcome.

In a nutshell, what I did was use the top 1M hosts from Cisco Umbrella and a dataset from Yhonay on GitHub. I split the Yhonay set in half, using half for training and half for testing. I took an equal number of entries from the top-1M for training and for testing.
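Here is a minimal sketch of that split, in plain Python. The function names and the seed handling are my own for illustration, not necessarily what the repo does; label 1 means DGA and 0 means benign.

```python
import random

def split_half(items, seed=0):
    """Shuffle a copy of items and split it into two halves."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def build_sets(dga_names, benign_names, seed=0):
    """Return (train, test) lists of (name, label) pairs."""
    dga_train, dga_test = split_half(dga_names, seed)
    # Draw as many benign names as DGA names, with no name used twice.
    wanted = len(dga_train) + len(dga_test)
    benign = random.Random(seed + 1).sample(list(benign_names), wanted)
    benign_train = benign[:len(dga_train)]
    benign_test = benign[len(dga_train):]
    train = [(n, 1) for n in dga_train] + [(n, 0) for n in benign_train]
    test = [(n, 1) for n in dga_test] + [(n, 0) for n in benign_test]
    return train, test
```

The point of sampling the benign names once and then slicing is that nothing in the test set has been seen during training.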

I used Keras (which sits on top of TensorFlow) and set it up as:

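 from keras.models import Sequential
 from keras.layers import Embedding, LSTM, Dropout, Dense, Activation

 # max_features = number of distinct characters (+1 for padding)
 # dataset['max_model_len'] = length every name is padded to
 model = Sequential()
 model.add(Embedding(max_features, 128, input_length=dataset['max_model_len']))
 model.add(LSTM(128))
 model.add(Dropout(0.5))
 # single sigmoid output: probability that the name is generated
 model.add(Dense(1))
 model.add(Activation('sigmoid'))
 model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])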
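
The Embedding layer wants each name as a fixed-length list of integers, one per character. A rough sketch of how that encoding might look follows; the helper names are hypothetical and the repo may well do it differently. Index 0 is reserved for padding, which is why max_features is the alphabet size plus one.

```python
def build_vocab(names):
    """Map each character seen in names to a positive integer (0 = padding)."""
    chars = sorted(set("".join(names)))
    return {c: i + 1 for i, c in enumerate(chars)}

def encode(name, vocab, max_len):
    """Turn a name into a list of ints of length max_len, left-padded with 0."""
    # Characters not in the vocabulary fall back to 0, same as padding.
    ids = [vocab.get(c, 0) for c in name[-max_len:]]
    return [0] * (max_len - len(ids)) + ids

names = ["wwiqinsra", "foobar"]
vocab = build_vocab(names)
max_features = len(vocab) + 1          # +1 for the padding index
max_len = max(len(n) for n in names)
encoded = [encode(n, vocab, max_len) for n in names]
```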

The code goes and fetches those two data sets (caching them locally between runs). It then trains itself (caching the model), and does the testing.

Does it work? In general yes. The model assesses its own accuracy @ 98%. My test w/ the segregated data disagrees w/ that a bit.
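
To see where the two numbers disagree, it helps to count the kinds of mistakes rather than look at a single accuracy figure. A small sketch, assuming you already have the model’s output probabilities for the held-out names (the function names here are mine):

```python
def confusion(probs, labels, threshold=0.5):
    """Count true/false positives/negatives; label 1 = DGA, 0 = benign."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for p, y in zip(probs, labels):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            counts["tp"] += 1
        elif pred == 1 and y == 0:
            counts["fp"] += 1      # real domain flagged as generated
        elif pred == 0 and y == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1      # generated domain let through
    return counts

def accuracy(counts):
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0
```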

Some names (e.g. dcybolsar) are scored as normal domains when they are popads ones, and some domains (e.g. vnecdn) are scored as popads ones when they are real. But, well, you could slap this model on your little ARM/MIPS based router w/ a bit of Python, and stop seeing ads. I’ll leave that part of the exercise to you.

Anyone have any ideas to improve it? Domain length? Adding the TLD (I deleted it from the model)? Etc.

https://www.cryptokitties.co/. You buy a virtual cat with virtual currency. And then, unlike regular cats, you pay to get them knocked up and have kittens. But it’s ok, it’s with virtual currency.

Is this the end-goal of the ‘I own nothing’ world we are entering, where it is uber + self-driving cars instead of owning, airbnb instead of buying, etc? Where we own virtual pets paid for by numbers we’ve never seen, created by people we’ve never seen? It’s very, um, ephemeral.

So, let’s look at a sample kitten. Unlike a real cat website (e.g. kittenwar.com … go there now, I demand it!), there’s not much in the way of pictures. This one, ‘Swampgreen!’ has this to say:

Bio
Ciao! I’m Swampgreen !. If you also can’t stand the smell of wet food,
we’re going to be fast friends. Honestly, eating lasagna is all I care
about at this stage in my life. I like your face.

Cattributes
accent colour purplehaze
eye shape slyboots
mouth grim
pattern amur
highlight colour swampgreen
base colour shadowgrey
eye colour mintgreen
fur ragamuffin

Swampgreen! is yours for the low low price of 0.0149 ETH. If we convert that to a nominal USD amount, that is $13.08. You’d be a fool not to, given that ‘Raga+3xMorning’ is going for 0.0798 ETH ($70.07).
Now the most expensive on here are asking 1M ETH (so about $1B), probably just hoping Jeff Bezos clicks by accident.
This is our life meow. https://www.reddit.com/r/Thisismylifemeow/

OK, fess up. Most of you run adblock(plus/adaware/pihole/…). And you love it. Other than Forbes, the world is a kinder gentler place.

But recently some of you may have noticed the battery life on your laptop is poor and your fans run more. You are a victim of drive-by crypto-mining (probably monero, but could be bitcoin or ethereum). O noes! How did this sneak in? Did you forget to update your adblock list? But it is up to date. Hmm.

Well, have a look at antipopads. Do you think this will scale for you? It seems the malvertising people have found the same Domain Generation Algorithm (DGA) that the botnets have been using for a decade. They rotate the domain name using a predictable (to them) algorithm, and yet stymie your /etc/hosts or adblock approach, which is dictionary based.

This slide (from Cisco) shows an example of how these domains might be generated. And each malware is different, meaning you can’t easily code the algorithm.
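
To make the idea concrete, here is a toy generator. It is purely illustrative and not the algorithm of any actual malware or ad network: the seed string, the hash choice and the letter mapping are all made up by me. The point is only that both sides can compute today’s names from a shared secret and the date, while a static blocklist cannot.

```python
import hashlib
from datetime import date

def toy_dga(seed, day, count=3, length=10, tld=".bid"):
    """Derive `count` pseudo-random domain names from a seed and a date."""
    names = []
    for i in range(count):
        material = f"{seed}:{day.isoformat()}:{i}".encode()
        digest = hashlib.sha256(material).digest()
        # Map the first `length` bytes of the hash to lowercase letters.
        label = "".join(chr(ord("a") + b % 26) for b in digest[:length])
        names.append(label + tld)
    return names

# Same seed and date always give the same names; a new day gives new ones.
today_names = toy_dga("hypothetical-seed", date(2018, 1, 1))
tomorrow_names = toy_dga("hypothetical-seed", date(2018, 1, 2))
```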

What should we do? Should firewall and adblock vendors give up?

Well, you and I, looking at the domain list, might have an inkling of good versus bad domains. Let’s try. Here is a list of ~10 generated malware sites, and ~10 normal sites.

baidu.com
aaqpajztftqw.com
aanvxbvkdxph.com
facebook.com
abaujsqnndg.bid
aajychvi.bid
aaomstbnbiqo.com
google.co.in
google.com
aaslmqzce.bid
aatfnptblbxpuy.bid
qq.com
reddit.com
wikipedia.org
aaeqlxdgx.bid
yahoo.com
youtube.com
aapxtnrhq.bid

Can you, gentle reader, see which are real and which are not? Yes, it turns out you can. Now, let’s examine your algorithm:

  • Can I pronounce it (englishness)?
  • Is it too short? Too long?
  • It just looks like/not like a real domain
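
The first two of those rules can be hand-coded as a crude sketch. The thresholds below are guesses of mine, not tuned values, and ‘pronounceable’ is approximated by vowel ratio and the longest run of consonants.

```python
VOWELS = set("aeiou")

def features(domain):
    """Compute simple 'does it look like a word' features for the first label."""
    label = domain.split(".")[0].lower()
    letters = [c for c in label if c.isalpha()]
    vowel_ratio = sum(c in VOWELS for c in letters) / len(letters) if letters else 0.0
    run = longest = 0
    for c in label:
        if c.isalpha() and c not in VOWELS:
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return {"length": len(label), "vowel_ratio": vowel_ratio,
            "max_consonant_run": longest}

def looks_generated(domain, max_run=4, min_vowel_ratio=0.2):
    """Flag names with long consonant runs or very few vowels."""
    f = features(domain)
    return f["max_consonant_run"] >= max_run or f["vowel_ratio"] < min_vowel_ratio
```

Rules like these misfire on short real names such as qq.com (no vowels at all), which is part of the argument for learning the third rule, ‘it just looks like a real domain’, from data instead.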

OK, so I can hire you to fix this problem, right? You sit in my basement, and for each DNS name I look up, you hit a red/green button? And I sometimes reward you/punish you? Ha!

OK, so if we can teach you to do this, we can teach a machine. More specifically, we teach a machine to teach itself. Here is an example implementation (from a paper here). We have a ‘good’ set (Alexa top N), and a ‘bad’ set (from here). These are called labelled data. We then train a machine-learning algorithm (in this case a Long Short-Term Memory (LSTM) network). It has a type of memory, it can ‘vaguely’ remember things (sort of like you do), e.g. ‘yahoo.com’ looks kind of like what I expect.

So we create a neural-network. We feed it some known good, and known bad, and correct its guesses, letting it iterate. Eventually it gets good enough we let it rip. We then integrate it in w/ dnsmasq on our firewall. Each new domain that comes through, we ask it: ‘good or dga’? If it says ‘dga’ (it gives a probability, so we would set the threshold to e.g. 80% likely), we respond 127.0.0.1.
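
The decision step is small enough to sketch. This is not a dnsmasq API: score_fn and upstream_fn are hypothetical stand-ins for ‘ask the model’ and ‘ask the real resolver’, and how you hook it into dnsmasq is left open.

```python
SINKHOLE = "127.0.0.1"

def answer(domain, score_fn, upstream_fn, threshold=0.8):
    """Return the sinkhole address for likely-DGA names, else the real answer.

    score_fn(domain)    -> probability in [0, 1] that the name is generated
    upstream_fn(domain) -> address returned by the real resolver
    """
    if score_fn(domain) >= threshold:
        return SINKHOLE
    return upstream_fn(domain)

# Usage with stand-in functions (a real setup would call the trained model):
fake_score = lambda d: 0.95 if d.endswith(".bid") else 0.05
fake_upstream = lambda d: "192.0.2.10"   # documentation-range address
blocked = answer("aaslmqzce.bid", fake_score, fake_upstream)
allowed = answer("reddit.com", fake_score, fake_upstream)
```

Raising the threshold trades missed miners for fewer broken real sites, so it is worth picking it from test data rather than by feel.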

This way we don’t have to keep updating lists from e.g. https://github.com/Marfjeh/coinhive-block, https://github.com/ZeroDot1/CoinBlockerLists, etc. (and worry that someone has added something they should not have!)

No more drive-by crypto-miner. Until next time 🙂

ps there are some other DGA LSTM implementations out there, feel free to try them and tweak them.

So every day there is a new site which fesses up that they have been pwned. Someone came in, stole the lot, but don’t worry, they only got something minor. And then a few days later, well, it seems there was a bit more. And by the time you stop caring about the story, it comes out that they got the universe.

And you, despite being a loyal reader of this blog, have used the same password on two sites. And you are pwned. [[ Side note: if you stop reading now, go to this link and check yourself out https://haveibeenpwned.com/ ]]

If you administrate a ‘real $ network’ one of your concerns is your team. You are only as strong as the weakest link. And you just know someone on your team uses the same password on some irrelevant blog as on your key customer data server.

So you concoct a plan. You will download all the leaks, and build a big database of them. A little google, a little dark-web-fu, you are there. You will make some pre-check script on your password db that checks people’s proposed passwords.

And then you run into a bit of an issue. You really need to have this on everyone’s desktop. And it’s kind of big. And it’s not obvious you want to do that.

So the path I have been researching is to use a type of AI called a ‘Generative Adversarial Network’. The idea is to train a model on this dataset. You then ship the model to each desktop (and the model cannot be reversed since it’s lossy). The model would say “this password you propose is *similar* to the dataset, and thus you should not use it”. But I found this pretty difficult to get correct enough to use.

So today I found a different solution. Check this ‘Validating Leaked Passwords with k-Anonymity’ by Cloudflare. And, cuz it’s 2018, a sample github repo that implements it. And it’s in bash! About time bash got some API love. Minus points for cheating and using curl rather than using bash built-in socket support.
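
For those who would rather not do it in bash, here is a sketch of the same idea in Python. It assumes the Pwned Passwords range endpoint (https://api.pwnedpasswords.com/range/ followed by the first 5 hex characters of the SHA-1), which returns lines of ‘SUFFIX:COUNT’. Only the 5-character prefix ever leaves your machine. The function names and the User-Agent string are mine.

```python
import hashlib
import urllib.request

def split_hash(password):
    """Return (first 5 hex chars, remaining 35) of the uppercase SHA-1."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]

def fetch_range_http(prefix):
    """Fetch all known hash suffixes for a prefix (makes a network call)."""
    req = urllib.request.Request(
        "https://api.pwnedpasswords.com/range/" + prefix,
        headers={"User-Agent": "k-anonymity-sketch"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

def pwned_count(password, fetch_range=fetch_range_http):
    """Return how many times the password appears in the leak set, 0 if none."""
    prefix, suffix = split_hash(password)
    for line in fetch_range(prefix).splitlines():
        candidate, _, count = line.partition(":")
        if candidate.strip().upper() == suffix:
            return int(count)
    return 0
```

A password-change hook would then refuse anything where pwned_count(...) is greater than zero. Passing fetch_range in as a parameter keeps the matching logic testable without touching the network.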

So now you, that hypothetical admin, can take that github repo, put a filter on the client side of ‘I want to use this new password’, and be safe.