Machine Learning as a way to block ads (that carry malware? specifically crypto-miners)

OK, fess up. Most of you run adblock(plus/adaware/pihole/…). And you love it. Other than Forbes, the world is a kinder gentler place.

But recently some of you may have noticed that your laptop’s battery life is poor and your fans run more. You are a victim of drive-by crypto-mining (probably Monero, but it could be Bitcoin or Ethereum). O noes! How did this sneak in? Did you forget to update your adblock list? No, it is up to date. Hmm.

Well, have a look at antipopads. Do you think this will scale for you? It seems the malvertising people have found the same Domain Generation Algorithm (DGA) trick that the botnets have been using for a decade. They rotate the domain name using an algorithm that is predictable to them, and so stymie your /etc/hosts or adblock approach, which is dictionary based.

This slide (from Cisco) shows an example of how these domains might be generated. And each piece of malware does it differently, meaning you can’t easily code the algorithm.
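To make the idea concrete, here is a toy sketch of a date-seeded generator. It is not the algorithm from the slide or from any real malware; the function name `toy_dga`, the seed string, and the hash-to-letters mapping are all made up for illustration. The point is only that both ends can compute the same list from a shared secret and today’s date, while a static blocklist cannot.

```python
import hashlib
from datetime import date


def toy_dga(seed: str, day: date, count: int = 5, tld: str = ".bid") -> list:
    """Derive `count` pseudo-random domain names from a shared seed and a date.

    Hypothetical illustration only: real DGAs differ per malware family.
    """
    domains = []
    for i in range(count):
        # Hash the secret seed, the date, and a counter together.
        digest = hashlib.sha256(f"{seed}-{day.isoformat()}-{i}".encode()).digest()
        # First byte picks a label length between 8 and 13 characters.
        length = 8 + digest[0] % 6
        # Following bytes are mapped onto the letters a..z.
        name = "".join(chr(ord("a") + b % 26) for b in digest[1:1 + length])
        domains.append(name + tld)
    return domains


if __name__ == "__main__":
    # Same seed and date always give the same list; a new date gives a new one.
    print(toy_dga("example-seed", date(2018, 1, 1)))
    print(toy_dga("example-seed", date(2018, 1, 2)))
```

Anyone holding the seed can regenerate tomorrow’s names in advance, which is why chasing them with a dictionary is a losing game.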

What should we do? Should firewall and adblock vendors give up?

Well, you and I, looking at the domain list, might have an inkling of a good versus a bad domain. Let’s try. Here is a list of ~10 generated malware sites and ~10 normal sites.

baidu.com
aaqpajztftqw.com
aanvxbvkdxph.com
facebook.com
abaujsqnndg.bid
aajychvi.bid
aaomstbnbiqo.com
google.co.in
google.com
aaslmqzce.bid
aatfnptblbxpuy.bid
qq.com
reddit.com
wikipedia.org
aaeqlxdgx.bid
yahoo.com
youtube.com
aapxtnrhq.bid

Can you, gentle reader, see which are real and which are not? Yes, it turns out you can. Now, let’s examine your algorithm:

  • Can I pronounce it (Englishness)?
  • Is it too short? Too long?
  • Does it just look like (or not like) a real domain?
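Those gut checks can be roughly turned into numbers. The sketch below is one guess at how: the feature names (`vowel_ratio`, `max_consonant_run`, `entropy`) and the choice to look only at the first label of the name are assumptions for illustration, not something taken from the paper or from any adblock tool.

```python
import math
from collections import Counter

VOWELS = set("aeiou")


def domain_features(domain: str) -> dict:
    """Crude numeric stand-ins for 'can I pronounce it?' and 'is it too long?'."""
    # Only look at the first label, e.g. 'google' in 'google.co.in'.
    label = domain.lower().split(".")[0]
    n = len(label)
    if n == 0:
        raise ValueError("empty domain label")

    # Character entropy: random-looking strings spread over more letters.
    counts = Counter(label)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())

    # Share of vowels: pronounceable names tend to have a healthy share.
    vowel_ratio = sum(ch in VOWELS for ch in label) / n

    # Longest run of consecutive consonants ('y' counts as a consonant here).
    longest = run = 0
    for ch in label:
        run = run + 1 if ch.isalpha() and ch not in VOWELS else 0
        longest = max(longest, run)

    return {
        "length": n,
        "vowel_ratio": vowel_ratio,
        "entropy": entropy,
        "max_consonant_run": longest,
    }


if __name__ == "__main__":
    for name in ["yahoo.com", "wikipedia.org", "aaqpajztftqw.com", "aaslmqzce.bid"]:
        print(name, domain_features(name))
```

Hand-picked features like these only go so far (qq.com is short and vowel-free, yet perfectly legitimate), which is part of the appeal of letting a model learn its own.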

OK, so I can hire you to fix this problem, right? You sit in my basement, and for each DNS name I look up you hit a red/green button? And I sometimes reward you/punish you? Ha!

OK, so if we can teach you to do this, we can teach a machine. More specifically, we teach a machine to teach itself. Here is an example implementation (from a paper here). We have a ‘good’ set (Alexa top N) and a ‘bad’ set (from here). These are called labelled data. We then train a machine learning model, in this case a Long Short-Term Memory (LSTM) network. It has a type of memory: it can ‘vaguely’ remember things (sort of like you do), e.g. ‘yahoo.com’ looks kind of like what I expect.
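Before an LSTM can look at a domain, the labelled names have to become sequences of numbers. Here is a minimal sketch of that preparation step, assuming a character-level encoding with left padding; the alphabet, `MAX_LEN`, and the tiny good/bad lists (borrowed from the sample above) are illustrative choices, not the setup from the paper.

```python
import string

# Index 0 is reserved for padding; known hostname characters get 1..N.
ALPHABET = string.ascii_lowercase + string.digits + "-."
CHAR_TO_INT = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
UNKNOWN = len(ALPHABET) + 1  # any character outside the alphabet
MAX_LEN = 32  # assumed cap on sequence length, chosen arbitrarily here


def encode(domain: str, max_len: int = MAX_LEN) -> list:
    """Turn a domain into a fixed-length list of integers, left-padded with 0."""
    ids = [CHAR_TO_INT.get(ch, UNKNOWN) for ch in domain.lower()[:max_len]]
    return [0] * (max_len - len(ids)) + ids


# Labelled data: 0 = good, 1 = DGA. Real training sets are far larger.
good = ["google.com", "wikipedia.org", "reddit.com"]
bad = ["aaqpajztftqw.com", "aajychvi.bid", "aaslmqzce.bid"]

X = [encode(d) for d in good + bad]
y = [0] * len(good) + [1] * len(bad)

if __name__ == "__main__":
    for domain, row, label in zip(good + bad, X, y):
        print(label, domain, row[-len(domain):])
```

The `X` and `y` lists are the shape of input a sequence model would typically train on: one row of character ids per domain, and one label per row.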

So we create a neural network. We feed it some known good and known bad domains and correct its guesses, letting it iterate. Eventually it gets good enough that we let it rip. We then integrate it with dnsmasq on our firewall. For each new domain that comes through, we ask it: ‘good or dga’? If it says ‘dga’ (it gives a probability, so we would set the threshold to e.g. 80% likely), we respond with 127.0.0.1.
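The decision step at the end is simple enough to sketch. Everything here is hypothetical scaffolding: `make_filter`, the cache, and the `placeholder_score` function with its made-up probabilities stand in for the trained model, and how the answer gets wired into dnsmasq is deliberately left out.

```python
from typing import Callable, Optional

SINKHOLE = "127.0.0.1"


def make_filter(score: Callable[[str], float], threshold: float = 0.8):
    """Wrap a scoring function (probability of 'dga') in a cached yes/no check."""
    cache = {}

    def answer(domain: str) -> Optional[str]:
        # Normalise case and strip the trailing dot of a fully qualified name.
        domain = domain.lower().rstrip(".")
        if domain not in cache:
            cache[domain] = score(domain) >= threshold
        # Sinkhole suspected DGA names; None means 'resolve normally'.
        return SINKHOLE if cache[domain] else None

    return answer


# Placeholder for the trained model: the probabilities below are invented.
FAKE_SCORES = {"aajychvi.bid": 0.97, "yahoo.com": 0.02, "qq.com": 0.55}


def placeholder_score(domain: str) -> float:
    return FAKE_SCORES.get(domain, 0.0)


if __name__ == "__main__":
    check = make_filter(placeholder_score, threshold=0.8)
    for name in ["aajychvi.bid", "yahoo.com", "qq.com"]:
        print(name, "->", check(name) or "resolve normally")
```

The threshold is the knob to play with: set it too low and odd-looking but legitimate names get sinkholed, too high and more generated names slip through.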

This way we don’t have to keep updating lists from e.g. https://github.com/Marfjeh/coinhive-block, https://github.com/ZeroDot1/CoinBlockerLists, etc. (and worry that someone has added something they should not have!)

No more drive-by crypto-miner. Until next time 🙂

PS: there are some other DGA LSTM implementations out there; feel free to try them and tweak them.

