Recognising popunder advertisements with machine learning: an implementation
Earlier I talked about crypto-malware spread via popads, an advertising network that serves popunder advertisements and uses a domain-generation algorithm to avoid being blocked by, e.g., adblock. In that post I postulated that one could use a neural network to tell ‘wwiqinsra.bid’ from ‘foobar.com’.
Well, I decided to give it a try. Head on over to my GitHub repo and have a look. It's far from perfect, so forks and pull requests are welcome.
In a nutshell, what I did is use the top 1M hosts from Cisco Umbrella and a dataset from Yhonay's GitHub. I split the Yhonay set in half, using one half for training and the other for testing. I then took an equal number of entries from the top 1M for training and for testing.
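The split described above can be sketched roughly like this. It is a minimal illustration, not the code from the repo: the function names `split_half` and `build_sets` are made up for the example, and I am assuming a random shuffle and that the benign list is at least as long as the popads list.

```python
import random

def split_half(items, seed=0):
    """Shuffle a copy of items and split it into two halves."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def build_sets(dga_names, benign_names, seed=0):
    """Return (train, test) lists of (name, label) pairs; label 1 = popads."""
    dga_train, dga_test = split_half(dga_names, seed)
    # Draw as many benign names as there are popads names, with no overlap
    # between the training and testing portions.
    rng = random.Random(seed + 1)
    benign = rng.sample(list(benign_names), len(dga_train) + len(dga_test))
    benign_train = benign[:len(dga_train)]
    benign_test = benign[len(dga_train):]
    train = [(n, 1) for n in dga_train] + [(n, 0) for n in benign_train]
    test = [(n, 1) for n in dga_test] + [(n, 0) for n in benign_test]
    return train, test
```

Keeping the two classes the same size means a classifier that always answers "normal" would score about 50%, which makes the accuracy figure easier to interpret.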
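from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense, Activation

# Embedding -> LSTM -> dropout -> single sigmoid output (binary classifier)
model = Sequential()
model.add(Embedding(max_features, 128, input_length=dataset['max_model_len']))
model.add(LSTM(128))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])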
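The Embedding layer wants fixed-length sequences of integers, so the names have to be encoded first. Here is a rough sketch of one way to do that, assuming a character-level encoding with 0 reserved for padding. The helpers `build_vocab` and `encode` are hypothetical names for this example, and the repo may well do it differently.

```python
def build_vocab(names):
    """Map each character seen in names to an integer id; 0 is kept for padding."""
    chars = sorted(set("".join(names)))
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode(name, vocab, max_len):
    """Turn a name into a list of max_len ids, left-padded with 0.

    Characters not in the vocabulary also map to 0 (an assumption made
    here for simplicity). Names longer than max_len keep their last
    max_len characters.
    """
    ids = [vocab.get(ch, 0) for ch in name][-max_len:]
    return [0] * (max_len - len(ids)) + ids

names = ["wwiqinsra", "foobar"]
vocab = build_vocab(names)
max_features = len(vocab) + 1               # ids 1..len(vocab) plus padding id 0
max_model_len = max(len(n) for n in names)  # pad everything to the longest name
encoded = [encode(n, vocab, max_model_len) for n in names]
```

With this scheme `max_features` is the size of the character set plus one, and `max_model_len` is the length of the longest name in the data.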
The code goes and fetches those two data sets (caching them locally between runs). It then trains itself (caching the model), and does the testing.
Does it work? In general, yes. The model reports its own accuracy at 98%. My test with the held-out data disagrees with that a bit.
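One way to see where the two numbers diverge is to count false positives and false negatives separately on the held-out set. This is a small sketch, assuming the model outputs a score between 0 and 1, that label 1 means popads, and that 0.5 is the cut-off. The scores in the example are made up, not results from my runs.

```python
def confusion(scores, labels, threshold=0.5):
    """Count outcomes; label 1 = popads, score >= threshold = flagged as popads."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label == 1:
            counts["tp"] += 1      # popads domain, correctly flagged
        elif flagged:
            counts["fp"] += 1      # normal domain, wrongly flagged
        elif label == 1:
            counts["fn"] += 1      # popads domain that slipped through
        else:
            counts["tn"] += 1      # normal domain, correctly left alone
    return counts

def accuracy(counts):
    """Fraction of correct decisions, or 0.0 for an empty set."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0

# Made-up scores, purely to show the call.
example = confusion([0.9, 0.2, 0.7, 0.1], [1, 1, 0, 0])
```

For an ad blocker the two kinds of mistake are not equally painful: a false positive breaks a real site, while a false negative just lets an ad through.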
Some names (e.g. dcybolsar) are classified as normal domains when they are popads ones, and some domains (e.g. vnecdn) are classified as popads ones when they are real. But, well, you could slap this model on your little ARM/MIPS-based router with a bit of Python and stop seeing ads. I’ll leave that part of the exercise to you.
Anyone have any ideas to improve it? Domain length? Adding the TLD back in (I deleted it from the model)? Something else?
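For the length and TLD ideas, the extra inputs could be pulled out with something like the following. It is only a sketch: `split_domain` and `extra_features` are hypothetical helpers, and treating the last label as the TLD is a naive assumption that gets things like 'co.uk' wrong.

```python
def split_domain(domain):
    """Split 'wwiqinsra.bid' into ('wwiqinsra', 'bid').

    Naive: only the last label counts as the TLD. A name with no dot
    is returned whole, with an empty TLD.
    """
    name, _, tld = domain.lower().rstrip(".").rpartition(".")
    return (name, tld) if name else (tld, "")

def extra_features(domain):
    """Return the candidate extra inputs: name length and TLD."""
    name, tld = split_domain(domain)
    return {"length": len(name), "tld": tld}
```

How best to feed these into the network alongside the character sequence is an open question; I have not tried it.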