Commit a206a0b

Merge pull request #26 from DevoInc/feat/dga-domain-classifier-keras-onnx-tutorial
feat: add dga domains classifier (keras-onnx) tutorial
2 parents 576d5f1 + 97dde62 commit a206a0b

File tree

4 files changed

+841
-0
lines changed

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -4,6 +4,10 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).

+## [1.4.0] - 2023-03-07
+### Added
+* Add DGA domains classifier (keras-onnx) tutorial.
+
 ## [1.3.0] - 2023-03-03
 ### Added
 * Add ecommerce clustering tutorial.
Lines changed: 337 additions & 0 deletions
@@ -0,0 +1,337 @@
1+
DGA domain classifier (Keras-ONNX)
2+
==================================
3+
4+
This tutorial is related to the
5+
:ref:`DGA domain classifier using H2O engine tutorial
6+
<tutorials/dga-domain-classifier:DGA domain classifier>`
7+
but in this case `Keras <https://keras.io/>`_ is used as the machine learning engine.
8+
9+
We are going to use the `Keras framework <https://github.com/keras-team/keras>`_
10+
to create a model capable of detecting whether a domain is malicious
11+
or not. Then in order to be able to register and use the `Keras` model in
12+
`Devo <https://www.devo.com>`_ we will show how to transform it into `ONNX`
13+
format.
14+
15+
.. note::
16+
17+
`The tutorial is available as a Jupyter notebook
18+
<https://github.com/DevoInc/python-mlmodelmanager-client/blob/main/notebooks/dga-domain-classifier-keras-onnx.ipynb>`_.
19+
20+
Build the model
21+
---------------
22+
23+
24+
Let's start by installing the required packages.
25+
26+
.. code-block:: bash
27+
28+
$ pip install devo-sdk
29+
$ pip install devo-mlmodelmanager
30+
$ pip install tensorflow
31+
$ pip install tf2onnx
32+
$ pip install scikit-learn
33+
$ pip install numpy
34+
$ pip install pandas
35+
36+
We can start coding with the required imports.
37+
38+
.. code-block::
39+
40+
import os
41+
import math
42+
import numpy as np
43+
import pandas as pd
45+
import tensorflow as tf
46+
import tf2onnx
47+
48+
from collections import Counter
49+
from sklearn.preprocessing import LabelEncoder
50+
from devo.api import Client, ClientConfig, SIMPLECOMPACT_TO_OBJ
51+
from devo_ml.modelmanager import create_client_from_token, engines
52+
53+
Declare some constants for convenience in the code.
54+
55+
.. code-block::
56+
57+
# A valid Devo access token
58+
TOKEN = '<your_token_here>'
59+
60+
# URL of Devo API, e.g. https://apiv2-us.devo.com/search/query/
61+
DEVO_API_URL = '<devo_api_url_here>'
62+
63+
# URL of Devo ML Model Manager, e.g. https://api-us.devo.com/mlmodelmanager/
64+
DEVO_MLMM_URL = '<devo_mlmm_url_here>'
65+
66+
# The domain to connect to, e.g. self
67+
DOMAIN = '<your_domain_here>'
68+
69+
# The name of the model
70+
NAME = 'dga_classifier_onnx'
71+
72+
# The description of the model
73+
DESCRIPTION = 'DGA domain classifier (Keras-ONNX)'
74+
75+
# File to store the onnx model
76+
MODEL_FILE = f'{NAME}.onnx'
77+
78+
# The URL of a dataset to build the model
79+
DATASET_URL = "https://devo-ml-models-public-demos.s3.eu-west-3.amazonaws.com/legit_dga/dataset.csv"
80+
81+
VOWELS = "aeiouAEIOU"
82+
83+
# fix random seed for reproducibility
84+
seed = 42
85+
tf.keras.utils.set_random_seed(seed)  # seeds Python, NumPy and TensorFlow
86+
87+
This `dataset
88+
<https://devo-ml-models-public-demos.s3.eu-west-3.amazonaws.com/legit_dga/dataset.csv>`_
89+
will help us to train our model once it has been built. The dataset has the
90+
form ``host;domain;class;subclass``.
91+
92+
.. code-block:: text
93+
94+
host;domain;class;subclass
95+
000directory.com.ar;000directory;legit;legit
96+
001fans.com;001fans;legit;legit
97+
...
98+
1002n0q11m17h017r1shexghfqf.net;1002n0q11m17h017r1shexghfqf;dga;newgoz
99+
100bestbuy.com;100bestbuy;legit;legit
100+
...
101+
102+
In the dataset preparation we will add the columns ``length``, ``entropy`` and
103+
``vowel_proportion`` for each domain, and also the flag ``malicious`` indicating
104+
if it is a DGA domain according to the ``class`` column value.
105+
106+
.. code-block::
107+
108+
def entropy(s):
109+
l = len(s)
110+
return -sum(map(lambda a: (a/l)*math.log2(a/l), Counter(s).values()))
111+
112+
domains = pd.read_csv(DATASET_URL, sep=';')
113+
114+
domains = domains[~domains['subclass'].isna()].copy()
115+
domains['length'] = domains['domain'].str.len()
116+
domains['entropy'] = domains['domain'].apply(lambda row: entropy(row))
117+
domains['vowel_proportion'] = 0
118+
for v in VOWELS:
119+
domains['vowel_proportion'] += domains['domain'].str.count(v)
120+
domains['vowel_proportion'] /= domains['length']
121+
domains['malicious'] = domains['class'] != 'legit'
122+
123+
After preparation, our dataset of domains should look like this.
124+
125+
.. code-block::
126+
127+
>>> domains.head()
128+
host domain class subclass length entropy vowel_proportion malicious
129+
0 000directory.com.ar 000directory legit legit 12 3.022055 0.250000 False
130+
1 000webhost.com 000webhost legit legit 10 2.846439 0.200000 False
131+
2 001fans.com 001fans legit legit 7 2.521641 0.142857 False
132+
3 01-telecharger.com 01-telecharger legit legit 14 3.324863 0.285714 False
133+
4 010shangpu.com 010shangpu legit legit 10 3.121928 0.200000 False
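To make the three derived columns concrete, the following is a minimal sketch that recomputes them for a single domain string using only the standard library. The ``domain_features`` helper is hypothetical (it is not part of the tutorial or of any Devo package); it simply mirrors the pandas steps above so that one row of the table can be checked by hand.

```python
import math
from collections import Counter

VOWELS = "aeiouAEIOU"


def entropy(s):
    """Shannon entropy (in bits) of the character distribution of s."""
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def domain_features(domain):
    """Hypothetical helper: the three model inputs for one domain string."""
    length = len(domain)
    vowels = sum(domain.count(v) for v in VOWELS)
    return {
        "length": length,
        "entropy": entropy(domain),
        "vowel_proportion": vowels / length,
    }


# First row of the table above: length 12, entropy ~3.022055, vowel_proportion 0.25
features = domain_features("000directory")
print(features)
```

The values agree with the first row of ``domains.head()``, which is a quick way to confirm that features computed elsewhere follow the same definitions as the training data.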
134+
135+
.. note::
136+
137+
Be aware that our dataset is a `pandas.DataFrame
138+
<https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`_.
139+
140+
We are now ready to build the model. We will rely on a
141+
`Keras Sequential model <https://keras.io/guides/sequential_model/>`_ for that.
142+
143+
.. code-block::
144+
145+
model = tf.keras.models.Sequential()
146+
model.add(tf.keras.layers.Dense(
147+
10,
148+
input_dim=3,
149+
activation=tf.nn.relu,
150+
kernel_initializer='he_normal',
151+
kernel_regularizer=tf.keras.regularizers.l2(0.01)
152+
))
153+
model.add(tf.keras.layers.BatchNormalization())
154+
model.add(tf.keras.layers.Dropout(0.3))
155+
model.add(tf.keras.layers.Dense(
156+
7,
157+
activation=tf.nn.relu,
158+
kernel_initializer='he_normal',
159+
kernel_regularizer=tf.keras.regularizers.l1_l2(l1=0.001, l2=0.001)
160+
))
161+
model.add(tf.keras.layers.BatchNormalization())
162+
model.add(tf.keras.layers.Dropout(0.3))
163+
model.add(tf.keras.layers.Dense(
164+
5,
165+
activation=tf.nn.relu,
166+
kernel_initializer='he_normal',
167+
kernel_regularizer=tf.keras.regularizers.l1_l2(l1=0.001, l2=0.001)
168+
))
169+
model.add(tf.keras.layers.Dense(2, activation=tf.nn.softmax))
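As a sanity check on the architecture, the size of the network can be worked out by hand. The sketch below is plain Python arithmetic, not a Keras call; it assumes the layer sizes shown above and the usual Keras conventions (a ``Dense`` layer has one weight per input/unit pair plus one bias per unit, ``BatchNormalization`` keeps four values per feature of which two are trainable, and ``Dropout`` has no parameters). The result should be comparable with what ``model.summary()`` reports.

```python
# (inputs, units) for each Dense layer, in the order they are added above
dense_layers = [(3, 10), (10, 7), (7, 5), (5, 2)]
# Number of features normalized by each BatchNormalization layer
batch_norm_features = [10, 7]

# Dense: one weight per input/unit pair plus one bias per unit
dense_params = [n_in * n_out + n_out for n_in, n_out in dense_layers]

# BatchNormalization: gamma, beta, moving mean and moving variance per feature;
# only gamma and beta are updated by gradient descent
bn_params = [4 * n for n in batch_norm_features]
non_trainable = sum(2 * n for n in batch_norm_features)

total = sum(dense_params) + sum(bn_params)
trainable = total - non_trainable

# 237 parameters in total, 203 of them trainable
print(dense_params, bn_params, total, trainable)
```

With only a few hundred parameters the model stays very small, which keeps the exported ONNX file lightweight.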
170+
171+
Before we can train our model we have to properly transform the data for `Keras`.
172+
173+
.. code-block::
174+
175+
Y = domains['malicious']
176+
X = domains.drop(
177+
['host', 'domain', 'class', 'subclass', 'malicious'],
178+
axis=1
179+
)
180+
181+
# The categorical_crossentropy loss used below expects one-hot encoded targets.
182+
lbl_clf = LabelEncoder()
183+
Y_final = tf.keras.utils.to_categorical(lbl_clf.fit_transform(Y))
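The effect of the encoding step can be illustrated without any library. The sketch below is a pure-Python imitation, under the assumption that ``LabelEncoder`` assigns integer codes following the sorted order of the classes and that ``to_categorical`` turns each code into a one-hot row; the variable names are illustrative only.

```python
# A few example values of the 'malicious' flag
labels = [False, True, True, False]

# Sorted unique classes, as LabelEncoder is assumed to order them: False < True
classes = sorted(set(labels))

# Integer code of each label: False -> 0, True -> 1
codes = [classes.index(label) for label in labels]

# One-hot rows: column 0 stands for "legit", column 1 for "malicious"
one_hot = [
    [1.0 if column == code else 0.0 for column in range(len(classes))]
    for code in codes
]

print(classes)
print(codes)
print(one_hot)
```

Under these assumptions the first output of the softmax layer corresponds to the legitimate class and the second to the malicious class, which matters later when a single element of the prediction is read.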
184+
185+
Let's train our model with our transformed datasets, ``X`` and ``Y_final``.
186+
187+
.. code-block::
188+
189+
model.compile(
190+
optimizer='adam',
191+
loss='categorical_crossentropy',
192+
metrics=['accuracy']
193+
)
194+
model.fit(X, Y_final, epochs=10, batch_size=7)
195+
196+
You will see the progress of the training in the output, something like this.
197+
198+
.. code-block::
199+
200+
>>>
201+
Epoch 1/10
202+
19133/19133 [==============================] - 59s 3ms/step - loss: 0.4520 - accuracy: 0.8100
203+
Epoch 2/10
204+
19133/19133 [==============================] - 58s 3ms/step - loss: 0.4413 - accuracy: 0.8037
205+
Epoch 3/10
206+
19133/19133 [==============================] - 54s 3ms/step - loss: 0.4282 - accuracy: 0.8098
207+
Epoch 4/10
208+
19133/19133 [==============================] - 54s 3ms/step - loss: 0.4301 - accuracy: 0.8098
209+
Epoch 5/10
210+
19133/19133 [==============================] - 55s 3ms/step - loss: 0.4299 - accuracy: 0.8085
211+
Epoch 6/10
212+
19133/19133 [==============================] - 55s 3ms/step - loss: 0.4249 - accuracy: 0.8124
213+
Epoch 7/10
214+
19133/19133 [==============================] - 54s 3ms/step - loss: 0.4284 - accuracy: 0.8101
215+
Epoch 8/10
216+
19133/19133 [==============================] - 57s 3ms/step - loss: 0.4292 - accuracy: 0.8083
217+
Epoch 9/10
218+
19133/19133 [==============================] - 58s 3ms/step - loss: 0.4295 - accuracy: 0.8096
219+
Epoch 10/10
220+
19133/19133 [==============================] - 57s 3ms/step - loss: 0.4278 - accuracy: 0.8091
221+
<keras.callbacks.History at 0x7f02e1620610>
222+
223+
.. note::
224+
225+
The Keras framework is beyond the scope of this tutorial; please refer
226+
to `Keras API reference <https://keras.io/api/>`_ to learn more.
227+
228+
Register the model
229+
------------------
230+
231+
In order to register the model in Devo we need to transform it to `ONNX` format
232+
first.
233+
234+
We will use the
235+
`tf2onnx <https://onnxruntime.ai/docs/tutorials/tf-get-started.html>`_
236+
tool to convert our `Keras` model to `ONNX`.
237+
238+
.. code-block::
239+
240+
tf2onnx.convert.from_keras(model, opset=13, output_path=MODEL_FILE)
241+
242+
Once the model has been transformed and saved, it must be registered on the
243+
Devo platform in order to exploit it.
244+
245+
We will use the ML Model Manager Client for that.
246+
247+
.. code-block::
248+
249+
# create the mlmm client
250+
mlmm = create_client_from_token(DEVO_MLMM_URL, TOKEN)
251+
252+
# register the model
253+
mlmm.add_model(
254+
NAME,
255+
engines.ONNX,
256+
MODEL_FILE,
257+
description=DESCRIPTION,
258+
force=True
259+
)
260+
261+
.. note::
262+
263+
Refer to :ref:`user_guide/index:User's Guide` of this documentation to learn
264+
more about the ML Model Manager Client.
265+
266+
We now have everything ready to exploit our model, i.e. to detect
267+
malicious domains.
268+
269+
Classify domains
270+
----------------
271+
272+
One way to evaluate a model is to use the ``mlevalmodel(...)`` operator when
273+
querying a table. The ``mlevalmodel(...)`` operator is capable of evaluating
274+
machine learning models and is available in the Devo query engine.
275+
276+
We are going to use the ``demo.ecommerce.data`` table, which contains the
277+
``referralUri`` field, from which we can extract the domain we want to check.
278+
279+
A suitable query could look like this.
280+
281+
.. code-block::
282+
283+
query = f'''from demo.ecommerce.data
284+
select split(referralUri, "/",2) as domain,
285+
float(length(domain)) as length,
286+
shannonentropy(domain) as entropy,
287+
float(countbyfilter(domain, "{VOWELS}")) as vowel_proportion,
288+
at(mlevalmodel("{DOMAIN}", "{NAME}", [float4(length), float4(entropy), float4(vowel_proportion)]),0) as res,
289+
ifthenelse(res>0.5, "false", "true") as isMalicious
290+
'''
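The last two lines of the query read the first element of the model output and compare it with ``0.5``. A minimal Python sketch of that decision rule follows; it assumes the output is ordered as ``[P(legit), P(malicious)]``, which follows from the boolean target being encoded with ``False`` first. The ``is_malicious`` function is hypothetical and only mirrors the ``at(..., 0)`` and ``ifthenelse(...)`` expressions.

```python
def is_malicious(probabilities, threshold=0.5):
    """Mirror of the query logic: returns the strings "true" or "false"."""
    # at(mlevalmodel(...), 0): probability assigned to the first class (legit)
    res = probabilities[0]
    # ifthenelse(res > 0.5, "false", "true")
    return "false" if res > threshold else "true"


print(is_malicious([0.9, 0.1]))  # legit probability is high
print(is_malicious([0.2, 0.8]))  # legit probability is low
```

Because the comparison is made against the probability of the legitimate class, a value above the threshold yields ``"false"`` for ``isMalicious``, and anything else yields ``"true"``.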
291+
292+
.. note::
293+
294+
Refer to `Build a query using LINQ
295+
<https://docs.devo.com/space/latest/95191261/Build+a+query+using+LINQ>`_
296+
to learn more about queries.
297+
298+
Now we just need to create a client for the Devo API and launch the
299+
query.
300+
301+
With the `Devo Python SDK <https://github.com/DevoInc/python-sdk>`_,
302+
among other features, we can execute queries against the Devo platform easily
303+
and securely.
304+
305+
.. code-block::
306+
307+
# create a Devo API client
308+
api = Client(
309+
auth={"token": TOKEN},
310+
address=DEVO_API_URL,
311+
config=ClientConfig(
312+
response="json/simple/compact",
313+
stream=True,
314+
processor=SIMPLECOMPACT_TO_OBJ
315+
)
316+
)
317+
318+
response = api.query(query=query, dates={'from': "now()-1*hour()"})
319+
320+
for row in response:
321+
print("domain:", row['domain'], "isMalicious:", row['isMalicious'])
322+
323+
You will see a result like the following depending on the contents of the
324+
``demo.ecommerce.data`` table.
325+
326+
.. code-block::
327+
328+
>>>
329+
domain: www.logcasts.com isMalicious: false
330+
domain: www.google.com isMalicious: false
331+
domain: www.logtrust.com isMalicious: false
332+
...
333+
334+
.. note::
335+
336+
Refer to `Query API <https://docs.devo.com/space/latest/95128275>`_
337+
to learn more about the Devo Query API.

docs/source/tutorials/index.rst

Lines changed: 1 addition & 0 deletions
@@ -9,4 +9,5 @@ Manager Client in the process of development a ML model.
 :caption: Contents

 dga-domain-classifier
+dga-domain-classifier-keras-onnx
 ecommerce_clustering
