Commit 1a1aa4c

Merge pull request #27 from DevoInc/feature/dga-domains-classifier-gb-onnx
Feature/dga domains classifier gb onnx
2 parents f63a639 + e95c65d commit 1a1aa4c

10 files changed: +1159 -110 lines changed

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -4,6 +4,10 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).

+## [1.5.0] - 2023-03-13
+### Added
+* Add DGA domains classifier (gm-onnx) tutorial.
+
 ## [1.4.0] - 2023-03-07
 ### Added
 * Add DGA domains classifier (keras-onnx) tutorial.

docs/source/tutorials/dga-domain-classifier-keras-onnx.rst

Lines changed: 32 additions & 15 deletions
@@ -17,9 +17,17 @@ format.
 `The tutorial is available as a Jupyter notebook
 <https://github.com/DevoInc/python-mlmodelmanager-client/blob/main/notebooks/dga-domain-classifier-keras-onnx.ipynb>`_.

-Build the model
----------------
+Requirements
+------------
+
+* Python >= 3.7.
+* Devo table ``demo.ecommerce.data``.
+
+It is recommended for convenience to create a virtual environment to run the
+tutorial or use the notebook provided.

+Setup
+-----

 Let's start by installing the required packages.

@@ -55,7 +63,7 @@ Declare some constants for convenience in the code.
 .. code-block::

 # A valid Devo access token
-TOKEN = '<your_token_here>'
+DEVO_TOKEN = '<your_token_here>'

 # URL of Devo API, e.g. https://apiv2-us.devo.com/search/query/
 DEVO_API_URL = '<devo_api_url_here>'
@@ -67,13 +75,13 @@ Declare some constants for convenience in the code.
 DOMAIN = '<your_domain_here>'

 # The name of the model
-NAME = 'dga_classifier_onnx'
+MODEL_NAME = 'dga_classifier_onnx'

 # The description of the models
-DESCRIPTION = 'DGA domain classifier (Keras-ONNX)'
+MODEL_DESCRIPTION = 'DGA domain classifier (Keras-ONNX)'

 # File to store the onnx model
-MODEL_FILE = f'{NAME}.onnx'
+MODEL_FILE = f'{MODEL_NAME}.onnx'

 # The URL of a dataset to build the model
 DATASET_URL = "https://devo-ml-models-public-demos.s3.eu-west-3.amazonaws.com/legit_dga/dataset.csv"
@@ -84,6 +92,9 @@ Declare some constants for convenience in the code.
 seed = 42
 np.random.seed(seed)

+Prepare the data
+----------------
+
 This `dataset
 <https://devo-ml-models-public-demos.s3.eu-west-3.amazonaws.com/legit_dga/dataset.csv>`_
 will help us to train our model once it has been built. The dataset has the
@@ -137,6 +148,9 @@ After preparation our dataset of domains should looks like this.
 Be aware that our dataset is a `pandas.DataFrame
 <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`_.

+Build the model
+---------------
+
 We are now ready to build the model. We will rely on a
 `Keras Sequential model <https://keras.io/guides/sequential_model/>`_ for that.

@@ -225,19 +239,22 @@ You will see the progress of the training in the output, something like this.
 The `Keras framework` is beyond the scope of this tutorial, please, refer
 to `Keras API reference <https://keras.io/api/>`_ to learn more.

-Register the model
-------------------
+Transform to ONNX
+-----------------

 In order to register the model in Devo we need to transform it to `ONNX` format
 first.

 We will use the
 `tf2onnx <https://onnxruntime.ai/docs/tutorials/tf-get-started.html>`_
-tool to convert our `Keras` model to `ONNX`.
+tool to convert our `Keras` model to `ONNX` and save it.

 .. code-block::

-tf2onnx.convert.from_keras(model, opset=13, output_path=MODEL_FILE)
+onnx_model = tf2onnx.convert.from_keras(model, opset=13, output_path=MODEL_FILE)
+
+Register the model
+------------------

 Once the model has been transformed and saved, it must be registered on the
 Devo platform in order to exploit it.
@@ -247,14 +264,14 @@ We will use the ML Model Manager Client for that.
 .. code-block::

 # create the mlmm client
-mlmm = create_client_from_token(DEVO_MLMM_URL, TOKEN)
+mlmm = create_client_from_token(DEVO_MLMM_URL, DEVO_TOKEN)

 # register the model
 mlmm.add_model(
-NAME,
+MODEL_NAME,
 engines.ONNX,
 MODEL_FILE,
-description=DESCRIPTION,
+description=MODEL_DESCRIPTION,
 force=True
 )

@@ -285,7 +302,7 @@ A query that might be worthwhile would be something like this.
 float(length(domain)) as length,
 shannonentropy(domain) as entropy,
 float(countbyfilter(domain, "{VOWELS}")) as vowel_proportion,
-at(mlevalmodel("{DOMAIN}", "{NAME}", [float4(length), float4(vowel_proportion)]),0) as res,
+at(mlevalmodel("{DOMAIN}", "{MODEL_NAME}", [float4(length), float4(vowel_proportion)]),0) as res,
 ifthenelse(res>0.5, "false", "true") as isMalicious
 '''

@@ -306,7 +323,7 @@ and securely.

 # create a Devo API client
 api = Client(
-auth={"token": TOKEN},
+auth={"token": DEVO_TOKEN},
 address=DEVO_API_URL,
 config=ClientConfig(
 response="json/simple/compact",
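
Note on the rename in this file: the query hunk at new line 305 interpolates `MODEL_NAME` and `DOMAIN` into the query text. The sketch below illustrates that substitution in isolation. It assumes the query is built with an f-string (the `{...}` placeholders suggest this, but the opening of the string is not visible in the diff), and the `DOMAIN` value and the `VOWELS` value are placeholders chosen for illustration, not values taken from this file.

```python
# Placeholder values; the tutorial leaves DOMAIN for the reader to fill in.
DOMAIN = "example_domain"            # hypothetical Devo domain
MODEL_NAME = "dga_classifier_onnx"   # value shown in the diff
VOWELS = "aeiouAEIOU"                # assumed; not defined in the visible hunks

# Assumption: the tutorial builds the query with an f-string.
query_fragment = f'''
float(length(domain)) as length,
float(countbyfilter(domain, "{VOWELS}")) as vowel_proportion,
at(mlevalmodel("{DOMAIN}", "{MODEL_NAME}", [float4(length), float4(vowel_proportion)]),0) as res
'''

print(query_fragment)
```

If any reference to the old `NAME` constant were left behind after the rename, evaluating the f-string would raise a `NameError`, so a stale placeholder would fail immediately rather than silently.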

docs/source/tutorials/dga-domain-classifier.rst

Lines changed: 40 additions & 29 deletions
@@ -27,8 +27,17 @@ evaluate and classify in real time the domains in the table using a query.
 `The tutorial is available as a Jupyter notebook
 <https://github.com/DevoInc/python-mlmodelmanager-client/blob/main/notebooks/dga-domain-classifier.ipynb>`_.

-Build the model
----------------
+Requirements
+------------
+
+* Python >= 3.7.
+* Devo table ``demo.ecommerce.data``.
+
+It is recommended for convenience to create a virtual environment to run the
+tutorial or use the notebook provided.
+
+Setup
+-----

 Let's start by installing the required packages.

@@ -54,7 +63,7 @@ Declare some constants for convenience in the code.
 .. code-block::

 # A valid Devo access token
-TOKEN = '<your_token_here>'
+DEVO_TOKEN = '<your_token_here>'

 # URL of Devo API, e.g. https://apiv2-us.devo.com/search/query/
 DEVO_API_URL = '<devo_api_url_here>'
@@ -66,10 +75,10 @@ Declare some constants for convenience in the code.
 DOMAIN = '<your_domain_here>'

 # The name of the model
-NAME = 'dga_classifier'
+MODEL_NAME = 'dga_classifier'

 # The description of the models
-DESCRIPTION = 'DGA domain classifier'
+MODEL_DESCRIPTION = 'DGA domain classifier'

 # The path where model file will be stored
 MODELS_PATH = '~/models'
@@ -79,11 +88,13 @@ Declare some constants for convenience in the code.

 VOWELS = "aeiouAEIOU"

-We use the `h2o <https://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/index.html>`_
-library to create a model capable of detecting whether a domain is malicious or
-not and this `dataset
+Prepare the data
+----------------
+
+This `dataset
 <https://devo-ml-models-public-demos.s3.eu-west-3.amazonaws.com/legit_dga/dataset.csv>`_
-, which has the form: *host;domain;class;subclass.*
+will help us to train our model once it has been built. The dataset has the
+form ``host;domain;class;subclass``.

 .. code-block:: text

@@ -98,16 +109,12 @@ not and this `dataset
 100bestbuy.com;100bestbuy;legit;legit
 ...

-In the dataset preparation we will add the columns ``length``, ``entropy`` and
-``vowel_proportion`` for each domain, and also the flag ``malicious`` indicating
-if it is a DGA domain according to the ``class`` column value.
-
-As a result we will have a model saved in a file in `~/models`.
+We will add the columns ``length``, ``entropy`` and ``vowel_proportion`` for
+each domain, and also the flag ``malicious`` indicating if it is a DGA domain
+according to the ``class`` column value.

 .. code-block::

-h2o.init()
-
 # import dataset
 domains = h2o.import_file(DATASET_URL, header=1)

@@ -122,11 +129,21 @@ As a result we will have a model saved in a file in `~/models`.
 domains['malicious'] = domains['class'] != 'legit'
 domains['malicious'] = domains['malicious'].asfactor()

+Build the model
+---------------
+
+We use the `h2o <https://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/index.html>`_
+library to create a model capable of detecting whether a domain is malicious.
+
+.. code-block::
+
+h2o.init()
+
 # split dataset
 train, valid = domains.split_frame(ratios=[.8], seed=1234)

 # create and train the model
-model = H2OGradientBoostingEstimator(model_id=NAME)
+model = H2OGradientBoostingEstimator(model_id=MODEL_NAME)
 model.train(
 x=['length', 'entropy', 'vowel_proportion'],
 y='malicious',
@@ -140,12 +157,6 @@ As a result we will have a model saved in a file in `~/models`.

 h2o.cluster().shutdown()

-.. note::
-
-The aim of this tutorial is to show the integration of the ML Model
-Manager Client into the machine learning process not the development of
-an optimal and accurate machine learning model.
-
 Register the model
 ------------------

@@ -156,14 +167,14 @@ Client.
 .. code-block::

 # create the mlmm client
-mlmm = create_client_from_token(DEVO_MLMM_URL, TOKEN)
+mlmm = create_client_from_token(DEVO_MLMM_URL, DEVO_TOKEN)

 # register the model
 mlmm.add_model(
-NAME,
+MODEL_NAME,
 engines.H2O,
-os.path.join(MODELS_PATH, f"{NAME}.zip"),
-description=DESCRIPTION,
+os.path.join(MODELS_PATH, f"{MODEL_NAME}.zip"),
+description=MODEL_DESCRIPTION,
 force=True
 )

@@ -193,7 +204,7 @@ A query that might be worthwhile would be something like this.
 float(length(domain)) as length,
 shannonentropy(domain) as entropy,
 float(countbyfilter(domain, "{VOWELS}")) as vowel_proportion,
-mlevalmodel("{DOMAIN}", "{NAME}", length, entropy, vowel_proportion) as class
+mlevalmodel("{DOMAIN}", "{MODEL_NAME}", length, entropy, vowel_proportion) as class
 '''

 .. note::
@@ -211,7 +222,7 @@ and securely.

 # create a Devo API client
 api = Client(
-auth={"token": TOKEN},
+auth={"token": DEVO_TOKEN},
 address=DEVO_API_URL,
 config=ClientConfig(
 response="json/simple/compact",
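
Note on the three features named in this file (``length``, ``entropy``, ``vowel_proportion``): the code that computes them falls outside the visible hunks. The following is a minimal plain-Python sketch of what such features could look like, not the tutorial's h2o code. It assumes a base-2 Shannon entropy over characters and a vowel proportion defined as vowel count divided by length; the helper names `shannon_entropy` and `domain_features` are hypothetical.

```python
from collections import Counter
from math import log2

VOWELS = "aeiouAEIOU"  # same constant as declared in the tutorial


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits (assumed base 2)."""
    if not text:
        return 0.0
    total = len(text)
    # Sum p * log2(1/p) over the distinct characters, with p = n / total.
    return sum((n / total) * log2(total / n) for n in Counter(text).values())


def domain_features(domain: str) -> dict:
    """Compute the three illustrative features for one domain label."""
    vowels = sum(ch in VOWELS for ch in domain)
    return {
        "length": float(len(domain)),
        "entropy": shannon_entropy(domain),
        # Assumption: proportion = vowel count / length.
        "vowel_proportion": vowels / len(domain) if domain else 0.0,
    }


print(domain_features("100bestbuy"))
```

Whatever definition is used at training time has to match what the query computes at evaluation time. The query in the hunk above feeds `float(countbyfilter(domain, "{VOWELS}"))` to the model as `vowel_proportion`, so it is worth checking that this matches the training-side column.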
