Commit f42a438

Merge pull request #23 from DevoInc/feat/Add_tutorial_legit_dga
feat: Add DGA (Domain generation algorithms) domain classifier notebook
2 parents 84679a6 + b5964ea commit f42a438

7 files changed, with 684 additions and 10 deletions

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -50,3 +50,6 @@ venv.bak/
 
 # IntelliJ IDEA
 .idea/
+
+# Jupyter
+.ipynb_checkpoints

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -4,6 +4,10 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).
 
+## [1.2.0] - 2023-02-23
+### Added
+* Add DGA *(Domain generation algorithms)* domain classifier notebook.
+
 ## [1.1.1] - 2022-10-28
 #### Fixed
 * Sync change log.

README.md

Lines changed: 8 additions & 8 deletions
@@ -1,21 +1,21 @@
 ![License](https://img.shields.io/github/license/DevoInc/python-mlmodelmanager-client)
 ![Release](https://img.shields.io/github/v/release/DevoInc/python-mlmodelmanager-client?display_name=tag&sort=semver)
 ![Tests](https://github.com/DevoInc/python-mlmodelmanager-client/actions/workflows/test-tox.yml/badge.svg)
-![Python](https://img.shields.io/pypi/pyversions/devo-mlmodelmanager)
+![Python](https://img.shields.io/pypi/pyversions/devo-mlmodelmanager)
 
 # Devo Python ML Model Manager Client
 
-The **ML Model Manager** is a service to register machine learning models on
-[Devo](https://www.devo.com) platform. These models can be used through the
-query engine using the `mlevalmodel(...)` operation or through the
-[FLOW](https://docs.devo.com/space/latest/95213164/Flow) correlation engine
-including in the context the
+The **ML Model Manager** is a service to register machine learning models on
+[Devo](https://www.devo.com) platform. These models can be used through the
+query engine using the `mlevalmodel(...)` operation or through the
+[FLOW](https://docs.devo.com/space/latest/95213164/Flow) correlation engine
+including in the context the
 [MlSingleModelEval](https://docs.devo.com/space/latest/95214962/ML+Single+Model+Evaluator)
 unit.
 
 **devo-mlmodelmanager** provides an easy-to-use client for Devo’s ML Model
-Manager. Built on top of the widely used
-[Requests](https://requests.readthedocs.io/en/latest/) library exposes a
+Manager. Built on top of the widely used
+[Requests](https://requests.readthedocs.io/en/latest/) library exposes a
 simplified interface for model management, allowing you to focus in the machine
 learning workflows and not worry about the integration with Devo platform.
 

Lines changed: 231 additions & 0 deletions
@@ -0,0 +1,231 @@
+DGA domain classifier
+=====================
+
+This tutorial shows how to perform real-time `DGA` domain classification using
+a machine learning model for logs stored in the `Devo <https://www.devo.com>`_
+platform.
+
+Domain generation algorithms, `DGA`, are algorithms seen in various families of
+malware that are used to periodically generate a large number of domain names
+that can be used as rendezvous points with their command and control servers.
+
+For example, an infected computer could create thousands of domain names such
+as: www.<gibberish>.com and would attempt to contact a portion of these with
+the purpose of receiving an update or commands. See
+`DGA on Wikipedia <https://en.wikipedia.org/wiki/Domain_generation_algorithm>`_
+to learn more about domain generation algorithms.
+
+We are going to develop a simple machine learning model capable of classifying
+malicious domains in the *demo.ecommerce.data* table. We will use the Machine
+Learning `H2O <https://h2o.ai/>`_ framework to build and train our model,
+the Machine Learning Model Manager Client to register it on the Devo platform,
+and the `Devo Python SDK <https://github.com/DevoInc/python-sdk>`_ framework to
+evaluate and classify in real time the domains in the table using a query.
+
+.. note::
+
+    `The tutorial is available as a Jupyter notebook
+    <https://github.com/DevoInc/python-mlmodelmanager-client/blob/main/notebooks/dga-domain-classifier.ipynb>`_.
+
+Build the model
+---------------
+
+Let's start by installing the required packages.
+
+.. code-block:: bash
+
+    $ pip install devo-sdk
+    $ pip install devo-mlmodelmanager
+    $ pip install h2o
+
+Then the needed imports.
+
+.. code-block::
+
+    import os
+    import h2o
+
+    from h2o.estimators import H2OGradientBoostingEstimator
+    from devo.api import Client, ClientConfig, SIMPLECOMPACT_TO_OBJ
+    from devo_ml.modelmanager import create_client_from_token, engines
+
+Declare some constants for convenience in the code.
+
+.. code-block::
+
+    # A valid Devo access token
+    TOKEN = '<your_token_here>'
+
+    # URL of Devo API, e.g. https://apiv2-us.devo.com/search/query/
+    DEVO_API_URL = '<devo_api_url_here>'
+
+    # URL of Devo ML Model Manager, e.g. https://api-us.devo.com/mlmodelmanager/
+    DEVO_MLMM_URL = '<devo_mlmm_url_here>'
+
+    # The domain to connect to, e.g. self
+    DOMAIN = '<your_domain_here>'
+
+    # The name of the model
+    NAME = 'dga_classifier'
+
+    # The description of the model
+    DESCRIPTION = 'DGA domain classifier'
+
+    # The path where the model file will be stored ("~" is expanded to the
+    # user's home directory; os.makedirs does not expand it by itself)
+    MODELS_PATH = os.path.expanduser('~/models')
+
+    # The URL of a dataset to build the model
+    DATASET_URL = "https://devo-ml-models-public-demos.s3.eu-west-3.amazonaws.com/legit_dga/dataset.csv"
+
+    VOWELS = "aeiouAEIOU"
+
+We use the `h2o <https://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/index.html>`_
+library to create a model capable of detecting whether a domain is malicious,
+training it on this `dataset
+<https://devo-ml-models-public-demos.s3.eu-west-3.amazonaws.com/legit_dga/dataset.csv>`_,
+which has the form *host;domain;class;subclass*.
+
+.. code-block:: text
+
+    host;domain;class;subclass
+    000directory.com.ar;000directory;legit;legit
+    000webhost.com;000webhost;legit;legit
+    001fans.com;001fans;legit;legit
+    ...
+    1002n0q11m17h017r1shexghfqf.net;1002n0q11m17h017r1shexghfqf;dga;newgoz
+    1002ra86698fjpgqke1cdvbk5.org;1002ra86698fjpgqke1cdvbk5;dga;newgoz
+    1008bnt1iekzdt1fqjb76pijxhr.org;1008bnt1iekzdt1fqjb76pijxhr;dga;newgoz
+    100bestbuy.com;100bestbuy;legit;legit
+    ...
+
+In the dataset preparation we will add the columns ``length``, ``entropy`` and
+``vowel_proportion`` for each domain, and also the flag ``malicious`` indicating
+if it is a DGA domain according to the ``class`` column value.
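
For intuition, the three features can be sketched in plain Python, outside H2O. This is only an illustration under stated assumptions, not the tutorial's code: the ``domain_features`` helper is hypothetical, and the entropy computed here is the character-level Shannon entropy in bits, which may not match the exact scaling of H2O's ``entropy()``.

```python
import math
from collections import Counter

VOWELS = "aeiouAEIOU"


def domain_features(domain: str) -> dict:
    """Hypothetical helper: length, Shannon entropy (bits) and vowel proportion."""
    length = len(domain)
    if length == 0:
        # Avoid dividing by zero for an empty domain.
        return {"length": 0, "entropy": 0.0, "vowel_proportion": 0.0}

    # Shannon entropy over the character frequencies of the domain.
    counts = Counter(domain)
    entropy = -sum((n / length) * math.log2(n / length) for n in counts.values())

    # Share of characters that are vowels.
    vowels = sum(1 for ch in domain if ch in VOWELS)

    return {
        "length": length,
        "entropy": entropy,
        "vowel_proportion": vowels / length,
    }


# Two domains taken from the dataset sample above.
print(domain_features("100bestbuy"))
print(domain_features("1002n0q11m17h017r1shexghfqf"))
```

Randomly generated names tend to use more distinct characters and fewer vowels than names chosen by people, which is why these simple features can already separate many DGA domains from legitimate ones.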
+
+As a result we will have a model saved in a file in `~/models`.
+
+.. code-block::
+
+    h2o.init()
+
+    # import dataset
+    domains = h2o.import_file(DATASET_URL, header=1)
+
+    # prepare dataset
+    domains = domains[~domains['subclass'].isna()]
+    domains['length'] = domains['domain'].nchar()
+    domains['entropy'] = domains['domain'].entropy()
+    domains['vowel_proportion'] = 0
+    for v in VOWELS:
+        domains['vowel_proportion'] += domains['domain'].countmatches(v)
+    domains['vowel_proportion'] /= domains['length']
+    domains['malicious'] = domains['class'] != 'legit'
+    domains['malicious'] = domains['malicious'].asfactor()
+
+    # split dataset
+    train, valid = domains.split_frame(ratios=[.8], seed=1234)
+
+    # create and train the model
+    model = H2OGradientBoostingEstimator(model_id=NAME)
+    model.train(
+        x=['length', 'entropy', 'vowel_proportion'],
+        y='malicious',
+        training_frame=train,
+        validation_frame=valid
+    )
+
+    # save the model
+    os.makedirs(MODELS_PATH, exist_ok=True)
+    model.download_mojo(path=MODELS_PATH)
+
+    h2o.cluster().shutdown()
+
+.. note::
+
+    The aim of this tutorial is to show the integration of the ML Model
+    Manager Client into the machine learning process, not the development of
+    an optimal and accurate machine learning model.
+
+Register the model
+------------------
+
+Once the model has been developed and saved, it must be registered on the
+Devo platform in order to exploit it. For this we will use the ML Model Manager
+Client.
+
+.. code-block::
+
+    # create the mlmm client
+    client = create_client_from_token(DEVO_MLMM_URL, TOKEN)
+
+    # register the model
+    client.add_model(
+        NAME,
+        engines.H2O,
+        os.path.join(MODELS_PATH, f"{NAME}.zip"),
+        description=DESCRIPTION,
+        force=True
+    )
+
+.. note::
+
+    Refer to the :ref:`user's guide <User's Guide>` of this documentation to
+    learn more about the ML Model Manager Client.
+
+We now have everything ready to exploit our model.
+
+Classify domains
+----------------
+
+One way to evaluate a model is by querying a table with the
+``mlevalmodel(...)`` operator, which is available in the Devo query engine and
+evaluates machine learning models.
+
+We are going to use the *demo.ecommerce.data* table, which contains the
+``referralUri`` field, from which we can extract the domain we want to check.
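
To see what extracting the domain from ``referralUri`` amounts to, here is a small Python sketch. It assumes the field holds an absolute URI of the form ``scheme://host/path``, so that the host is the third piece (index 2) when splitting on ``/``. The ``extract_host`` helper and the example URI are hypothetical and only for illustration.

```python
from urllib.parse import urlsplit


def extract_host(referral_uri: str) -> str:
    """Hypothetical helper: return the host part of an absolute URI."""
    # "https://www.example.com/cart" -> ["https:", "", "www.example.com", "cart"]
    parts = referral_uri.split("/")
    return parts[2] if len(parts) > 2 else ""


uri = "https://www.example.com/cart?item=42"  # made-up example value
print(extract_host(uri))

# The standard library offers a sturdier way to get the same information.
print(urlsplit(uri).hostname)
```

Splitting on ``/`` is fragile for relative or malformed URIs, which is why the helper falls back to an empty string when there are fewer than three pieces.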
+
+A suitable query could look like this.
+
+.. code-block::
+
+    query = f'''from demo.ecommerce.data
+      select split(referralUri, "/",2) as domain,
+      float(length(domain)) as length,
+      shannonentropy(domain) as entropy,
+      float(countbyfilter(domain, "{VOWELS}")) as vowel_proportion,
+      mlevalmodel("{DOMAIN}", "{NAME}", length, entropy, vowel_proportion) as class
+    '''
+
+.. note::
+
+    Refer to `Build a query using LINQ
+    <https://docs.devo.com/space/latest/95191261/Build+a+query+using+LINQ>`_
+    to learn more about queries.
+
+Now we just need to create a Devo API client and launch the query. With the
+`Devo Python SDK <https://github.com/DevoInc/python-sdk>`_, among other
+features, we can execute queries against the Devo platform easily and
+securely.
+
+.. code-block::
+
+    # create a Devo API client
+    api = Client(
+        auth={"token": TOKEN},
+        address=DEVO_API_URL,
+        config=ClientConfig(
+            response="json/simple/compact",
+            stream=True,
+            processor=SIMPLECOMPACT_TO_OBJ
+        )
+    )
+
+    response = api.query(query=query, dates={'from': "now()-1*hour()"})
+
+    for row in response:
+        print(row)
+
+.. note::
+
+    Refer to `Query API <https://docs.devo.com/space/latest/95128275>`_
+    to learn more about the Devo Query API.

docs/source/tutorials/index.rst

Lines changed: 8 additions & 1 deletion
@@ -1,4 +1,11 @@
 Tutorials
 ==========
 
-WIP
+This section presents some tutorials about how to integrate the Devo ML Model
+Manager Client into the process of developing an ML model.
+
+.. toctree::
+    :maxdepth: 2
+    :caption: Contents
+
+    dga-domain-classifier

docs/source/user_guide/client-object.rst

Lines changed: 1 addition & 1 deletion
@@ -394,7 +394,7 @@ provided, if any. So the sequence of directories to look for a profile file is:
 * user's home directory, ``/home/<user>/``.
 
 If `path` referrer a file name it will search for that file in every directory
-of the set of paths, otherwise if it it referrer a path it perform the same
+of the set of paths, otherwise if it referrer a path it perform the same
 process using `modelmanager.ini` as a file name. In this way you could just
 create a file `modelmanager.ini` with your profiles, place it in the current
 directory or in your home directory and create a client without the param
