|
| 1 | +DGA domain classifier (Keras-ONNX) |
| 2 | +================================== |
| 3 | + |
| 4 | +This tutorial is related to the |
| 5 | +:ref:`DGA domain classifier using H2O engine tutorial |
| 6 | +<tutorials/dga-domain-classifier:DGA domain classifier>` |
| 7 | +but in this case `Keras <https://keras.io/>`_ is used as the machine learning engine. |
| 8 | + |
| 9 | +We are going to use the `Keras framework <https://github.com/keras-team/keras>`_ |
| 10 | +to create a model capable of detecting whether a domain is malicious |
| 11 | +or not. Then, to register and use the `Keras` model in |
| 12 | +`Devo <https://www.devo.com>`_, we will show how to convert it to `ONNX` |
| 13 | +format. |
| 14 | + |
| 15 | +.. note:: |
| 16 | + |
| 17 | + `The tutorial is available as a Jupyter notebook |
| 18 | + <https://github.com/DevoInc/python-mlmodelmanager-client/blob/main/notebooks/dga-domain-classifier-keras-onnx.ipynb>`_. |
| 19 | + |
| 20 | +Build the model |
| 21 | +--------------- |
| 22 | + |
| 23 | + |
| 24 | +Let's start by installing the required packages. |
| 25 | + |
| 26 | +.. code-block:: bash |
| 27 | +
|
| 28 | + $ pip install devo-sdk |
| 29 | + $ pip install devo-mlmodelmanager |
| 30 | + $ pip install tensorflow |
| 31 | + $ pip install tf2onnx |
| 32 | + $ pip install scikit-learn |
| 33 | + $ pip install numpy |
| 34 | + $ pip install pandas |
| 35 | +
|
| 36 | +We can start coding with the required imports. |
| 37 | + |
| 38 | +.. code-block:: |
| 39 | +
|
| 40 | + import os |
| 41 | + import math |
| 42 | + import numpy as np |
| 43 | + import pandas as pd |
| 44 | + import seaborn as sns |
| 45 | + import tensorflow as tf |
| 46 | + import tf2onnx |
| 47 | +
|
| 48 | + from collections import Counter |
| 49 | + from sklearn.preprocessing import LabelEncoder |
| 50 | + from devo.api import Client, ClientConfig, SIMPLECOMPACT_TO_OBJ |
| 51 | + from devo_ml.modelmanager import create_client_from_token, engines |
| 52 | +
|
| 53 | +Declare some constants for convenience in the code. |
| 54 | + |
| 55 | +.. code-block:: |
| 56 | +
|
| 57 | + # A valid Devo access token |
| 58 | + TOKEN = '<your_token_here>' |
| 59 | +
|
| 60 | + # URL of Devo API, e.g. https://apiv2-us.devo.com/search/query/ |
| 61 | + DEVO_API_URL = '<devo_api_url_here>' |
| 62 | +
|
| 63 | + # URL of Devo ML Model Manager, e.g. https://api-us.devo.com/mlmodelmanager/ |
| 64 | + DEVO_MLMM_URL = '<devo_mlmm_url_here>' |
| 65 | +
|
| 66 | + # The domain to connect to, e.g. self |
| 67 | + DOMAIN = '<your_domain_here>' |
| 68 | +
|
| 69 | + # The name of the model |
| 70 | + NAME = 'dga_classifier_onnx' |
| 71 | +
|
| 72 | + # The description of the model |
| 73 | + DESCRIPTION = 'DGA domain classifier (Keras-ONNX)' |
| 74 | +
|
| 75 | + # File to store the onnx model |
| 76 | + MODEL_FILE = f'{NAME}.onnx' |
| 77 | +
|
| 78 | + # The URL of a dataset to build the model |
| 79 | + DATASET_URL = "https://devo-ml-models-public-demos.s3.eu-west-3.amazonaws.com/legit_dga/dataset.csv" |
| 80 | +
|
| 81 | + VOWELS = "aeiouAEIOU" |
| 82 | +
|
| 83 | + # fix random seed for reproducibility |
| 84 | + seed = 42 |
| 85 | + np.random.seed(seed) |
| 86 | +
|
| 87 | +This `dataset |
| 88 | +<https://devo-ml-models-public-demos.s3.eu-west-3.amazonaws.com/legit_dga/dataset.csv>`_ |
| 89 | +will help us to train our model once it has been built. The dataset has the |
| 90 | +form ``host;domain;class;subclass``. |
| 91 | + |
| 92 | +.. code-block:: text |
| 93 | +
|
| 94 | + host;domain;class;subclass |
| 95 | + 000directory.com.ar;000directory;legit;legit |
| 96 | + 001fans.com;001fans;legit;legit |
| 97 | + ... |
| 98 | + 1002n0q11m17h017r1shexghfqf.net;1002n0q11m17h017r1shexghfqf;dga;newgoz |
| 99 | + 100bestbuy.com;100bestbuy;legit;legit |
| 100 | + ... |
| 101 | +
|
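As a quick illustration of this layout, the sample rows above can be parsed with Python's standard ``csv`` module. This is only a sketch on an inline string (the ``SAMPLE`` and ``rows`` names are illustrative and not part of the tutorial); the tutorial itself loads the full file with pandas.

```python
import csv
import io

# A few rows copied from the sample above (semicolon-separated).
SAMPLE = """host;domain;class;subclass
000directory.com.ar;000directory;legit;legit
001fans.com;001fans;legit;legit
1002n0q11m17h017r1shexghfqf.net;1002n0q11m17h017r1shexghfqf;dga;newgoz
"""

# DictReader uses the header row as keys; the delimiter must be set explicitly.
rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter=";"))

for row in rows:
    print(row["domain"], row["class"], row["subclass"])
```

The ``class`` column is what later becomes the ``malicious`` flag, while ``subclass`` names the DGA family when one is known.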
| 102 | +In the dataset preparation we will add the columns ``length``, ``entropy`` and |
| 103 | +``vowel_proportion`` for each domain, and also the flag ``malicious`` indicating |
| 104 | +if it is a DGA domain according to the ``class`` column value. |
| 105 | + |
| 106 | +.. code-block:: |
| 107 | +
|
| 108 | + def entropy(s): |
| 109 | + l = len(s) |
| 110 | + return -sum(map(lambda a: (a/l)*math.log2(a/l), Counter(s).values())) |
| 111 | +
|
| 112 | + domains = pd.read_csv(DATASET_URL, sep=';') |
| 113 | +
|
| 114 | + domains = domains[~domains['subclass'].isna()] |
| 115 | + domains['length'] = domains['domain'].str.len() |
| 116 | + domains['entropy'] = domains['domain'].apply(entropy) |
| 117 | + domains['vowel_proportion'] = 0 |
| 118 | + for v in VOWELS: |
| 119 | + domains['vowel_proportion'] += domains['domain'].str.count(v) |
| 120 | + domains['vowel_proportion'] /= domains['length'] |
| 121 | + domains['malicious'] = domains['class'] != 'legit' |
| 122 | +
|
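To make the three features concrete, here is a minimal, standard-library-only sketch that computes them for a single domain. The helper name ``domain_features`` is hypothetical and not part of the tutorial; for ``000directory`` the values should agree with the first row of the ``domains.head()`` output.

```python
import math
from collections import Counter

VOWELS = "aeiouAEIOU"


def entropy(s):
    """Shannon entropy (in bits) of the character distribution of s."""
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def domain_features(domain):
    """Hypothetical helper: the three model inputs for one domain."""
    length = len(domain)
    vowels = sum(1 for ch in domain if ch in VOWELS)
    return {
        "length": length,
        "entropy": entropy(domain),
        "vowel_proportion": vowels / length,
    }


features = domain_features("000directory")
print(features)
```

Computing the features per string like this is handy for spot-checking individual domains; the vectorised pandas version above is the one to use on the whole dataset.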
| 123 | +After preparation, our dataset of domains should look like this. |
| 124 | + |
| 125 | +.. code-block:: |
| 126 | +
|
| 127 | + >>> domains.head() |
| 128 | + host domain class subclass length entropy vowel_proportion malicious |
| 129 | + 0 000directory.com.ar 000directory legit legit 12 3.022055 0.250000 False |
| 130 | + 1 000webhost.com 000webhost legit legit 10 2.846439 0.200000 False |
| 131 | + 2 001fans.com 001fans legit legit 7 2.521641 0.142857 False |
| 132 | + 3 01-telecharger.com 01-telecharger legit legit 14 3.324863 0.285714 False |
| 133 | + 4 010shangpu.com 010shangpu legit legit 10 3.121928 0.200000 False |
| 134 | +
|
| 135 | +.. note:: |
| 136 | + |
| 137 | + Be aware that our dataset is a `pandas.DataFrame |
| 138 | + <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`_. |
| 139 | + |
| 140 | +We are now ready to build the model. We will rely on a |
| 141 | +`Keras Sequential model <https://keras.io/guides/sequential_model/>`_ for that. |
| 142 | + |
| 143 | +.. code-block:: |
| 144 | +
|
| 145 | + model = tf.keras.models.Sequential() |
| 146 | + model.add(tf.keras.layers.Dense( |
| 147 | + 10, |
| 148 | + input_dim=3, |
| 149 | + activation=tf.nn.relu, |
| 150 | + kernel_initializer='he_normal', |
| 151 | + kernel_regularizer=tf.keras.regularizers.l2(0.01) |
| 152 | + )) |
| 153 | + model.add(tf.keras.layers.BatchNormalization()) |
| 154 | + model.add(tf.keras.layers.Dropout(0.3)) |
| 155 | + model.add(tf.keras.layers.Dense( |
| 156 | + 7, |
| 157 | + activation=tf.nn.relu, |
| 158 | + kernel_initializer='he_normal', |
| 159 | + kernel_regularizer=tf.keras.regularizers.l1_l2(l1=0.001, l2=0.001) |
| 160 | + )) |
| 161 | + model.add(tf.keras.layers.BatchNormalization()) |
| 162 | + model.add(tf.keras.layers.Dropout(0.3)) |
| 163 | + model.add(tf.keras.layers.Dense( |
| 164 | + 5, |
| 165 | + activation=tf.nn.relu, |
| 166 | + kernel_initializer='he_normal', |
| 167 | + kernel_regularizer=tf.keras.regularizers.l1_l2(l1=0.001, l2=0.001) |
| 168 | + )) |
| 169 | + model.add(tf.keras.layers.Dense(2, activation=tf.nn.softmax)) |
| 170 | +
|
| 171 | +Before we can train our model we have to properly transform the data for `Keras`. |
| 172 | + |
| 173 | +.. code-block:: |
| 174 | +
|
| 175 | + Y = domains['malicious'] |
| 176 | + X = domains.drop( |
| 177 | + ['host', 'domain', 'class', 'subclass', 'malicious'], |
| 178 | + axis=1 |
| 179 | + ) |
| 180 | +
|
| 181 | + # The categorical_crossentropy loss used below expects one-hot encoded labels. |
| 182 | + lbl_clf = LabelEncoder() |
| 183 | + Y_final = tf.keras.utils.to_categorical(lbl_clf.fit_transform(Y)) |
| 184 | +
|
| 185 | +Let's train our model with our transformed datasets, ``X`` and ``Y_final``. |
| 186 | + |
| 187 | +.. code-block:: |
| 188 | +
|
| 189 | + model.compile( |
| 190 | + optimizer='adam', |
| 191 | + loss='categorical_crossentropy', |
| 192 | + metrics=['accuracy'] |
| 193 | + ) |
| 194 | + model.fit(X, Y_final, epochs=10, batch_size=7) |
| 195 | +
|
| 196 | +You will see the progress of the training in the output, something like this. |
| 197 | + |
| 198 | +.. code-block:: |
| 199 | +
|
| 200 | + >>> |
| 201 | + Epoch 1/10 |
| 202 | + 19133/19133 [==============================] - 59s 3ms/step - loss: 0.4520 - accuracy: 0.8100 |
| 203 | + Epoch 2/10 |
| 204 | + 19133/19133 [==============================] - 58s 3ms/step - loss: 0.4413 - accuracy: 0.8037 |
| 205 | + Epoch 3/10 |
| 206 | + 19133/19133 [==============================] - 54s 3ms/step - loss: 0.4282 - accuracy: 0.8098 |
| 207 | + Epoch 4/10 |
| 208 | + 19133/19133 [==============================] - 54s 3ms/step - loss: 0.4301 - accuracy: 0.8098 |
| 209 | + Epoch 5/10 |
| 210 | + 19133/19133 [==============================] - 55s 3ms/step - loss: 0.4299 - accuracy: 0.8085 |
| 211 | + Epoch 6/10 |
| 212 | + 19133/19133 [==============================] - 55s 3ms/step - loss: 0.4249 - accuracy: 0.8124 |
| 213 | + Epoch 7/10 |
| 214 | + 19133/19133 [==============================] - 54s 3ms/step - loss: 0.4284 - accuracy: 0.8101 |
| 215 | + Epoch 8/10 |
| 216 | + 19133/19133 [==============================] - 57s 3ms/step - loss: 0.4292 - accuracy: 0.8083 |
| 217 | + Epoch 9/10 |
| 218 | + 19133/19133 [==============================] - 58s 3ms/step - loss: 0.4295 - accuracy: 0.8096 |
| 219 | + Epoch 10/10 |
| 220 | + 19133/19133 [==============================] - 57s 3ms/step - loss: 0.4278 - accuracy: 0.8091 |
| 221 | + <keras.callbacks.History at 0x7f02e1620610> |
| 222 | +
|
| 223 | +.. note:: |
| 224 | + |
| 225 | + The `Keras framework` is beyond the scope of this tutorial. Please refer |
| 226 | + to the `Keras API reference <https://keras.io/api/>`_ to learn more. |
| 227 | + |
| 228 | +Register the model |
| 229 | +------------------ |
| 230 | + |
| 231 | +In order to register the model in Devo we need to transform it to `ONNX` format |
| 232 | +first. |
| 233 | + |
| 234 | +We will use the |
| 235 | +`tf2onnx <https://onnxruntime.ai/docs/tutorials/tf-get-started.html>`_ |
| 236 | +tool to convert our `Keras` model to `ONNX`. |
| 237 | + |
| 238 | +.. code-block:: |
| 239 | +
|
| 240 | + tf2onnx.convert.from_keras(model, opset=13, output_path=MODEL_FILE) |
| 241 | +
|
| 242 | +Once the model has been transformed and saved, it must be registered on the |
| 243 | +Devo platform in order to exploit it. |
| 244 | + |
| 245 | +We will use the ML Model Manager Client for that. |
| 246 | + |
| 247 | +.. code-block:: |
| 248 | +
|
| 249 | + # create the mlmm client |
| 250 | + mlmm = create_client_from_token(DEVO_MLMM_URL, TOKEN) |
| 251 | +
|
| 252 | + # register the model |
| 253 | + mlmm.add_model( |
| 254 | + NAME, |
| 255 | + engines.ONNX, |
| 256 | + MODEL_FILE, |
| 257 | + description=DESCRIPTION, |
| 258 | + force=True |
| 259 | + ) |
| 260 | +
|
| 261 | +.. note:: |
| 262 | + |
| 263 | + Refer to :ref:`user_guide/index:User's Guide` of this documentation to learn |
| 264 | + more about the ML Model Manager Client. |
| 265 | + |
| 266 | +We now have everything ready to exploit our model, i.e. to detect |
| 267 | +malicious domains. |
| 268 | + |
| 269 | +Classify domains |
| 270 | +---------------- |
| 271 | + |
| 272 | +One way to evaluate a model is to use the ``mlevalmodel(...)`` operator when |
| 273 | +querying a table. The ``mlevalmodel(...)`` operator is capable of evaluating |
| 274 | +machine learning models and is available in the Devo query engine. |
| 275 | + |
| 276 | +We are going to use the ``demo.ecommerce.data`` table, which contains the |
| 277 | +``referralUri`` field, from which we can extract the domain we want to check. |
| 278 | + |
| 279 | +A suitable query could look like this. |
| 280 | + |
| 281 | +.. code-block:: |
| 282 | +
|
| 283 | + query = f'''from demo.ecommerce.data |
| 284 | + select split(referralUri, "/",2) as domain, |
| 285 | + float(length(domain)) as length, |
| 286 | + shannonentropy(domain) as entropy, |
| 287 | + float(countbyfilter(domain, "{VOWELS}")) as vowel_proportion, |
| 288 | + at(mlevalmodel("{DOMAIN}", "{NAME}", [float4(length), float4(entropy), float4(vowel_proportion)]),0) as res, |
| 289 | + ifthenelse(res>0.5, "false", "true") as isMalicious |
| 290 | + ''' |
| 291 | +
|
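The last two lines of the query can be read as a small decision rule. The sketch below mirrors it in plain Python, assuming that index 0 of the model output is the probability of the non-malicious class (``LabelEncoder`` sorts ``False`` before ``True``, so ``to_categorical`` places ``False`` at index 0). The function name ``is_malicious`` and the example scores are hypothetical.

```python
def is_malicious(probabilities, threshold=0.5):
    """Mirror of ifthenelse(res > 0.5, "false", "true") from the query.

    probabilities[0] is assumed to be the softmax score of the
    non-malicious class, i.e. the value the query calls ``res``.
    """
    res = probabilities[0]
    return "false" if res > threshold else "true"


# Made-up softmax outputs, used only to exercise both branches.
print(is_malicious([0.93, 0.07]))
print(is_malicious([0.12, 0.88]))
```

Because the two softmax outputs sum to one, thresholding index 0 at 0.5 amounts to picking the more probable class.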
| 292 | +.. note:: |
| 293 | + |
| 294 | + Refer to `Build a query using LINQ |
| 295 | + <https://docs.devo.com/space/latest/95191261/Build+a+query+using+LINQ>`_ |
| 296 | + to learn more about queries. |
| 297 | + |
| 298 | +Now we just need to create a Devo API client and launch the |
| 299 | +query. |
| 300 | + |
| 301 | +With the `Devo Python SDK <https://github.com/DevoInc/python-sdk>`_, |
| 302 | +among other features, we can execute queries against the Devo platform easily |
| 303 | +and securely. |
| 304 | + |
| 305 | +.. code-block:: |
| 306 | +
|
| 307 | + # create a Devo API client |
| 308 | + api = Client( |
| 309 | + auth={"token": TOKEN}, |
| 310 | + address=DEVO_API_URL, |
| 311 | + config=ClientConfig( |
| 312 | + response="json/simple/compact", |
| 313 | + stream=True, |
| 314 | + processor=SIMPLECOMPACT_TO_OBJ |
| 315 | + ) |
| 316 | + ) |
| 317 | +
|
| 318 | + response = api.query(query=query, dates={'from': "now()-1*hour()"}) |
| 319 | +
|
| 320 | + for row in response: |
| 321 | + print("domain:", row['domain'], "isMalicious:", row['isMalicious']) |
| 322 | +
|
| 323 | +You will see a result like the following depending on the contents of the |
| 324 | +``demo.ecommerce.data`` table. |
| 325 | + |
| 326 | +.. code-block:: |
| 327 | +
|
| 328 | + >>> |
| 329 | + domain: www.logcasts.com isMalicious: false |
| 330 | + domain: www.google.com isMalicious: false |
| 331 | + domain: www.logtrust.com isMalicious: false |
| 332 | + ... |
| 333 | +
|
| 334 | +.. note:: |
| 335 | + |
| 336 | + Refer to `Query API <https://docs.devo.com/space/latest/95128275>`_ |
| 337 | + to learn more about the Devo Query API. |