| 1 | +DGA domain classifier |
| 2 | +===================== |
| 3 | + |
| 4 | +This tutorial shows how to classify `DGA` domains in real time by applying a |
| 5 | +machine learning model to logs stored in the `Devo <https://www.devo.com>`_ |
| 6 | +platform. |
| 7 | + |
| 8 | +Domain generation algorithms (`DGA`) are used by various malware families to |
| 9 | +periodically generate large numbers of domain names that can serve as |
| 10 | +rendezvous points with their command and control servers. |
| 11 | + |
| 12 | +For example, an infected computer could create thousands of domain names such |
| 13 | +as ``www.<gibberish>.com`` and attempt to contact a portion of them in order |
| 14 | +to receive updates or commands. See |
| 15 | +`DGA on Wikipedia <https://en.wikipedia.org/wiki/Domain_generation_algorithm>`_ |
| 16 | +to learn more about domain generation algorithms. |
| 17 | + |
| 18 | +We are going to develop a simple machine learning model capable of classifying |
| 19 | +malicious domains in the *demo.ecommerce.data* table. We will use the |
| 20 | +`H2O <https://h2o.ai/>`_ machine learning framework to build and train the model, |
| 21 | +the Machine Learning Model Manager Client to register it on the Devo platform, |
| 22 | +and the `Devo Python SDK <https://github.com/DevoInc/python-sdk>`_ to run a |
| 23 | +query that evaluates the model and classifies the table's domains in real time. |
| 24 | + |
| 25 | +.. note:: |
| 26 | + |
| 27 | + `The tutorial is available as a Jupyter notebook |
| 28 | + <https://github.com/DevoInc/python-mlmodelmanager-client/blob/main/notebooks/dga-domain-classifier.ipynb>`_. |
| 29 | + |
| 30 | +Build the model |
| 31 | +--------------- |
| 32 | + |
| 33 | +Let's start by installing the required packages. |
| 34 | + |
| 35 | +.. code-block:: bash |
| 36 | +
|
| 37 | + $ pip install devo-sdk |
| 38 | + $ pip install devo-mlmodelmanager |
| 39 | + $ pip install h2o |
| 40 | +
|
| 41 | +Then add the required imports. |
| 42 | + |
| 43 | +.. code-block:: |
| 44 | +
|
| 45 | + import os |
| 46 | + import h2o |
| 47 | +
|
| 48 | + from h2o.estimators import H2OGradientBoostingEstimator |
| 49 | + from devo.api import Client, ClientConfig, SIMPLECOMPACT_TO_OBJ |
| 50 | + from devo_ml.modelmanager import create_client_from_token, engines |
| 51 | +
|
| 52 | +Declare some constants for convenience in the code. |
| 53 | + |
| 54 | +.. code-block:: |
| 55 | +
|
| 56 | + # A valid Devo access token |
| 57 | + TOKEN = '<your_token_here>' |
| 58 | +
|
| 59 | + # URL of Devo API, e.g. https://apiv2-us.devo.com/search/query/ |
| 60 | + DEVO_API_URL = '<devo_api_url_here>' |
| 61 | +
|
| 62 | + # URL of Devo ML Model Manager, e.g. https://api-us.devo.com/mlmodelmanager/ |
| 63 | + DEVO_MLMM_URL = '<devo_mlmm_url_here>' |
| 64 | +
|
| 65 | + # The domain to connect to, e.g. self |
| 66 | + DOMAIN = '<your_domain_here>' |
| 67 | +
|
| 68 | + # The name of the model |
| 69 | + NAME = 'dga_classifier' |
| 70 | +
|
| 71 | + # The description of the model |
| 72 | + DESCRIPTION = 'DGA domain classifier' |
| 73 | +
|
| 74 | + # The path where the model file will be stored ("~" is expanded to the home directory) |
| 75 | + MODELS_PATH = os.path.expanduser('~/models') |
| 76 | +
|
| 77 | + # The URL of a dataset to build the model |
| 78 | + DATASET_URL = "https://devo-ml-models-public-demos.s3.eu-west-3.amazonaws.com/legit_dga/dataset.csv" |
| 79 | +
|
| 80 | + VOWELS = "aeiouAEIOU" |
| 81 | +
|
| 82 | +We use the `h2o <https://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/index.html>`_ |
| 83 | +library to create a model capable of detecting whether a domain is malicious. |
| 84 | +The model is trained on this `dataset |
| 85 | +<https://devo-ml-models-public-demos.s3.eu-west-3.amazonaws.com/legit_dga/dataset.csv>`_, |
| 86 | +which has the form *host;domain;class;subclass*. |
| 87 | + |
| 88 | +.. code-block:: text |
| 89 | +
|
| 90 | + host;domain;class;subclass |
| 91 | + 000directory.com.ar;000directory;legit;legit |
| 92 | + 000webhost.com;000webhost;legit;legit |
| 93 | + 001fans.com;001fans;legit;legit |
| 94 | + ... |
| 95 | + 1002n0q11m17h017r1shexghfqf.net;1002n0q11m17h017r1shexghfqf;dga;newgoz |
| 96 | + 1002ra86698fjpgqke1cdvbk5.org;1002ra86698fjpgqke1cdvbk5;dga;newgoz |
| 97 | + 1008bnt1iekzdt1fqjb76pijxhr.org;1008bnt1iekzdt1fqjb76pijxhr;dga;newgoz |
| 98 | + 100bestbuy.com;100bestbuy;legit;legit |
| 99 | + ... |
| 100 | +
|
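To make the dataset layout concrete, here is a minimal sketch that parses the semicolon-separated format with the Python standard library instead of H2O. The ``SAMPLE`` text reuses rows from the listing above; the ``parse_domains`` helper and its ``malicious`` key are illustrative names, not part of the tutorial's code.

```python
import csv
import io

# A few rows copied from the listing above (header included).
SAMPLE = """host;domain;class;subclass
000directory.com.ar;000directory;legit;legit
1002n0q11m17h017r1shexghfqf.net;1002n0q11m17h017r1shexghfqf;dga;newgoz
100bestbuy.com;100bestbuy;legit;legit
"""


def parse_domains(text):
    """Parse 'host;domain;class;subclass' rows into a list of dicts.

    Hypothetical helper: each row also gets a boolean 'malicious' key,
    True whenever the 'class' column is anything other than 'legit'.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    rows = []
    for row in reader:
        row["malicious"] = row["class"] != "legit"
        rows.append(row)
    return rows


rows = parse_domains(SAMPLE)
for row in rows:
    print(row["domain"], row["subclass"], row["malicious"])
```

The tutorial itself loads the full file with ``h2o.import_file``; this sketch only illustrates the column layout and how a label can be read off the ``class`` column.
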
| 101 | +During dataset preparation we add the columns ``length``, ``entropy`` and |
| 102 | +``vowel_proportion`` for each domain, plus the flag ``malicious``, which |
| 103 | +indicates whether it is a DGA domain according to the ``class`` column value. |
| 104 | + |
| 105 | +The result is a model saved as a file in ``~/models``. |
| 106 | + |
| 107 | +.. code-block:: |
| 108 | +
|
| 109 | + h2o.init() |
| 110 | +
|
| 111 | + # import dataset |
| 112 | + domains = h2o.import_file(DATASET_URL, header=1) |
| 113 | +
|
| 114 | + # Prepare dataset |
| 115 | + domains = domains[~domains['subclass'].isna()] |
| 116 | + domains['length'] = domains['domain'].nchar() |
| 117 | + domains['entropy'] = domains['domain'].entropy() |
| 118 | + domains['vowel_proportion'] = 0 |
| 119 | + for v in VOWELS: |
| 120 | + domains['vowel_proportion'] += domains['domain'].countmatches(v) |
| 121 | + domains['vowel_proportion'] /= domains['length'] |
| 122 | + domains['malicious'] = domains['class'] != 'legit' |
| 123 | + domains['malicious'] = domains['malicious'].asfactor() |
| 124 | +
|
| 125 | + # split dataset |
| 126 | + train, valid = domains.split_frame(ratios=[.8], seed=1234) |
| 127 | +
|
| 128 | + # create and train the model |
| 129 | + model = H2OGradientBoostingEstimator(model_id=NAME) |
| 130 | + model.train( |
| 131 | + x=['length', 'entropy', 'vowel_proportion'], |
| 132 | + y='malicious', |
| 133 | + training_frame=train, |
| 134 | + validation_frame=valid |
| 135 | + ) |
| 136 | +
|
| 137 | + # save the model |
| 138 | + os.makedirs(MODELS_PATH, exist_ok=True) |
| 139 | + model.download_mojo(path=MODELS_PATH) |
| 140 | +
|
| 141 | + h2o.cluster().shutdown() |
| 142 | +
|
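The three features used for training can also be computed for a single domain in plain Python, which is handy for sanity-checking values. This is a rough sketch, not H2O's implementation: it assumes the entropy is a base-2 Shannon entropy over the characters of the string, and the helper names ``shannon_entropy`` and ``domain_features`` are hypothetical.

```python
import math
from collections import Counter

VOWELS = "aeiouAEIOU"


def shannon_entropy(text: str) -> float:
    """Base-2 Shannon entropy of the characters in text (0.0 if empty)."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def domain_features(domain: str) -> dict:
    """Return length, entropy and vowel proportion for one domain name."""
    length = len(domain)
    vowels = sum(1 for ch in domain if ch in VOWELS)
    return {
        "length": length,
        "entropy": shannon_entropy(domain),
        # Proportion, not count: vowels divided by the domain length.
        "vowel_proportion": vowels / length if length else 0.0,
    }


print(domain_features("000webhost"))
print(domain_features("1002n0q11m17h017r1shexghfqf"))
```

Whatever features the model is trained on must be computed the same way at evaluation time, so a helper like this can serve as a reference when comparing against the values produced by a query.
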
| 143 | +.. note:: |
| 144 | + |
| 145 | + The aim of this tutorial is to show how the ML Model Manager Client fits |
| 146 | + into the machine learning process, not to develop an optimal and accurate |
| 147 | + machine learning model. |
| 148 | + |
| 149 | +Register the model |
| 150 | +------------------ |
| 151 | + |
| 152 | +Once the model has been developed and saved, it must be registered on the |
| 153 | +Devo platform before it can be used. For this we will use the ML Model Manager |
| 154 | +Client. |
| 155 | + |
| 156 | +.. code-block:: |
| 157 | +
|
| 158 | + # create the mlmm client |
| 159 | + client = create_client_from_token(DEVO_MLMM_URL, TOKEN) |
| 160 | +
|
| 161 | + # register the model |
| 162 | + client.add_model( |
| 163 | + NAME, |
| 164 | + engines.H2O, |
| 165 | + os.path.join(MODELS_PATH, f"{NAME}.zip"), |
| 166 | + description=DESCRIPTION, |
| 167 | + force=True |
| 168 | + ) |
| 169 | +
|
| 170 | +.. note:: |
| 171 | + |
| 172 | + Refer to the :ref:`user's guide <User's Guide>` of this documentation to |
| 173 | + learn more about the ML Model Manager Client. |
| 174 | + |
| 175 | +We now have everything ready to use our model. |
| 176 | + |
| 177 | +Classify domains |
| 178 | +---------------- |
| 179 | + |
| 180 | +One way to evaluate a model is by querying a table with the |
| 181 | +``mlevalmodel(...)`` operator, which the Devo query engine provides for |
| 182 | +evaluating machine learning models. |
| 183 | + |
| 184 | +We are going to use the *demo.ecommerce.data* table, which contains the |
| 185 | +``referralUri`` field, from which we can extract the domain we want to check. |
| 186 | + |
| 187 | +A suitable query could look like this. |
| 188 | + |
| 189 | +.. code-block:: |
| 190 | +
|
| 191 | + query = f'''from demo.ecommerce.data |
| 192 | + select split(referralUri, "/",2) as domain, |
| 193 | + float(length(domain)) as length, |
| 194 | + shannonentropy(domain) as entropy, |
| 195 | + float(countbyfilter(domain, "{VOWELS}")) / length as vowel_proportion, |
| 196 | + mlevalmodel("{DOMAIN}", "{NAME}", length, entropy, vowel_proportion) as class |
| 197 | + ''' |
| 198 | +
|
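To see what the domain extraction in the query is meant to produce, the sketch below mimics it in Python. It assumes that ``split(referralUri, "/", 2)`` returns the piece at zero-based index 2 after splitting on ``/``; that reading, the ``domain_from_uri`` helper and the example URI are assumptions for illustration only.

```python
from urllib.parse import urlsplit


def domain_from_uri(uri: str) -> str:
    """Return the piece at index 2 after splitting on '/', or '' if absent.

    For 'scheme://host/path' the pieces are ['scheme:', '', 'host', 'path'],
    so index 2 is the host part.
    """
    parts = uri.split("/")
    return parts[2] if len(parts) > 2 else ""


# Hypothetical referral URI, used only to illustrate the split.
uri = "http://www.example.com/category/item?id=1"
print(domain_from_uri(uri))
# For simple URLs this matches the standard library's notion of the host.
print(urlsplit(uri).netloc)
```

A value without a scheme, such as ``www.example.com/page``, has no piece at index 2 under this reading, so rows with unusual ``referralUri`` values may need separate handling.
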
| 199 | +.. note:: |
| 200 | + |
| 201 | + Refer to `Build a query using LINQ |
| 202 | + <https://docs.devo.com/space/latest/95191261/Build+a+query+using+LINQ>`_ |
| 203 | + to learn more about queries. |
| 204 | + |
| 205 | +Now we just need to create a Devo API client and launch the query. The |
| 206 | +`Devo Python SDK <https://github.com/DevoInc/python-sdk>`_ lets us, among |
| 207 | +other features, execute queries against the Devo platform easily and |
| 208 | +securely. |
| 209 | + |
| 210 | +.. code-block:: |
| 211 | +
|
| 212 | + # create a Devo API client |
| 213 | + api = Client( |
| 214 | + auth={"token": TOKEN}, |
| 215 | + address=DEVO_API_URL, |
| 216 | + config=ClientConfig( |
| 217 | + response="json/simple/compact", |
| 218 | + stream=True, |
| 219 | + processor=SIMPLECOMPACT_TO_OBJ |
| 220 | + ) |
| 221 | + ) |
| 222 | +
|
| 223 | + response = api.query(query=query, dates={'from': "now()-1*hour()"}) |
| 224 | +
|
| 225 | + for row in response: |
| 226 | + print(row) |
| 227 | +
|
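Printing every row is enough for a demo, but a summary is often more useful. The sketch below tallies the values of the ``class`` field. It assumes each row is a dict keyed by the selected field names, and the label values in ``sample_rows`` are placeholders, since the exact values returned by ``mlevalmodel`` are not shown in this tutorial.

```python
from collections import Counter


def summarize_classes(rows, field="class"):
    """Count how many rows carry each value of the given field.

    Assumes every row is a dict; rows lacking the field are counted
    under None.
    """
    return Counter(row.get(field) for row in rows)


# Placeholder rows standing in for the query response (labels are made up).
sample_rows = [
    {"domain": "www.example.com", "class": "label_a"},
    {"domain": "www.example.org", "class": "label_a"},
    {"domain": "x1y2z3q9.example.net", "class": "label_b"},
]

summary = summarize_classes(sample_rows)
for label, count in summary.most_common():
    print(label, count)
```

With a live response, ``summarize_classes(response)`` would consume the iterator, so it is an alternative to the ``print`` loop rather than an addition to it.
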
| 228 | +.. note:: |
| 229 | + |
| 230 | + Refer to `Query API <https://docs.devo.com/space/latest/95128275>`_ |
| 231 | + to learn more about the Devo Query API. |