|
| 1 | +Entity classification with Flow |
| 2 | +=============================== |
| 3 | + |
| 4 | +This tutorial shows how to evaluate a machine learning model using the Devo |
| 5 | +Platform's very own correlation engine: |
| 6 | +`Flow <https://docs.devo.com/space/latest/95213164/Flow>`_. |
| 7 | + |
| 8 | +We will use the model developed in the tutorial |
| 9 | +:ref:`tutorials/entity-classification-kmeans-onnx:Entity classification (kmeans-ONNX)` |
| 10 | +to demonstrate how to classify new data automatically with the desired |
| 11 | +periodicity. |
| 12 | + |
| 13 | +We will also learn how to subscribe to updates of the model itself in order |
| 14 | +to have the latest improvements of its training. |
| 15 | + |
| 16 | +Requirements |
| 17 | +------------ |
| 18 | + |
| 19 | +* Devo table ``demo.ecommerce.data``. |
| 20 | +* ML model from the tutorial |
| 21 | + :ref:`entity classification <tutorials/entity-classification-kmeans-onnx:Entity classification (kmeans-ONNX)>` |
| 22 | + registered in the `Devo <https://www.devo.com>`_ platform. |
| 23 | + |
| 24 | +Create the Flow |
| 25 | +--------------- |
| 26 | + |
| 27 | +One of the uses of `Flow` is the transformation and manipulation of data |
| 28 | +through a series of defined steps. The output of one step is the input to the |
| 29 | +next step(s) and so on. In this way we can define a complex data flow to achieve |
| 30 | +a desired goal. The set of steps to perform a task is known as `context`. |
| 31 | + |
| 32 | +Each of steps in a context is represented by a `Unit <https://docs.devo.com/space/latest/95213765>`_. |
| 33 | +`Flow` provides a wide variety of units to perform lot of individual subtasks. |
| 34 | + |
| 35 | +A relevant unit for our use case is the |
| 36 | +`ML Single Model Evaluator <https://docs.devo.com/space/latest/95214962/ML+Single+Model+Evaluator>`_ |
| 37 | +which is capable of evaluating an ML model, so we need to create a context that |
| 38 | +includes this unit, in addition to others. |
| 39 | + |
| 40 | +This is the layout of the context proposed for our task. |
| 41 | + |
| 42 | +.. image:: /_static/EvalModelTemplate.png |
| 43 | + |
| 44 | +We can |
| 45 | +`create the context from scratch <https://docs.devo.com/space/latest/95213200/Creating+a+Flow>`_, |
| 46 | +but there is an easier way to do it if we have the context specification, which |
| 47 | +is our case. |
| 48 | + |
| 49 | +The :download:`EvalModelTemplate.json </_static/EvalModelTemplate.json>` |
| 50 | +specification contains all the specification to create the suitable context. |
| 51 | +We can import the context from ``Flow welcome page`` of `Flow` application |
| 52 | +using the option ``Import from json``, so download it and import it. |
| 53 | + |
| 54 | +.. code-block:: |
| 55 | + :caption: EvalModelTemplate.json |
| 56 | +
|
| 57 | + { |
| 58 | + "id": null, |
| 59 | + "description": "", |
| 60 | + "name": "EvalModelTemplate", |
| 61 | + "type": "app.Context", |
| 62 | + "templateFormat": "%mono", |
| 63 | + "loggedItems": [], |
| 64 | + "variableConf": { |
| 65 | + "variables": [{ |
| 66 | + "id": "d310c38c-pa41-iebc-9cfa-d52483cffdbb", |
| 67 | + "name": "OUT_TABLE", |
| 68 | + "description": "Devo table to write to.", |
| 69 | + "type": "string", |
| 70 | + "value": "my.app.evalmodel.demo" |
| 71 | + }, |
| 72 | + ... |
| 73 | + } |
| 74 | +
|
| 75 | +.. note:: |
| 76 | + |
| 77 | + The specification of a context is usually very long, here it has been |
| 78 | + shortened for tutorial readability. |
| 79 | + |
| 80 | +At this point we should have a ready-to-use context that reads data from the |
| 81 | +``demo.ecommerce.data`` table, evaluates it through an ML model and writes the |
| 82 | +results to the ``my.app.evalmodel.demo`` table, all just out of the box |
| 83 | +importing a context specification. |
| 84 | + |
| 85 | +But a context does not have to be static, it can be adjusted by variables and |
| 86 | +its units are also configurable, let's see how. |
| 87 | + |
| 88 | +Variables |
| 89 | +--------- |
| 90 | + |
| 91 | +Although the context is ready out of the box after import, we can adjust it by |
| 92 | +means of a set of variables |
| 93 | + |
| 94 | +From ``File \ Variables`` option of the main menu of `Flow` application you |
| 95 | +can tune the variable values. |
| 96 | + |
| 97 | +.. list-table:: Variable defaults |
| 98 | + :header-rows: 1 |
| 99 | + |
| 100 | + * - NAME |
| 101 | + - TYPE |
| 102 | + - DEFINITION |
| 103 | + - DEFAULT VALUE |
| 104 | + * - EVERY |
| 105 | + - ``String`` |
| 106 | + - Cron expression to define how often queries will be launched |
| 107 | + - ``0 0 0 ? * * *`` |
| 108 | + * - QUERY_DURATION |
| 109 | + - ``Long`` |
| 110 | + - Length in milliseconds of the period the query should consider |
| 111 | + - ``86400000`` |
| 112 | + * - QUERY_MODULE |
| 113 | + - ``Integer`` |
| 114 | + - Number of chunks to split the query |
| 115 | + - ``24`` |
| 116 | + * - QUERY_COLUMN |
| 117 | + - ``String`` |
| 118 | + - Name of the query column used to split the query into chunks |
| 119 | + - ``clientIpAddress`` |
| 120 | + * - OUT_TABLE |
| 121 | + - ``String`` |
| 122 | + - Name of the table where results will be written |
| 123 | + - ``my.app.evalmodel.demo`` |
| 124 | + |
| 125 | +If you want to change the ``EVERY`` variable be aware that you must change the |
| 126 | +``QUERY_DURATION`` variable accordingly. For instance, if you want to run the |
| 127 | +query every 2 hours, ``QUERY_DURATION`` should be ``7200000`` |
| 128 | +`(2h * 60min/h * 60s/min * 1000ms/s)`. |
| 129 | + |
| 130 | +Units overview |
| 131 | +-------------- |
| 132 | + |
| 133 | +Let's see the units of the layout. |
| 134 | + |
| 135 | +#. | **EveryX** [`event.Scheduler <https://docs.devo.com/space/latest/95214759/Scheduler>`_] |
| 136 | + |
| 137 | + | This unit is configured to start working when the context is run. |
| 138 | + Everytime this unit sends an event, queries will be launched. To change |
| 139 | + how often queries should be launched, you can change the value of the Flow |
| 140 | + variable ``EVERY``. By default, it is configured to send an event everyday |
| 141 | + at midnight. |
| 142 | +
|
| 143 | +#. | **SetFromAndToDates** [`core.Map <https://docs.devo.com/space/latest/95214080/Map>`_] |
| 144 | + | This unit uses the date generated in the previous unit to set the ``from`` |
| 145 | + and ``to`` timestamps for the query. The variable ``QUERY_DURATION`` sets |
| 146 | + the ``from`` timestamp. By default, it is configured to 24h. |
| 147 | +
|
| 148 | +#. | **MultiplyByNToSplitQuery** [`event.Repeater <https://docs.devo.com/space/latest/95214715/Repeater>`_] |
| 149 | + | This unit generates ``N`` events for every event received, all of them |
| 150 | + with the same content except for a ``counter``. We will use that counter |
| 151 | + to split the results of the query when there are too many values. |
| 152 | +
|
| 153 | +#. | **QueryData** [`devo.DevoFullQuery <https://docs.devo.com/space/latest/95214466/Devo+Full+Query>`_] |
| 154 | + | This is the unit that gets data from the table ``demo.ecommerce.data`` |
| 155 | + using the Query Engine and the dates calculated in the **SetFromAndToDates** |
| 156 | + unit. We added a module operation to split the query into smaller queries |
| 157 | + just in case there are too many results to handle in one go. |
| 158 | +
|
| 159 | + | Note that this unit is connected to the **ModelEvaluator** unit that |
| 160 | + evaluates the input with a machine learning model, therefore the output of |
| 161 | + the unit must match the input fields definition of the ML model, i.e., same |
| 162 | + name and type. In our case, the model expects an array of five ``float4`` |
| 163 | + elements named ``field_0``, so we built the query with that output: |
| 164 | +
|
| 165 | + .. code-block:: |
| 166 | + :caption: QueryData::General::Query |
| 167 | + :emphasize-lines: 11-18 |
| 168 | +
|
| 169 | + select ? as counter |
| 170 | + from demo.ecommerce.data where isnotnull(clientIpAddress) |
| 171 | + where (abs(hash(#{QUERY_COLUMN}))) %% #{QUERY_MODULE} = counter |
| 172 | + select |
| 173 | + hour(eventdate) as hour, |
| 174 | + minute(eventdate) as minute, |
| 175 | + second(eventdate) as second, |
| 176 | + clientIpAddress, |
| 177 | + userAgent |
| 178 | + group every 8h by clientIpAddress |
| 179 | + select |
| 180 | + str(clientIpAddress) as sourceIp, |
| 181 | + float4(size(collectcompact(hour))) as unique_hours, |
| 182 | + float4(size(collectcompact(minute))) as unique_mins, |
| 183 | + float4(size(collectcompact(second))) as unique_seconds, |
| 184 | + float4(size(collectcompact(userAgent))) as unique_user_agents, |
| 185 | + float4(avg(bytesTransferred)) as bytestransferred, |
| 186 | + [unique_hours, unique_mins, unique_seconds, unique_user_agents, bytestransferred] as field_0 |
| 187 | +
|
| 188 | +#. | **CheckModelUpdates** [`ml.MlUpdateNotifier <https://docs.devo.com/space/latest/95215021/ML+Update+Notifier>`_] |
| 189 | + | This unit keeps your ML model updated. If you upload a new version of your |
| 190 | + model keeping the name, your context will refresh the model image to start |
| 191 | + using the new one. |
| 192 | +
|
| 193 | + | If you do not plan to update your model ever, you can remove this unit. |
| 194 | +
|
| 195 | +#. | **ModelEvaluator** [`ml.MlSingleModelEval <https://docs.devo.com/space/latest/95214962/ML+Single+Model+Evaluator>`_] |
| 196 | + | This unit evaluates the specified model with the input data and writes the |
| 197 | + result in a new field with the specified output name. |
| 198 | +
|
| 199 | + | The unit was pre-configured to use a model named ``entity_classification_ip``, |
| 200 | + which is the name of the model built in the |
| 201 | + :ref:`entity classification <tutorials/entity-classification-kmeans-onnx:Entity classification (kmeans-ONNX)>` |
| 202 | + tutorial. You can change the model to use in the ``Model Name`` parameter |
| 203 | + of the unit, but note that you must update the query accordingly to match |
| 204 | + the fields of both. |
| 205 | +
|
| 206 | +#. | **FormatEvents** [`core.Map <https://docs.devo.com/space/latest/95214080/Map>`_] |
| 207 | + | This unit *adds/removes/reformats* fields to get an event that fits the |
| 208 | + expected format in the output table. |
| 209 | +
|
| 210 | +#. | **InjectResults** [`devo.DevoSink <https://docs.devo.com/space/latest/95214553/Devo+Sink>`_] |
| 211 | + | Writes the enriched events back to Devo to ``OUT_TABLE`` table, defaults |
| 212 | + to ``my.app.evalmodel.demo``. |
| 213 | +
|
| 214 | +.. note:: |
| 215 | + |
| 216 | + Refer to `Flow documentation <https://docs.devo.com/space/latest/95213164/Flow>`_ |
| 217 | + to learn more about Flow and their units. |
| 218 | + |
| 219 | +Remove query splitting |
| 220 | +---------------------- |
| 221 | + |
| 222 | +In this tutorial we split the query into N smaller queries which is a good |
| 223 | +approach if you expect a large amounts of data. if it is not the case you can |
| 224 | +remove the splitting. |
| 225 | + |
| 226 | +To do so in the **QueryData** unit remove the ``counter`` in `Parameters` |
| 227 | +section and modify the query by removing these lines: |
| 228 | + |
| 229 | +.. code-block:: |
| 230 | + :caption: QueryData::General::Query |
| 231 | + :emphasize-lines: 1, 3 |
| 232 | +
|
| 233 | + select ? as counter |
| 234 | + from demo.ecommerce.data where isnotnull(clientIpAddress) |
| 235 | + where (abs(hash(#{QUERY_COLUMN}))) %% #{QUERY_MODULE} = counter |
| 236 | + select |
| 237 | + hour(eventdate) as hour, |
| 238 | + minute(eventdate) as minute, |
| 239 | + second(eventdate) as second, |
| 240 | + clientIpAddress, |
| 241 | + userAgent |
| 242 | + group every 8h by clientIpAddress |
| 243 | + select |
| 244 | + str(clientIpAddress) as sourceIp, |
| 245 | + float4(size(collectcompact(hour))) as unique_hours, |
| 246 | + float4(size(collectcompact(minute))) as unique_mins, |
| 247 | + float4(size(collectcompact(second))) as unique_seconds, |
| 248 | + float4(size(collectcompact(userAgent))) as unique_user_agents, |
| 249 | + float4(avg(bytesTransferred)) as bytestransferred, |
| 250 | + [unique_hours, unique_mins, unique_seconds, unique_user_agents, bytestransferred] as field_0 |
| 251 | +
|
| 252 | +Without spliting you can still adjust the frequency of the query by changing the |
| 253 | +``EVERY`` and ``QUERY_DURATION`` variables according to your needs. |
0 commit comments