Skip to content

Commit f80198d

Browse files
Merge pull request #30 from DevoInc/feature/entity-classification-flow-tutorial
feat: add entity classification with Flow tutorial
2 parents be0490c + abe57d2 commit f80198d

File tree

9 files changed

+263
-2
lines changed

9 files changed

+263
-2
lines changed

.github/workflows/documentation.yml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -22,6 +22,7 @@ jobs:
2222
- name: Install Poetry
2323
uses: snok/install-poetry@v1
2424
with:
25+
version: 1.4.0
2526
virtualenvs-create: true
2627
virtualenvs-in-project: true
2728
installer-parallel: true

.github/workflows/publish-pypi.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -25,7 +25,7 @@ jobs:
2525
- name: Install dependencies
2626
run: |
2727
python -m pip install --upgrade pip
28-
pip install poetry
28+
pip install poetry==1.4.0
2929
- name: Build package
3030
run: poetry build
3131
- name: Publish a Python distribution to PyPI

.github/workflows/publish-test.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -28,7 +28,7 @@ jobs:
2828
- name: Install dependencies
2929
run: |
3030
python -m pip install --upgrade pip
31-
pip install poetry
31+
pip install poetry==1.4.0
3232
- name: Build package
3333
run: poetry build
3434
- name: Publish a Python distribution to TestPyPI

.github/workflows/test-tox.yml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -17,6 +17,7 @@ jobs:
1717
python-version: ${{ matrix.python-version }}
1818
- uses: snok/install-poetry@v1
1919
with:
20+
version: 1.4.0
2021
virtualenvs-in-project: true
2122
- name: Install dependencies
2223
run: |

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,10 @@ All notable changes to this project will be documented in this file.
44
The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
55
and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).
66

7+
## [1.6.0] - 2023-03-23
8+
### Added
9+
* Add entity classification with Flow tutorial.
10+
711
## [1.5.2] - 2023-03-13
812
### Fixed
913
* Rename DGA domains scoring (gbm-onnx) tutorial.

docs/source/_static/EvalModelTemplate.json

Lines changed: 1 addition & 0 deletions
Large diffs are not rendered by default.
84.2 KB
Loading
Lines changed: 253 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,253 @@
1+
Entity classification with Flow
2+
===============================
3+
4+
This tutorial shows how to evaluate a machine learning model using the Devo
5+
Platform's very own correlation engine:
6+
`Flow <https://docs.devo.com/space/latest/95213164/Flow>`_.
7+
8+
We will use the model developed in the tutorial
9+
:ref:`tutorials/entity-classification-kmeans-onnx:Entity classification (kmeans-ONNX)`
10+
to demonstrate how to classify new data automatically with the desired
11+
periodicity.
12+
13+
We will also learn how to subscribe to updates of the model itself in order
14+
to have the latest improvements of its training.
15+
16+
Requirements
17+
------------
18+
19+
* Devo table ``demo.ecommerce.data``.
20+
* ML model from the tutorial
21+
:ref:`entity classification <tutorials/entity-classification-kmeans-onnx:Entity classification (kmeans-ONNX)>`
22+
registered in the `Devo <https://www.devo.com>`_ platform.
23+
24+
Create the Flow
25+
---------------
26+
27+
One of the uses of `Flow` is the transformation and manipulation of data
28+
through a series of defined steps. The output of one step is the input to the
29+
next step(s) and so on. In this way we can define a complex data flow to achieve
30+
a desired goal. The set of steps to perform a task is known as `context`.
31+
32+
Each of steps in a context is represented by a `Unit <https://docs.devo.com/space/latest/95213765>`_.
33+
`Flow` provides a wide variety of units to perform lot of individual subtasks.
34+
35+
A relevant unit for our use case is the
36+
`ML Single Model Evaluator <https://docs.devo.com/space/latest/95214962/ML+Single+Model+Evaluator>`_
37+
which is capable of evaluating an ML model, so we need to create a context that
38+
includes this unit, in addition to others.
39+
40+
This is the layout of the context proposed for our task.
41+
42+
.. image:: /_static/EvalModelTemplate.png
43+
44+
We can
45+
`create the context from scratch <https://docs.devo.com/space/latest/95213200/Creating+a+Flow>`_,
46+
but there is an easier way to do it if we have the context specification, which
47+
is our case.
48+
49+
The :download:`EvalModelTemplate.json </_static/EvalModelTemplate.json>`
50+
specification contains all the specification to create the suitable context.
51+
We can import the context from ``Flow welcome page`` of `Flow` application
52+
using the option ``Import from json``, so download it and import it.
53+
54+
.. code-block::
55+
:caption: EvalModelTemplate.json
56+
57+
{
58+
"id": null,
59+
"description": "",
60+
"name": "EvalModelTemplate",
61+
"type": "app.Context",
62+
"templateFormat": "%mono",
63+
"loggedItems": [],
64+
"variableConf": {
65+
"variables": [{
66+
"id": "d310c38c-pa41-iebc-9cfa-d52483cffdbb",
67+
"name": "OUT_TABLE",
68+
"description": "Devo table to write to.",
69+
"type": "string",
70+
"value": "my.app.evalmodel.demo"
71+
},
72+
...
73+
}
74+
75+
.. note::
76+
77+
The specification of a context is usually very long, here it has been
78+
shortened for tutorial readability.
79+
80+
At this point we should have a ready-to-use context that reads data from the
81+
``demo.ecommerce.data`` table, evaluates it through an ML model and writes the
82+
results to the ``my.app.evalmodel.demo`` table, all just out of the box
83+
importing a context specification.
84+
85+
But a context does not have to be static, it can be adjusted by variables and
86+
its units are also configurable, let's see how.
87+
88+
Variables
89+
---------
90+
91+
Although the context is ready out of the box after import, we can adjust it by
92+
means of a set of variables
93+
94+
From ``File \ Variables`` option of the main menu of `Flow` application you
95+
can tune the variable values.
96+
97+
.. list-table:: Variable defaults
98+
:header-rows: 1
99+
100+
* - NAME
101+
- TYPE
102+
- DEFINITION
103+
- DEFAULT VALUE
104+
* - EVERY
105+
- ``String``
106+
- Cron expression to define how often queries will be launched
107+
- ``0 0 0 ? * * *``
108+
* - QUERY_DURATION
109+
- ``Long``
110+
- Length in milliseconds of the period the query should consider
111+
- ``86400000``
112+
* - QUERY_MODULE
113+
- ``Integer``
114+
- Number of chunks to split the query
115+
- ``24``
116+
* - QUERY_COLUMN
117+
- ``String``
118+
- Name of the query column used to split the query into chunks
119+
- ``clientIpAddress``
120+
* - OUT_TABLE
121+
- ``String``
122+
- Name of the table where results will be written
123+
- ``my.app.evalmodel.demo``
124+
125+
If you want to change the ``EVERY`` variable be aware that you must change the
126+
``QUERY_DURATION`` variable accordingly. For instance, if you want to run the
127+
query every 2 hours, ``QUERY_DURATION`` should be ``7200000``
128+
`(2h * 60min/h * 60s/min * 1000ms/s)`.
129+
130+
Units overview
131+
--------------
132+
133+
Let's see the units of the layout.
134+
135+
#. | **EveryX** [`event.Scheduler <https://docs.devo.com/space/latest/95214759/Scheduler>`_]
136+
137+
| This unit is configured to start working when the context is run.
138+
Everytime this unit sends an event, queries will be launched. To change
139+
how often queries should be launched, you can change the value of the Flow
140+
variable ``EVERY``. By default, it is configured to send an event everyday
141+
at midnight.
142+
143+
#. | **SetFromAndToDates** [`core.Map <https://docs.devo.com/space/latest/95214080/Map>`_]
144+
| This unit uses the date generated in the previous unit to set the ``from``
145+
and ``to`` timestamps for the query. The variable ``QUERY_DURATION`` sets
146+
the ``from`` timestamp. By default, it is configured to 24h.
147+
148+
#. | **MultiplyByNToSplitQuery** [`event.Repeater <https://docs.devo.com/space/latest/95214715/Repeater>`_]
149+
| This unit generates ``N`` events for every event received, all of them
150+
with the same content except for a ``counter``. We will use that counter
151+
to split the results of the query when there are too many values.
152+
153+
#. | **QueryData** [`devo.DevoFullQuery <https://docs.devo.com/space/latest/95214466/Devo+Full+Query>`_]
154+
| This is the unit that gets data from the table ``demo.ecommerce.data``
155+
using the Query Engine and the dates calculated in the **SetFromAndToDates**
156+
unit. We added a module operation to split the query into smaller queries
157+
just in case there are too many results to handle in one go.
158+
159+
| Note that this unit is connected to the **ModelEvaluator** unit that
160+
evaluates the input with a machine learning model, therefore the output of
161+
the unit must match the input fields definition of the ML model, i.e., same
162+
name and type. In our case, the model expects an array of five ``float4``
163+
elements named ``field_0``, so we built the query with that output:
164+
165+
.. code-block::
166+
:caption: QueryData::General::Query
167+
:emphasize-lines: 11-18
168+
169+
select ? as counter
170+
from demo.ecommerce.data where isnotnull(clientIpAddress)
171+
where (abs(hash(#{QUERY_COLUMN}))) %% #{QUERY_MODULE} = counter
172+
select
173+
hour(eventdate) as hour,
174+
minute(eventdate) as minute,
175+
second(eventdate) as second,
176+
clientIpAddress,
177+
userAgent
178+
group every 8h by clientIpAddress
179+
select
180+
str(clientIpAddress) as sourceIp,
181+
float4(size(collectcompact(hour))) as unique_hours,
182+
float4(size(collectcompact(minute))) as unique_mins,
183+
float4(size(collectcompact(second))) as unique_seconds,
184+
float4(size(collectcompact(userAgent))) as unique_user_agents,
185+
float4(avg(bytesTransferred)) as bytestransferred,
186+
[unique_hours, unique_mins, unique_seconds, unique_user_agents, bytestransferred] as field_0
187+
188+
#. | **CheckModelUpdates** [`ml.MlUpdateNotifier <https://docs.devo.com/space/latest/95215021/ML+Update+Notifier>`_]
189+
| This unit keeps your ML model updated. If you upload a new version of your
190+
model keeping the name, your context will refresh the model image to start
191+
using the new one.
192+
193+
| If you do not plan to update your model ever, you can remove this unit.
194+
195+
#. | **ModelEvaluator** [`ml.MlSingleModelEval <https://docs.devo.com/space/latest/95214962/ML+Single+Model+Evaluator>`_]
196+
| This unit evaluates the specified model with the input data and writes the
197+
result in a new field with the specified output name.
198+
199+
| The unit was pre-configured to use a model named ``entity_classification_ip``,
200+
which is the name of the model built in the
201+
:ref:`entity classification <tutorials/entity-classification-kmeans-onnx:Entity classification (kmeans-ONNX)>`
202+
tutorial. You can change the model to use in the ``Model Name`` parameter
203+
of the unit, but note that you must update the query accordingly to match
204+
the fields of both.
205+
206+
#. | **FormatEvents** [`core.Map <https://docs.devo.com/space/latest/95214080/Map>`_]
207+
| This unit *adds/removes/reformats* fields to get an event that fits the
208+
expected format in the output table.
209+
210+
#. | **InjectResults** [`devo.DevoSink <https://docs.devo.com/space/latest/95214553/Devo+Sink>`_]
211+
| Writes the enriched events back to Devo to ``OUT_TABLE`` table, defaults
212+
to ``my.app.evalmodel.demo``.
213+
214+
.. note::
215+
216+
Refer to `Flow documentation <https://docs.devo.com/space/latest/95213164/Flow>`_
217+
to learn more about Flow and their units.
218+
219+
Remove query splitting
220+
----------------------
221+
222+
In this tutorial we split the query into N smaller queries which is a good
223+
approach if you expect a large amounts of data. if it is not the case you can
224+
remove the splitting.
225+
226+
To do so in the **QueryData** unit remove the ``counter`` in `Parameters`
227+
section and modify the query by removing these lines:
228+
229+
.. code-block::
230+
:caption: QueryData::General::Query
231+
:emphasize-lines: 1, 3
232+
233+
select ? as counter
234+
from demo.ecommerce.data where isnotnull(clientIpAddress)
235+
where (abs(hash(#{QUERY_COLUMN}))) %% #{QUERY_MODULE} = counter
236+
select
237+
hour(eventdate) as hour,
238+
minute(eventdate) as minute,
239+
second(eventdate) as second,
240+
clientIpAddress,
241+
userAgent
242+
group every 8h by clientIpAddress
243+
select
244+
str(clientIpAddress) as sourceIp,
245+
float4(size(collectcompact(hour))) as unique_hours,
246+
float4(size(collectcompact(minute))) as unique_mins,
247+
float4(size(collectcompact(second))) as unique_seconds,
248+
float4(size(collectcompact(userAgent))) as unique_user_agents,
249+
float4(avg(bytesTransferred)) as bytestransferred,
250+
[unique_hours, unique_mins, unique_seconds, unique_user_agents, bytestransferred] as field_0
251+
252+
Without spliting you can still adjust the frequency of the query by changing the
253+
``EVERY`` and ``QUERY_DURATION`` variables according to your needs.

docs/source/tutorials/index.rst

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,3 +12,4 @@ Manager Client in the process of development a ML model.
1212
dga-domain-classifier-keras-onnx
1313
dga-domain-scoring-gbm-onnx
1414
entity-classification-kmeans-onnx
15+
entity-classification-flow

0 commit comments

Comments
 (0)