
Commit 53b2e8b

Merge pull request #25 from DevoInc/feat/ecommerce-clustering-tutorial
Feat/ecommerce clustering tutorial
2 parents 7b4f324 + 33d2282 commit 53b2e8b

15 files changed, +1377 -287 lines changed

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -4,6 +4,10 @@ All notable changes to this project will be documented in this file.
44
The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
55
and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).
66

7+
## [1.3.0] - 2023-03-03
8+
### Added
9+
* Add ecommerce clustering tutorial.
10+
711
## [1.2.1] - 2023-02-27
812
### Fixed
913
* Remove DL4J and DEEPLOG from engines.py.

docs/source/conf.py

Lines changed: 1 addition & 0 deletions
@@ -30,6 +30,7 @@
3030
exclude_patterns = []
3131
autodoc_member_order = 'bysource'
3232
add_module_names = False
33+
autosectionlabel_prefix_document = True
3334

3435
# -- Options for HTML output -------------------------------------------------
3536
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

docs/source/index.rst

Lines changed: 4 additions & 3 deletions
@@ -23,9 +23,10 @@ That easy is to create a client:
2323
>>> from devo_ml.modelmanager import create_client_from_token
2424
>>> client = create_client_from_token("https://the_url", "the_token")
2525

26-
Here we use a :ref:`client factory <Create Client From Token>` to create a
27-
:ref:`client <Client Object>` object, this is a convenient
28-
way of create a client ready to use.
26+
Here we use a
27+
:ref:`client factory <user_guide/client-object:Create Client From Token>`
28+
to create a :ref:`client <user_guide/client-object:Client Object>` object,
29+
this is a convenient way of create a client ready to use.
2930

3031
Now we can start manage our models, e.g. to get a model called
3132
`pokemon_onnx_regression`:

docs/source/tutorials/dga-domain-classifier.rst

Lines changed: 5 additions & 5 deletions
@@ -17,7 +17,7 @@ to learn more about domain generation algorithms.
1717

1818
We are going to develop a simple machine learning model capable of classifying
1919
malicious domains in the *demo.ecommerce.data* table. We will use the Machine
20-
Learning `H2O <https://h2o.ai/>`_ framework to build and train our model,
20+
Learning `H2O AI <https://h2o.ai/>`_ framework to build and train our model,
2121
the Machine Learning Model Manager Client to register it on the Devo platform,
2222
and the `Devo Python SDK <https://github.com/DevoInc/python-sdk>`_ framework to
2323
evaluate and classify in real time the domains in the table using a query.
@@ -156,10 +156,10 @@ Client.
156156
.. code-block::
157157
158158
# create the mlmm client
159-
client = create_client_from_token(DEVO_MLMM_URL, TOKEN)
159+
mlmm = create_client_from_token(DEVO_MLMM_URL, TOKEN)
160160
161161
# register the model
162-
client.add_model(
162+
mlmm.add_model(
163163
NAME,
164164
engines.H2O,
165165
os.path.join(MODELS_PATH, f"{NAME}.zip"),
@@ -169,15 +169,15 @@ Client.
169169
170170
.. note::
171171

172-
Refer to :ref:`user's guide <User's Guide>` of this documentation to learn
172+
Refer to :ref:`user_guide/index:User's Guide` of this documentation to learn
173173
more about the ML Model Manager Client.
174174

175175
So far we have everything ready to exploit our model.
176176

177177
Classify domains
178178
----------------
179179

180-
One way to evaluate a model is is by querying a table and the
180+
One way to evaluate a model is by querying a table and the
181181
``mlevalmodel(...)`` operator available in the Devo query engine capable
182182
of evaluating machine learning models.
183183

Lines changed: 307 additions & 0 deletions
@@ -0,0 +1,307 @@
1+
Ecommerce Clustering
2+
====================
3+
4+
This tutorial shows how to register and use an ML model in
5+
`ONNX <https://onnx.ai/>`_ format in the `Devo <https://www.devo.com>`_
6+
platform.
7+
8+
``ONNX`` is an open format to represent different machine learning models. There
9+
are many frameworks like *PyTorch*, *LIBSVM*, *Keras*, *MXNet*, *TensorFlow*,
10+
etc. whose models can be exported to ``ONNX``.
11+
12+
In this example we are going to use the
13+
`scikit-learn <https://scikit-learn.org/>`_ library to perform a classification
14+
task based on unsupervised training using the
15+
`kmeans <https://en.wikipedia.org/wiki/K-means_clustering>`_ clustering method.
16+
17+
Then we convert the model to ``ONNX`` format before registering and exploiting it
18+
using the Devo query engine.
19+
20+
.. note::
21+
22+
`The tutorial is available as a Jupyter notebook
23+
<https://github.com/DevoInc/python-mlmodelmanager-client/blob/main/notebooks/ecommerce_clustering.ipynb>`_.
24+
25+
Build the model
26+
---------------
27+
28+
Let's start by installing the required packages.
29+
30+
.. code-block:: bash
31+
32+
$ pip install devo-sdk
33+
$ pip install devo-mlmodelmanager
34+
$ pip install scikit-learn
35+
$ pip install onnx
36+
$ pip install numpy
37+
$ pip install pandas
38+
$ pip install skl2onnx
39+
40+
Then the needed imports.
41+
42+
.. code-block::
43+
44+
import numpy as np
45+
import pandas as pd
46+
import onnx
47+
48+
from onnx import helper, TensorProto
49+
from onnx.tools import update_model_dims
50+
from sklearn.cluster import KMeans
51+
from skl2onnx import convert_sklearn, to_onnx
52+
from devo.api import Client, ClientConfig, JSON, SIMPLECOMPACT_TO_OBJ
53+
from devo_ml.modelmanager import create_client_from_token, engines
54+
55+
Declare some constants for convenience in the code.
56+
57+
.. code-block::
58+
59+
# A valid Devo access token
60+
TOKEN = '<your_token_here>'
61+
62+
# URL of Devo API, e.g. https://apiv2-us.devo.com/search/query/
63+
DEVO_API_URL = '<devo_api_url_here>'
64+
65+
# URL of Devo ML Model Manager, e.g. https://api-us.devo.com/mlmodelmanager/
66+
DEVO_MLMM_URL = '<devo_mlmm_url_here>'
67+
68+
# The domain to connect to, e.g. self
69+
DOMAIN = '<your_domain_here>'
70+
71+
# The name of the model
72+
NAME = 'ecommerce_cluster'
73+
74+
# The description of the model
75+
DESCRIPTION = 'Demo of ecommerce clustering'
76+
77+
# File to store the onnx model
78+
MODEL_FILE = f'{NAME}.onnx'
79+
80+
Our model will classify the IPs in the *demo.ecommerce* table into three
81+
supposed interest groups: IU, AU, MU.
82+
83+
.. note::
84+
85+
For simplicity it has been assumed that the number of clusters is 3.
86+
It is appropriate to use other methods such as the
87+
`Elbow method <https://en.wikipedia.org/wiki/Elbow_method_(clustering)>`_
88+
or the
89+
`Silhouette method <https://en.wikipedia.org/wiki/Silhouette_(clustering)>`_
90+
to determine the optimal number of clusters in a dataset.
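As a rough sketch of the Silhouette method mentioned in the note, the snippet below scores several candidate values of ``k`` and keeps the one with the highest mean silhouette score. The synthetic data, the candidate range and the variable names are assumptions made for illustration; they are not taken from the *demo.ecommerce* table.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the training data: three well-separated 2-D blobs.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 9.0]])
data = np.vstack([c + rng.normal(scale=1.0, size=(100, 2)) for c in centers])

# Fit KMeans for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(data)
    scores[k] = silhouette_score(data, labels)

# The k with the highest score is the preferred number of clusters.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On real data the scores are usually closer together, so it may be worth inspecting the whole ``scores`` dictionary rather than only the maximum.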
91+
92+
To build and train the model we are going to use the existing data in the table
93+
itself. To get it, we need a Devo API client so that we can
94+
launch queries.
95+
96+
With the `Devo Python SDK <https://github.com/DevoInc/python-sdk>`_,
97+
among other features, we can execute queries against the Devo platform easily
98+
and securely.
99+
100+
.. code-block::
101+
102+
# create a Devo API client
103+
api = Client(
104+
auth={"token": TOKEN},
105+
address=DEVO_API_URL,
106+
config=ClientConfig(
107+
response="json/simple/compact",
108+
stream=True,
109+
processor=SIMPLECOMPACT_TO_OBJ
110+
)
111+
)
112+
113+
.. note::
114+
115+
Refer to `Query API <https://docs.devo.com/space/latest/95128275>`_
116+
to learn more about the Devo Query API.
117+
118+
Now we can extract and prepare the data for our model.
119+
120+
.. code-block::
121+
122+
query = '''from demo.ecommerce.data where isnotnull(clientIpAddress)
123+
select
124+
hour(eventdate) as hour,
125+
minute(eventdate) as minute,
126+
second(eventdate) as second,
127+
clientIpAddress,
128+
userAgent
129+
group every 8h by clientIpAddress
130+
select
131+
str(clientIpAddress) as sourceIp,
132+
float4(size(collectcompact(hour))) as unique_hours,
133+
float4(size(collectcompact(minute))) as unique_mins,
134+
float4(size(collectcompact(second))) as unique_seconds,
135+
float4(size(collectcompact(userAgent))) as unique_user_agents,
136+
float4(avg(bytesTransferred)) as bytestransferred
137+
'''
138+
139+
response = api.query(
140+
query=query,
141+
dates={'from': 'today() - 2 * day()', 'to': 'today() - 1 * day()'}
142+
)
143+
144+
raw_data = pd.DataFrame(
145+
response,
146+
columns=[
147+
'sourceIp',
148+
'unique_hours',
149+
'unique_mins',
150+
'unique_seconds',
151+
'unique_user_agents',
152+
'bytestransferred',
153+
],
154+
)
155+
156+
.. note::
157+
158+
Refer to `Build a query using LINQ
159+
<https://docs.devo.com/space/latest/95191261/Build+a+query+using+LINQ>`_
160+
to learn more about queries.
161+
162+
With the data already prepared, it is time to create and train the model. We
163+
use the
164+
`sklearn.cluster.KMeans <https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans>`_
165+
class.
166+
167+
.. code-block::
168+
169+
train_data = raw_data.select_dtypes(include=np.number).to_numpy()
170+
171+
model = KMeans(
172+
n_clusters=3,
173+
init='k-means++',
174+
verbose=0,
175+
max_iter=300,
176+
random_state=42
177+
).fit(train_data)
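The ``select_dtypes`` call keeps only the numeric columns, so the ``sourceIp`` string column is dropped and the five numeric features remain. A small sketch with made-up rows (the values are invented for illustration) shows the effect:

```python
import numpy as np
import pandas as pd

# Two made-up rows with the same columns as ``raw_data``.
sample = pd.DataFrame(
    [
        ['10.0.0.1', 3.0, 40.0, 55.0, 2.0, 1024.0],
        ['10.0.0.2', 1.0, 5.0, 7.0, 1.0, 300.5],
    ],
    columns=[
        'sourceIp',
        'unique_hours',
        'unique_mins',
        'unique_seconds',
        'unique_user_agents',
        'bytestransferred',
    ],
)

# Only numeric columns survive, in their original order.
features = sample.select_dtypes(include=np.number)
matrix = features.to_numpy()

print(list(features.columns))
print(matrix.shape)
```

The five remaining columns correspond to the five-element input that the model receives when it is evaluated.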
178+
179+
Let's now transform the model to ``ONNX`` format.
180+
181+
For that we use the
182+
`skl2onnx.to_onnx <https://onnx.ai/sklearn-onnx/api_summary.html?highlight=to_onnx#skl2onnx.to_onnx>`_
183+
function of the `sklearn-onnx <https://onnx.ai/sklearn-onnx/index.html>`_ library.
184+
185+
.. code-block::
186+
187+
model_onnx = to_onnx(
188+
model,
189+
train_data.astype(np.float32),
190+
target_opset=13,
191+
)
192+
193+
Some transformations need to be made to the model output in order to be able to
194+
properly consume it in the Devo platform.
195+
196+
.. code-block::
197+
198+
# Output: scores (discarded)
199+
_ = model_onnx.graph.output.pop(1)
200+
201+
# Output: label (discarded)
202+
_ = model_onnx.graph.output.pop(0)
203+
204+
# Last output should be float to work in Devo
205+
cast_node = helper.make_node(
206+
'Cast',
207+
inputs=['label'],
208+
outputs=['label_cast'],
209+
name='output_label_cast',
210+
to=TensorProto.FLOAT,
211+
)
212+
model_onnx.graph.node.append(cast_node)
213+
model_onnx.graph.output.append(
214+
helper.make_tensor_value_info(
215+
name='label_cast',
216+
elem_type=TensorProto.FLOAT,
217+
shape=[-1],
218+
)
219+
)
220+
221+
# Expand last dimension, so it has two dimensions: batch and item
222+
# It's required only for the kmeans in sklearn, other algorithms
223+
# like linear regression do not require this conversion
224+
model_onnx = onnx.compose.expand_out_dim(model_onnx, dim_idx=1)
225+
model_onnx = update_model_dims.update_inputs_outputs_dims(
226+
model_onnx,
227+
{'X': [-1, 5]},
228+
{'label_cast': [-1, 1]},
229+
)
230+
231+
Finally we save the model in a file.
232+
233+
.. code-block::
234+
235+
with open(MODEL_FILE, 'wb') as fp:
236+
fp.write(model_onnx.SerializeToString())
237+
238+
Register the model
239+
------------------
240+
241+
Once the model has been converted and saved, it must be registered on the
242+
Devo platform in order to exploit it. For this we will use the ML Model Manager
243+
Client.
244+
245+
.. code-block::
246+
247+
# create the mlmm client
248+
mlmm = create_client_from_token(DEVO_MLMM_URL, TOKEN)
249+
250+
# register the model
251+
mlmm.add_model(
252+
NAME,
253+
engines.ONNX,
254+
MODEL_FILE,
255+
description=DESCRIPTION,
256+
force=True
257+
)
258+
259+
Classifying
260+
-----------
261+
262+
We use the ``mlevalmodel(...)`` operator, which the Devo query engine provides
263+
for evaluating machine learning models, to classify IPs with the previous model.
264+
265+
A useful query might look like this.
266+
267+
.. code-block::
268+
269+
query = f'''from demo.ecommerce.data where isnotnull(clientIpAddress)
270+
select
271+
hour(eventdate) as hour,
272+
minute(eventdate) as minute,
273+
second(eventdate) as second,
274+
clientIpAddress,
275+
userAgent
276+
group every 8h by clientIpAddress
277+
select
278+
str(clientIpAddress) as sourceIp,
279+
float4(size(collectcompact(hour))) as unique_hours,
280+
float4(size(collectcompact(minute))) as unique_mins,
281+
float4(size(collectcompact(second))) as unique_seconds,
282+
float4(size(collectcompact(userAgent))) as unique_user_agents,
283+
float4(avg(bytesTransferred)) as bytestransferred,
284+
at(mlevalmodel(
285+
"{DOMAIN}",
286+
"{NAME}",
287+
[unique_hours, unique_mins, unique_seconds, unique_user_agents, bytestransferred]), 0) as label,
288+
ifthenelse(label = 0.0, "IU", ifthenelse(label = 1.0, "AU", "MU")) as type
289+
'''
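The query is an f-string, so ``{DOMAIN}`` and ``{NAME}`` are replaced with the constants before the query is sent. A shortened sketch with example values (``self`` is the example domain given earlier; the query fragment is abbreviated for illustration):

```python
# Example values; replace them with your own domain and model name.
DOMAIN = 'self'
NAME = 'ecommerce_cluster'

# Abbreviated fragment of the query, built the same way as the full one.
fragment = f'''at(mlevalmodel(
    "{DOMAIN}",
    "{NAME}",
    [unique_hours]), 0) as label'''

print(fragment)
```

Any literal curly brace needed inside such a query would have to be doubled (``{{`` and ``}}``), because single braces are interpreted as f-string placeholders.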
290+
291+
Using the ``api`` client created earlier, we can fetch the results.
292+
293+
.. code-block::
294+
295+
response = api.query(
296+
query=query,
297+
dates={'from': 'today() - 1 * day()'}
298+
)
299+
300+
for row in response:
301+
print("IP:", row['sourceIp'], "type", row['type'])
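The nested ``ifthenelse`` in the query maps the numeric label to a group name. The same mapping written in Python, as a hypothetical helper (``label_to_type`` is not part of any Devo library), may be handy when post-processing rows on the client side:

```python
def label_to_type(label: float) -> str:
    """Mirror the query's nested ifthenelse: 0 -> IU, 1 -> AU, otherwise MU."""
    if label == 0.0:
        return 'IU'
    if label == 1.0:
        return 'AU'
    return 'MU'


for label in (0.0, 1.0, 2.0):
    print(label, label_to_type(label))
```

Note that which cluster index ends up meaning which group depends on the trained model, so the mapping should be revisited whenever the model is retrained.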
302+
303+
.. note::
304+
305+
The intention of the tutorial is only to demonstrate how to convert a model
306+
to ``ONNX`` and upload it to the Devo platform, not to create a valid
307+
and optimal clustering model.

docs/source/tutorials/index.rst

Lines changed: 1 addition & 0 deletions
@@ -9,3 +9,4 @@ Manager Client in the process of development a ML model.
99
:caption: Contents
1010

1111
dga-domain-classifier
12+
ecommerce_clustering

docs/source/user_guide/authentication.rst

Lines changed: 4 additions & 3 deletions
@@ -7,9 +7,10 @@ request, perform an action to authenticate that request and return the
77
authenticated request.
88

99
An authenticated request must have a valid token in an HTML header with one of
10-
the supported authentication methods; :ref:`StandAloneToken scheme` or
11-
:ref:`Bearer scheme`. The library provides implementations for both
12-
authentication methods.
10+
the supported authentication methods;
11+
:ref:`user_guide/authentication:StandAloneToken scheme` or
12+
:ref:`user_guide/authentication:Bearer scheme`. The library provides
13+
implementations for both authentication methods.
1314

1415
.. note::
1516
