diff --git a/site/content/3.13/aql/functions/vector.md b/site/content/3.13/aql/functions/vector.md index e6a6de0134..45341b8ea8 100644 --- a/site/content/3.13/aql/functions/vector.md +++ b/site/content/3.13/aql/functions/vector.md @@ -12,7 +12,7 @@ To use vector search, you need to have vector embeddings stored in documents and the attribute that stores them needs to be indexed by a [vector index](../../index-and-search/indexing/working-with-indexes/vector-indexes.md). -You can calculate vector embeddings using [ArangoDB's GraphML](../../data-science/arangographml/_index.md) +You can calculate vector embeddings using [ArangoDB's GraphML](../../data-science/graphml/_index.md) capabilities (available in ArangoGraph) or using external tools. {{< warning >}} diff --git a/site/content/3.13/data-science/_index.md b/site/content/3.13/data-science/_index.md index 8655091987..55d9602ec5 100644 --- a/site/content/3.13/data-science/_index.md +++ b/site/content/3.13/data-science/_index.md @@ -1,6 +1,6 @@ --- -title: Data Science -menuTitle: Data Science +title: Data Science and GenAI +menuTitle: Data Science & GenAI weight: 115 description: >- ArangoDB lets you apply analytics and machine learning to graph data at scale @@ -69,7 +69,7 @@ GraphML can answer questions like: ![Graph ML](../../images/graph-ml.png) For ArangoDB's enterprise-ready, graph-powered machine learning offering, -see [ArangoGraphML](arangographml/_index.md). +see [ArangoGraphML](graphml/_index.md). 
## Use Cases diff --git a/site/content/3.13/data-science/arangographml/deploy.md b/site/content/3.13/data-science/arangographml/deploy.md deleted file mode 100644 index 2fa947b758..0000000000 --- a/site/content/3.13/data-science/arangographml/deploy.md +++ /dev/null @@ -1,76 +0,0 @@ ---- -title: Deploy ArangoGraphML -menuTitle: Deploy -weight: 5 -description: >- - You can deploy ArangoGraphML in your own Kubernetes cluster or use the managed - cloud service that comes with a ready-to-go, pre-configured environment ---- - -## Managed cloud service versus self-managed - -ArangoDB offers two deployment options, tailored to suit diverse requirements -and infrastructure preferences: -- Managed cloud service via the [ArangoGraph Insights Platform](https://dashboard.arangodb.cloud/home?utm_source=docs&utm_medium=cluster_pages&utm_campaign=docs_traffic) -- Self-managed solution via the [ArangoDB Kubernetes Operator](https://github.com/arangodb/kube-arangodb) - -### ArangoGraphML - -ArangoGraphML provides enterprise-ready Graph Machine Learning as a -Cloud Service via Jupyter Notebooks that run on the -[ArangoGraph Insights Platform](https://dashboard.arangodb.cloud/home?utm_source=docs&utm_medium=cluster_pages&utm_campaign=docs_traffic). - -{{< tip >}} -To get access to ArangoGraphML services and packages, -[get in touch](https://www.arangodb.com/contact/) -with the ArangoDB team. -{{< /tip >}} - -- **Accessible at all levels** - - Low code UI - - Notebooks - - APIs -- **Full usability** - - MLOps lifecycle - - Metrics - - Metadata capture - - Model management - -![ArangoGraphML Pipeline](../../../images/ArangoGraphML_Pipeline.png) - -#### Setup - -The ArangoGraphML managed-service runs on the -[ArangoGraph Insights Platform](https://dashboard.arangodb.cloud/home?utm_source=docs&utm_medium=cluster_pages&utm_campaign=docs_traffic). -It offers a pre-configured environment where everything, -including necessary components and configurations, comes preloaded. 
You don't -need to set up or configure the infrastructure, and can immediately start using the -GraphML functionalities. - -To summarize, all you need to do is: -1. Sign up for an [ArangoGraph account](https://dashboard.arangodb.cloud/home?utm_source=docs&utm_medium=cluster_pages&utm_campaign=docs_traffic). -2. Create a new [deployment in ArangoGraph](../../arangograph/deployments/_index.md#how-to-create-a-new-deployment). -3. Start using the ArangoGraphML functionalities. - -### Self-managed ArangoGraphML - -The self-managed solution enables you to deploy and manage ArangoML within your -Kubernetes cluster using the [ArangoDB Kubernetes Operator](https://github.com/arangodb/kube-arangodb). - -The self-managed package includes the same features as in ArangoGraphML. -The primary distinction lies in the environment setup: with the self-managed -solution, you have direct control over configuring your environment. - -#### Setup - -You can run ArangoGraphML in your Kubernetes -cluster provided you already have a running `ArangoDeployment`. If you don't -have one yet, consider checking the installation guide of the -[ArangoDB Kubernetes Operator](https://arangodb.github.io/kube-arangodb/docs/using-the-operator.html) -and the [ArangoDeployment Custom Resource](https://arangodb.github.io/kube-arangodb/docs/deployment-resource-reference.html) -description. - -To start ArangoGraphML in your Kubernetes cluster, follow the instructions provided -in the [ArangoMLExtension Custom Resource](https://arangodb.github.io/kube-arangodb/docs/mlextension-resource.html) -description. Once the `CustomResource` has been created and the ArangoGraphML extension -is ready, you can start using it. 
\ No newline at end of file diff --git a/site/content/3.13/data-science/arangographml/ui.md b/site/content/3.13/data-science/arangographml/ui.md deleted file mode 100644 index bbe6f9df88..0000000000 --- a/site/content/3.13/data-science/arangographml/ui.md +++ /dev/null @@ -1,264 +0,0 @@ ---- -title: GraphML -menuTitle: GraphML -weight: 15 -description: >- - Learn how to create, configure, and run a full machine learning workflow for ArangoGraphML using the steps and features in the ArangoDB web interface ---- -Solve high-computational graph problems with Graph Machine Learning. Apply ML on a selected graph to predict connections, get better product recommendations, classify nodes, and perform node embeddings. Configure and run the whole machine learning flow entirely in the web interface. - -## What You Can Do with GraphML - -GraphML directly supports two primary machine learning tasks: - -* **Node Classification:** Automatically assign a label to nodes in your graph. For example, you can classify customers as "likely to churn" or "high value," or identify fraudulent transactions. -* **Node Embeddings:** Generate numerical representations (vectors) for each node. The main purpose of these embeddings is to measure similarity: nodes that are alike are placed close together in a high-dimensional space. By comparing the embeddings of two nodes, you can calculate their similarity. This is the foundation for advanced tasks like link prediction (predicting a future connection), building recommendation engines, and finding similar items. - -## How ArangoGraphML Works - -The underlying framework for ArangoGraphML is **[GraphSAGE](https://snap.stanford.edu/graphsage/)**. GraphSAGE (Graph Sample and AggreGatE) is a powerful Graph Neural Network (GNN) **framework** designed for inductive representation learning on large graphs. It is used to generate low-dimensional vector representations for nodes and is especially useful for graphs that have rich node attribute information. 
The overall process involves two main stages: - -1. **Featurization**: Your raw graph data is transformed into numerical representations that the model can understand. - * The system iterates over your selected vertices and converts their attributes: booleans become `0` or `1`, numbers are normalized, and text attributes are converted into numerical vectors using sentence transformers. - * All of these numerical features are then combined (concatenated). - * Finally, **Incremental PCA** (Incremental Principal Component Analysis a dimensionality reduction technique) is used to reduce the size of the combined features, which helps remove noise and keep only the most important information. - -2. **Training**: The model learns from the graph's structure by sampling and aggregating information from each node's local neighborhood. - * For each node, GraphSAGE looks at connections up to **2 hops away**. - * Specifically, it uniformly samples up to **25 direct neighbors** (depth 1) and for each of those, it samples up to **10 of their neighbors** (depth 2). - * By aggregating feature information from this sampled neighborhood, the model creates a rich "embedding" for each node that captures both its own features and its role in the graph. - -## Limitations - -* **Edge Attributes**: The current version of ArangoGraphML does not support the use of edge attributes as features. -* **Dangling Edges**: Edges that point to non-existent vertices ("dangling edges") are not caught during the featurization analysis. They may cause errors later, during the Training phase. -* **Prediction Scope**: Predictions can only be run in batches on the graph. It is not possible to request a prediction for a single document on-the-fly (Example, via an AQL query). -* **Memory Usage**: Both featurization and training can be memory-intensive. Out-of-memory errors can occur on large graphs with insufficient system resources. 
- -## The GraphML Workflow - -The entire process is organized into sequential steps within a **Project**, giving you a clear path from data to prediction: - -1. **Featurization:** Select your data and convert it into numerical features. -2. **Training:** Train a GraphSAGE model on the features and graph structure. -3. **Model Selection:** Evaluate the trained models and choose the best one. -4. **Prediction:** Use the selected model to generate predictions on your data. -5. **Scheduling (Optional):** Automate the prediction process to run at regular intervals. - -## Creating a GraphML Project - -To create a new GraphML project using the ArangoDB Web Interface, follow these steps: - -1. From the left-hand sidebar, select the database where you want to create the project. -2. In the left-hand navigation menu, click **Data Science Suite** to open the GraphML project management interface, then click Run GraphML. - ![Navigate to Data Science](../../../images/datascience-intro.jpg) -3. In the **GraphML projects** view, click **Add new project**. -4. The **Create ML project** modal opens. Enter a **Name** for your machine learning project. -5. Click the **Create project** button to finalize the creation. -6. After creation, the new project appears in the list under **GraphML projects**. Click the project name to begin with a Featurization job. - -## Featurization Phase - -After clicking on a project name, you are taken to a screen where you can configure and start a new Featurization job. Follow these steps: -- **Select a Graph** – In the **Features** section, choose your target graph from the **Select a graph** dropdown. -- **Choose Vertex Collections** – Pick the vertex collections that you want to include for feature extraction. -- **Select Attributes** – Choose the attributes from your vertex collection to convert into machine-understandable features. - -{{< info >}} -Attributes cannot be used if their values are lists or arrays. 
-{{< /info >}} - -{{< info >}} -A metagraph is basically just the configured subset of a graph (the vertex and edge collections and the specified attributes). This is what you see represented in the metagraph object in the JSON specification on the right. -{{< /info >}} - -The featurization process has several configurable options, grouped into Configuration and Advanced settings. These are also shown in a JSON format on the right side of the screen for transparency. - -**Configuration** -These settings control the overall featurization job and how features are stored. - -- **Batch size** – The number of documents to process in a single batch. -- **Run analysis checks** – Whether to run analysis checks to perform a high-level analysis of the data quality before proceeding. Default is `true`. -- **Skip labels** – Skip the featurization process for attributes marked as labels. Default is `false`. - -**Feature Storage Options** -These settings control where the generated features are stored. - -- **Overwrite FS graph** – Whether to overwrite the Feature Store graph if features were previously generated. Default is `false`, therefore features are written to an existing graph Featere store graph. -- **Write to source graph** – Whether to store the generated features on the Source Graph. Default is `true`. -- **Use feature store** – Enable the use of the Feature Store database, which allows you to store features separately from your Source Database. Default is `false`, therefore feature are written to the source graph. - -**Advanced Settings: Handling Imperfect Data** - -Real-world data is often messy. It can have missing values or mismatched data types. This section allows you to define default rules for how the featurization process should handle these data quality issues for each feature type, preventing the job from failing unexpectedly. 
- -For each feature type (Text, Numeric, Category, and Label), you can set two types of strategies: - -**Missing Strategy:** Defines what to do when an attribute is completely missing from a document. - -**Mismatch Strategy:** Defines what to do when an attribute exists but has the wrong data type. - -**Missing Value Strategies** - -**Raise:** The job immediately stops and reports an error if a data type mismatch is found. Use this when any type mismatch indicates a critical data error. - -**Replace:** Replaces a missing value with a default you provide (e.g., 0 for numbers, "unknown" for text). The job then continues. Use this when missing values are expected. - -**Mismatch Value Strategies** - -**Raise:** The strictest option. The job will immediately stop and report an error if a data type mismatch is found. Use this when any type mismatch indicates a critical data error. - -**Replace:** If a mismatch is found, the system immediately replaces the value with the default you specify, without attempting to convert it first. Use this when you don't trust the mismatched data and prefer to substitute it directly. - -**Coerce and raise:** A balanced approach. The system first tries to convert (coerce) the value to the correct type (e.g., string "123" to number 123). If the conversion is successful, it uses the new value. If it fails, the job stops. This is often the best default strategy. - -**Coerce and replace:** The most forgiving option. The system first tries to convert the value. If the conversion fails, it replaces the value with the default you specify and continues the job. Use this for very "dirty" datasets where completing the job is the highest priority. - -Once all selections are done, click the **Begin featurization** button. This triggers a **node embedding-compatible featurization job**. Once the job status changes to **Ready for training**, you can start the **ML Training** step. 
- -![Navigate to Featurization](../../../images/graph-ml-ui-featurization.png) - -## Training Phase - -The training is the second step in the ML workflow after featurization. In the training phase, you configure and launch a machine learning training job on your graph data. - - -- **Select a training job type** – From the **Select a type of training job** dropdown, choose the type of model you want to train (Example, Node Classification, Node Embedding). - - -#### Node Classification - -Node Classification is used to categorize the nodes in your graph based on their features and structural connections within the graph. - -**Use cases include:** -- Entity categorization (Example, movies into genres, users into segments) -- Fraud detection in transaction networks - -**Configuration Parameters:** -- **Type of Training Job:** Node classification -- **Target Vertex Collection:** Choose the collection to classify (Example, `movie`) -- **Batch Size:** The nummer of documents processed in a single training iteration. (Example, 256) -- **Data Load Batch Size:** The number of documents loaded from ArangoDB into memory in a single batch during the data loading phase. (Example, 50000) -- **Data Load Parallelism:** The number of parallel processes used when loading data from ArangoDB into memory for trainnig. (Example, 10) - -After setting these values, click the **Begin training** button to start the job. - -![Node Classification](../../../images/ml-nodeclassification.png) - - #### Node Embedding - -Node Embedding is used to generate vector embeddings (dense numerical representations) of graph nodes that capture structural and feature-based information. 
- -**Use cases include:** -- Similarity search (Example, finding similar products, users, or documents) -- Link prediction (Example, suggesting new connections) - -**Configuration Parameters:** -- **Type of Training Job:** Node embeddings -- **Target Vertex Collection:** Select the collection to generate embeddings for (Example, `movie` or `person`) -- No label is required for training in this mode - -Once the configuration is complete, click **Begin training** to launch the embedding job. - -![Node Embeddings](../../../images/ml-node-embedding.png) - - -After training is complete, the next step in the ArangoGraphML workflow is **Model Selection**. - -## Model Selection Phase - -Once the training is finished, the job status updates to **READY FOR MODEL SELECTION**. This means the model has been trained using the provided vertex and edge data and is now ready for evaluation. - -**Understanding Vertex Collections:** - -**X Vertex Collection:** These are the source nodes used during training. They represent the full set of nodes on which features were computed (Example, person, movie). - -**Y Vertex Collection:** These are the target nodes that contain labeled data. The labels in this collection are used to supervise the training process and are the basis for evaluating prediction quality. - -The target collection is where the model's predictions are stored when running a prediction job. - -**Model Selection Interface:** - -A list of trained models is displayed, along with performance metrics (Accuracy, Precision, Recall, F1 score, Loss). -Review the results of different model runs and configurations. - -Select the best performing model suitable for your prediction task. - -![Model Selection](../../../images/graph-ml-model.png) - -## Prediction Phase - -After selecting a model, you can create a Prediction Job. The Prediction Job generates predictions and persists them to the source graph, either in a new collection or within the source documents. 
- -### Overview - -The Prediction interface allows inference to be run using the selected model. It enables configuration of how predictions are executed, which collections are involved, and whether new or outdated documents should be automatically featurized before prediction. - -You have two important options for this: - -**Featurize new documents:** Enable this option to generate features for documents that have been added since the model was trained. This is useful for getting predictions on new data without having to retrain the model. - -**Featurize outdated documents:** Enable this option to re-generate features for documents that have been modified since the last featurization. This ensures your predictions reflect the latest changes to your data. -In addition to these settings, you will also define the target data, where to store results, and whether to run the job on a recurring schedule. - -In addition to these settings, you also define the target data, where to store results, and whether to run the job on a recurring schedule. - -![prediction phase](../../../images/graph-prediction.png) - -### Configuration Options -The Prediction screen displays the following configuration options: - -- Selected Model: Displays the model selected during the Model Selection phase. This model will be used to perform inference. - -- Target Vertex Collection: This is the vertex collection on which predictions are applied. - -- Prediction Type: Depending on the training job (for example, classification or embedding), the prediction outputs class labels or updated embeddings. - -### Featurization Settings -Two toggles are available to control automatic featurization during prediction - -**Featurize New Documents:** -This option controls whether newly added documents are automatically featurized. It is useful when new data arrives after training, allowing predictions to continue without requiring a full retraining process. 
- -**Featurize Outdated Documents:** -Enable or disable the featurization of outdated documents. Outdated documents are those whose attributes (used during featurization) have changed since the last feature computation. This ensures prediction results are based on up-to-date information. - -These options provide flexibility in handling dynamic graph data and keeping predictions relevant without repeating the entire ML workflow. - -**Data load batch size** – Specifies the number of documents to load in a single batch (Example, 500000). - -**Data load parallelism** – Number of parallel threads used to process the prediction workload (Example, 10). - -**Prediction field** – The field in the documents where the predicted values are stored. - -### Enable Scheduling - -You can configure automatic predictions using the **Enable scheduling** checkbox. - -When scheduling is turned on, predictions run automatically based on a set CRON expression. This helps keep prediction results up to date as new data is added to the system. - -#### Schedule (CRON expression) - -You can define a CRON expression that sets when the prediction job should run. For example: -0 0 1 1 * -This CRON pattern executes the prediction every year on January 1st at 00:00. - -Below the CRON field, a user-friendly scheduling interface helps translate it: - -- **Period**: Options include *Hourly*, *Daily*, *Weekly*, *Monthly*, or *Yearly*. -- **Month**: *(Example, January)* -- **Day of Month**: *(Example, 1)* -- **Day of Week**: *(optional)* -- **Hours and Minutes**: Set the exact time for execution *(Example, 0:00)* - - -### Execute Prediction -After reviewing the configuration, click the Run Prediction button. 
ArangoGraphML then: - -- Perform featurization - -- Run inference using the selected model - -- Write prediction results into the target vertex collection or a specified output location - -Once prediction is complete, you can analyze the results directly in the Web Interface or export them for downstream use. diff --git a/site/content/3.13/data-science/arangographml/_index.md b/site/content/3.13/data-science/graphml/_index.md similarity index 70% rename from site/content/3.13/data-science/arangographml/_index.md rename to site/content/3.13/data-science/graphml/_index.md index e8d6ea4137..ba8e2d6c46 100644 --- a/site/content/3.13/data-science/arangographml/_index.md +++ b/site/content/3.13/data-science/graphml/_index.md @@ -1,18 +1,23 @@ --- -title: ArangoGraphML -menuTitle: ArangoGraphML +title: ArangoDB GraphML +menuTitle: GraphML weight: 125 description: >- - Enterprise-ready, graph-powered machine learning as a cloud service or self-managed + Boost your machine learning models with graph data using ArangoDB's advanced GraphML capabilities aliases: - - graphml + - arangographml --- Traditional Machine Learning (ML) overlooks the connections and relationships between data points, which is where graph machine learning excels. However, accessibility to GraphML has been limited to sizable enterprises equipped with -specialized teams of data scientists. ArangoGraphML simplifies the utilization of GraphML, +specialized teams of data scientists. ArangoDB simplifies the utilization of Graph Machine Learning, enabling a broader range of personas to extract profound insights from their data. +With ArangoDB, you can solve high-computational graph problems using Graph Machine +Learning. Apply it on a selected graph to predict connections, get better product +recommendations, classify nodes, and perform node embeddings. You can configure and run +the whole machine learning flow entirely through the web interface or programmatically. 
+ ## How GraphML works Graph machine learning leverages the inherent structure of graph data, where @@ -21,18 +26,29 @@ traditional ML, which primarily operates on tabular data, GraphML applies specialized algorithms like Graph Neural Networks (GNNs), node embeddings, and link prediction to uncover complex patterns and insights. +The underlying framework for ArangoDB's GraphML is **[GraphSAGE](https://snap.stanford.edu/graphsage/)**. +GraphSAGE (Graph Sample and AggreGatE) is a powerful Graph Neural Network (GNN) +**framework** designed for inductive representation learning on large graphs. +It is used to generate low-dimensional vector representations for nodes and is +especially useful for graphs that have rich node attribute information. +The overall process involves the following steps: + 1. **Graph Construction**: - Raw data is transformed into a graph structure, defining nodes and edges based + - Raw data is transformed into a graph structure, defining nodes and edges based on real-world relationships. -2. **Featurization**: - Nodes and edges are enriched with features that help in training predictive models. -3. **Model Training**: - Machine learning techniques are applied on GNNs to identify patterns and make predictions. +2. **Featurization**: Your raw graph data is transformed into numerical representations that the model can understand. + - The system iterates over your selected vertices and converts their attributes: booleans become `0` or `1`, numbers are normalized, and text attributes are converted into numerical vectors using sentence transformers. + - All of these numerical features are then combined (concatenated). + - Finally, **Incremental PCA** (Incremental Principal Component Analysis, a dimensionality reduction technique) is used to reduce the size of the combined features, which helps remove noise and keep only the most important information. +3. 
**Training**: The model learns from the graph's structure by sampling and aggregating information from each node's local neighborhood. + - For each node, GraphSAGE looks at connections up to **2 hops away**. + - Specifically, it uniformly samples up to **25 direct neighbors** (depth 1) and for each of those, it samples up to **10 of their neighbors** (depth 2). + - By aggregating feature information from this sampled neighborhood, the model creates a rich "embedding" for each node that captures both its own features and its role in the graph. 4. **Inference & Insights**: - The trained model is used to classify nodes, detect anomalies, recommend items, + - The trained model is used to classify nodes, detect anomalies, recommend items, or predict future connections. -ArangoGraphML streamlines these steps, providing an intuitive and scalable +ArangoDB streamlines these steps, providing an intuitive and scalable framework to integrate GraphML into various applications, from fraud detection to recommendation systems. @@ -40,16 +56,18 @@ to recommendation systems. ![GraphML Workflow](../../../images/GraphML-How-it-works.webp) -It is no longer necessary to understand the complexities involved with graph -machine learning, thanks to the accessibility of the ArangoML package. -Solutions with ArangoGraphML only require input from a user about -their data, and the ArangoGraphML managed service handles the rest. +You no longer need to understand the complexities of graph machine learning to +benefit from it. Solutions with ArangoDB's GraphML only require input from a user about +their data, and the GraphML managed service handles the rest. The platform comes preloaded with all the tools needed to prepare your graph for machine learning, high-accuracy training, and persisting predictions back to the database for application use. 
-## Supported Tasks +## What you can do with GraphML + +GraphML directly supports two primary machine learning tasks: +**Node Classification** and **Node Embeddings**. ### Node Classification @@ -58,7 +76,7 @@ predict the label of a node based on both its own features and its relationships within the graph. It requires a set of labeled nodes to train a model, which then classifies unlabeled nodes based on learned patterns. -**How it works in ArangoGraphML** +**How it works in ArangoDB** - A portion of the nodes in a graph is labeled for training. - The model learns patterns from both **node features** and @@ -97,7 +115,7 @@ into numerical vector representations, preserving their **structural relationshi within the graph. Unlike simple feature aggregation, node embeddings **capture the influence of neighboring nodes and graph topology**, making them powerful for downstream tasks like clustering, anomaly detection, -and link prediction. These combinations can provide valuable insights. +and link prediction. This combination provides valuable insights. Consider using [ArangoDB's Vector Search](https://arangodb.com/2024/11/vector-search-in-arangodb-practical-insights-and-hands-on-examples/) capabilities to find similar nodes based on their embeddings. @@ -116,7 +134,7 @@ Essentially, they aggregate both the node's attributes and the connectivity patt within the graph. This fusion helps capture not only the individual properties of a node but also its position and role within the network. -**How it works in ArangoGraphML** +**How it works in ArangoDB** - The model learns an embedding (a vector representation) for each node based on its **position within the graph and its connections**. @@ -161,21 +179,21 @@ a node but also its position and role within the network. 
| **Key Advantage** | Learns labels based on node connections and attributes | Learns structural patterns and node relationships | | **Use Cases** | Fraud detection, customer segmentation, disease classification | Recommendations, anomaly detection, link prediction | -ArangoGraphML provides the infrastructure to efficiently train and apply these +GraphML provides the infrastructure to efficiently train and apply these models, helping users extract meaningful insights from complex graph data. ## Metrics and Compliance -ArangoGraphML supports tracking your ML pipeline by storing all relevant metadata +GraphML supports tracking your ML pipeline by storing all relevant metadata and metrics in a Graph called ArangoPipe. This is only available to you and is never viewable by ArangoDB. This metadata graph links all experiments to the source data, feature generation activities, training runs, and prediction jobs, allowing you to track the entire ML pipeline without having to leave ArangoDB. -### Security +## Security -Each deployment that uses ArangoGraphML has an `arangopipe` database created, +Each deployment that uses GraphML has an `arangopipe` database created, which houses all ML Metadata information. Since this data lives within the deployment, it benefits from the ArangoGraph security features and SOC 2 compliance. -All ArangoGraphML services live alongside the ArangoGraph deployment and are only +All GraphML services live alongside the ArangoGraph deployment and are only accessible within that organization. 
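Reviewer note on the GraphSAGE sampling text added to `graphml/_index.md` above (up to 25 direct neighbors at depth 1, then up to 10 neighbors of each at depth 2): the following is a minimal sketch of that two-hop uniform sampling, which might be useful as an illustration in the docs. It is not ArangoDB's implementation. The function names `sample_neighbors` and `sample_two_hop`, the adjacency-dictionary input, and the `seed` parameter are all assumptions made for this example.

```python
import random


def sample_neighbors(adjacency, node, limit, rng):
    """Uniformly sample up to `limit` neighbors of `node` without replacement.

    `adjacency` is a hypothetical dict mapping a node ID to a list of neighbor IDs.
    """
    neighbors = adjacency.get(node, [])
    if len(neighbors) <= limit:
        # Fewer neighbors than the limit: keep all of them.
        return list(neighbors)
    return rng.sample(neighbors, limit)


def sample_two_hop(adjacency, node, fanouts=(25, 10), seed=None):
    """Sample a 2-hop neighborhood: up to fanouts[0] direct neighbors,
    then up to fanouts[1] neighbors for each sampled direct neighbor."""
    rng = random.Random(seed)
    depth1 = sample_neighbors(adjacency, node, fanouts[0], rng)
    depth2 = {n: sample_neighbors(adjacency, n, fanouts[1], rng) for n in depth1}
    return depth1, depth2


# Toy graph (illustrative data only): node "a" has 40 neighbors,
# and each of those has 15 neighbors of its own.
toy = {"a": [f"n{i}" for i in range(40)]}
for i in range(40):
    toy[f"n{i}"] = [f"m{i}_{j}" for j in range(15)]

direct, second_hop = sample_two_hop(toy, "a", seed=42)
```

The sketch covers only the sampling step. Aggregating the features of the sampled nodes into an embedding, which the text describes next, is left out on purpose.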
diff --git a/site/content/3.13/data-science/arangographml/getting-started.md b/site/content/3.13/data-science/graphml/notebooks-api.md similarity index 79% rename from site/content/3.13/data-science/arangographml/getting-started.md rename to site/content/3.13/data-science/graphml/notebooks-api.md index 6bd614167e..c9ade4cacf 100644 --- a/site/content/3.13/data-science/arangographml/getting-started.md +++ b/site/content/3.13/data-science/graphml/notebooks-api.md @@ -1,29 +1,39 @@ --- -title: Getting Started with ArangoGraphML -menuTitle: Getting Started -weight: 10 +title: How to use GraphML in a scriptable manner +menuTitle: Notebooks & API +weight: 15 description: >- - How to control all resources inside ArangoGraphML in a scriptable manner + Control all resources inside GraphML via Jupyter Notebooks or Python API aliases: - getting-started-with-arangographml + - ../arangographml/getting-started + - ../arangographml-getting-started-with-arangographml --- -ArangoGraphML provides an easy-to-use & scalable interface to run Graph Machine Learning on ArangoDB Data. Since all of the orchestration and ML logic is managed by ArangoGraph, all that is typically required are JSON specifications outlining individual processes to solve an ML Task. If you are using the self-managed solution, additional configurations may be required. -The `arangoml` is a Python Package allowing you to manage all of the necessary ArangoGraphML components, including: -- **Project Management**: Projects are a metadata-tracking entity that sit at the top level of ArangoGraphML. All activities must link to a project. -- **Featurization**: The step of converting human-understandable data to machine-understandable data (i.e features), such that it can be used to train Graph Neural Networks (GNNs). -- **Training**: Train a set of models based on the name of the generated/existing features, and a definition of the ML Task we want to solve (e.g Node Classification, Embedding Generation). 
+{{< tag "ArangoDB Platform" >}} + +The ArangoDB Platform provides an easy-to-use & scalable interface to run +Graph Machine Learning on ArangoDB data. Since all the orchestration and Machine Learning logic is +managed by ArangoDB, all that is typically required are JSON specifications outlining +individual processes to solve a Machine Learning task. + +The `arangoml` Python package allows you to manage all the necessary +GraphML components, including: +- **Project Management**: Projects are a metadata-tracking entity that sits at + the top level of ArangoDB GraphML. All activities must link to a project. +- **Featurization**: The step of converting human-understandable data to + machine-understandable data (i.e. features), such that it can be used to + train Graph Neural Networks (GNNs). +- **Training**: Train a set of models based on the name of the generated/existing + features, and a definition of the ML task you want to solve (e.g. Node Classification, Embedding Generation). - **Model Selection**: Select the best model based on the metrics generated during training. -- **Predictions**: Generate predictions based on the selected model, and persit the results to the source graph (either in the source document, or in a new collection). +- **Predictions**: Generate predictions based on the selected model, and persist + the results to the source graph (either in the source document, or in a new collection). -{{< tip >}} -To enable the ArangoGraphML services in the ArangoGraph platform, -[get in touch](https://www.arangodb.com/contact/) -with the ArangoDB team. Regular notebooks in ArangoGraph don't include the -`arangoml` package. -{{< /tip >}} - -ArangoGraphML's suite of services and packages is driven by **"specifications"**. These specifications are standard Python dictionaries that describe the task being performed, & the data being used. The ArangoGraphML services work closely together, with the previous task being used as the input for the next. 
+GraphML's suite of services and packages is driven by **"specifications"**. +These specifications are standard Python dictionaries that describe the task +being performed, and the data being used. The GraphML services work closely +together, with the previous task being used as the input for the next. Let's take a look at using the `arangoml` package to: @@ -35,13 +45,9 @@ Let's take a look at using the `arangoml` package to: ## Initialize ArangoML -{{< tabs "arangoml" >}} - -{{< tab "ArangoGraphML" >}} - **API Documentation: [arangoml.ArangoMLMagics.enable_arangoml](https://arangoml.github.io/arangoml/magics.html#arangoml.magic.ArangoMLMagics.enable_arangoml)** -The `arangoml` package comes pre-loaded with every ArangoGraphML notebook environment. +The `arangoml` package comes pre-loaded with every GraphML notebook environment. To start using it, simply import it, and enable it via a Jupyter Magic Command. ```py @@ -49,116 +55,12 @@ arangoml = %enable_arangoml ``` {{< tip >}} -ArangoGraphML comes with other ArangoDB Magic Commands! See the full list [here](https://arangoml.github.io/arangoml/magics.html). +GraphML comes with other ArangoDB Magic Commands! See the full list [here](https://arangoml.github.io/arangoml/magics.html). {{< /tip >}} -{{< /tab >}} - -{{< tab "Self-managed" >}} - -**API Documentation: [arangoml.ArangoML](https://arangoml.github.io/arangoml/client.html#arangoml.main.ArangoML)** - -The `ArangoML` class is the main entry point for the `arangoml` package. -It has the following parameters: -- `client`: An instance of arango.client.ArangoClient. Defaults to `None`. If not provided, the **hosts** argument must be provided. -- `hosts`: The ArangoDB host(s) to connect to. This can be a single host, or a - list of hosts. -- `username`: The ArangoDB username to use for authentication. -- `password`: The ArangoDB password to use for authentication. -- `user_token`: The ArangoDB user token to use for authentication. 
- This is an alternative to username/password authentication. -- `ca_cert_file`: The path to the CA certificate file to use for TLS - verification. Defaults to `None`. -- `api_endpoint`: The URL to the ArangoGraphML API Service. -- `settings_files`: A list of secrets files to be loaded as settings. Parameters provided as arguments will override those in the settings files (e.g `settings.toml`). -- `version`: The ArangoML API date version. Defaults to the latest version. - -It is possible to instantiate an ArangoML object in multiple ways: - -1. Via parameters -```py -from arangoml import ArangoML - -arangoml = ArangoML( - hosts="http://localhost:8529" - username="root", - password="password", - # ca_cert_file="/path/to/ca.pem", - # user_token="..." - api_endpoint="http://localhost:8501", -) -``` - -2. Via parameters and a custom `ArangoClient` instance -```py -from arangoml import ArangoML -from arango import ArangoClient - -client = ArangoClient( - hosts="http://localhost:8529", - verify_override="/path/to/ca.pem", - hosts_resolver=..., - ... -) - -arangoml = ArangoML( - client=client, - username="root", - password="password", - # user_token="..." - api_endpoint="http://localhost:8501", -) -``` - -3. Via environment variables -```py -import os -from arangoml import ArangoML - -os.environ["ARANGODB_HOSTS"] = "http://localhost:8529" -os.environ["ARANGODB_CA_CERT_FILE"]="/path/to/ca.pem" -os.environ["ARANGODB_USER"] = "root" -os.environ["ARANGODB_PW"] = "password" -# os.environ["ARANGODB_USER_TOKEN"] = "..." -os.environ["ML_API_SERVICES_ENDPOINT"] = "http://localhost:8501" - -arangoml = ArangoML() -``` - -4. Via configuration files -```py -import os -from arangoml import ArangoML - -arangoml = ArangoML(settings_files=["settings_1.toml", "settings_2.toml"]) -``` - -5. 
Via a Jupyter Magic Command - -**API Documentation: [arangoml.ArangoMLMagics.enable_arangoml](https://arangoml.github.io/arangoml/magics.html#arangoml.magic.ArangoMLMagics.enable_arangoml)** - -``` -%load_ext arangoml -%enable_arangoml -``` -{{< info >}} -This assumes you are working out of a Jupyter Notebook environment, and -have set the environment variables in the notebook environment with user -authentication that has **_system** access. -{{< /info >}} - -{{< tip >}} -Running `%load_ext arangoml` also provides access to other [ArangoGraphML -Jupyter Magic Commands](https://arangoml.github.io/arangoml/magics.html). -{{< /tip >}} - -{{< /tab >}} - -{{< /tabs >}} - ## Load the database -This example is using ArangoML to predict the **class** of `Events` in a +This example is using GraphML to predict the **class** of `Events` in a Knowledge Graph constructed from the [GDELT Project](https://www.gdeltproject.org/). > GDELT monitors the world's news media from nearly every corner of every @@ -180,15 +82,9 @@ news sources, and locations are interconnected into a large graph. ![Example Event](../../../images/ArangoML_open_intelligence_visualization.png) -Let's get started! - -{{< tabs "arangoml" >}} - -{{< tab "ArangoGraphML" >}} - The [`arango-datasets`](../../components/tools/arango-datasets.md) Python package -allows you to load pre-defined datasets into ArangoDB. It comes pre-installed in the -ArangoGraphML notebook environment. +allows you to load pre-defined datasets into ArangoDB Platform. It comes pre-installed in the +GraphML notebook environment. ```py DATASET_NAME = "OPEN_INTELLIGENCE_ANGOLA" @@ -199,42 +95,11 @@ DATASET_NAME = "OPEN_INTELLIGENCE_ANGOLA" %load_dataset {DATASET_NAME} ``` -{{< /tab >}} - -{{< tab "Self-managed" >}} - -The [`arango-datasets`](../../components/tools/arango-datasets.md) Python package -allows you to load pre-defined datasets into ArangoDB. 
It can be installed with the -following command: - -``` -pip install arango-datasets -``` - -```py -from arango_datasets.datasets import Datasets - -DATASET_NAME = "OPEN_INTELLIGENCE_ANGOLA" - -db = arangoml.client.db( - name=DATASET_NAME, - username=arangoml.settings.get("ARANGODB_USER"), - password=arangoml.settings.get("ARANGODB_PW"), - user_token=arangoml.settings.get("ARANGODB_USER_TOKEN"), - verify=True -) - -Datasets(dataset_db).load(DATASET_NAME) -``` -{{< /tab >}} - -{{< /tabs >}} - ## Projects **API Documentation: [ArangoML.projects](https://arangoml.github.io/arangoml/api.html#projects)** -Projects are an important reference used throughout the entire ArangoGraphML +Projects are an important reference used throughout the entire GraphML lifecycle. All activities link back to a project. The creation of the project is very simple. @@ -410,7 +275,6 @@ Once a Featurization Job has been submitted, you can wait for it to complete usi featurization_job_result = arangoml.wait_for_featurization(featurization_job.job_id) ``` - **Example Output:** ```py { @@ -511,12 +375,11 @@ You can also cancel a Featurization Job using the `arangoml.jobs.cancel_job` met arangoml.jobs.cancel_job(prediction_job.job_id) ``` - ## Training **API Documentation: [ArangoML.jobs.train](https://arangoml.github.io/arangoml/api.html#agml_api.jobs.v1.api.jobs_api.JobsApi.train)** -Training Graph Machine Learning Models with ArangoGraphML requires two steps: +Training Graph Machine Learning Models with GraphML requires two steps: 1. Describe which data points should be included in the Training Job. 2. Pass the Training Specification to the Training Service. @@ -553,12 +416,13 @@ A Training Specification allows for concisely defining your training task in a single object and then passing that object to the training service using the Python API client, as shown below. 
-The ArangoGraphML Training Service is responsible for training a series of +The GraphML Training Service is responsible for training a series of Graph Machine Learning Models using the data provided in the Training Specification. It assumes that the data has been featurized and is ready to be used for training. -Given that we have run a Featurization Job, we can create the Training Specification using the `featurization_job_result` object returned from the Featurization Job: +Given that we have run a Featurization Job, we can create the Training Specification +using the `featurization_job_result` object returned from the Featurization Job: ```py # 1. Define the Training Specification @@ -736,10 +600,11 @@ arangoml.jobs.cancel_job(training_job.job_id) ## Model Selection Model Statistics can be observed upon completion of a Training Job. -To select a Model, the ArangoGraphML Projects Service can be used to gather +To select a Model, the GraphML Projects Service can be used to gather all relevant models and choose the preferred model for a Prediction Job. -First, let's list all the trained models using [ArangoML.list_models](https://arangoml.github.io/arangoml/client.html#arangoml.main.ArangoML.list_models): +First, let's list all the trained models using +[ArangoML.list_models](https://arangoml.github.io/arangoml/client.html#arangoml.main.ArangoML.list_models): ```py # 1. List all trained Models @@ -752,7 +617,11 @@ models = arangoml.list_models( print(len(models)) ``` -The cell below selects the model with the highest **test accuracy** using [ArangoML.get_best_model](https://arangoml.github.io/arangoml/client.html#arangoml.main.ArangoML.get_best_model), but there may be other factors that motivate you to choose another model. See the `model_statistics` in the output field below for more information on the full list of available metrics. 
+The cell below selects the model with the highest **test accuracy** using +[ArangoML.get_best_model](https://arangoml.github.io/arangoml/client.html#arangoml.main.ArangoML.get_best_model), +but there may be other factors that motivate you to choose another model. See +the `model_statistics` in the output field below for more information on the +full list of available metrics. ```py @@ -848,7 +717,10 @@ collection, or within the source documents. - `modelID`: The model ID to use for generating predictions. - `featurizeNewDocuments`: Boolean for enabling or disabling the featurization of new documents. Useful if you don't want to re-train the model upon new data. Default is `false`. - `featurizeOutdatedDocuments`: Boolean for enabling or disabling the featurization of outdated documents. Outdated documents are those whose features have changed since the last featurization. Default is `false`. -- `schedule`: A cron expression to schedule the prediction job (e.g `0 0 * * *` for daily predictions). Default is `None`. +- `schedule`: A cron expression to schedule the prediction job. The cron syntax is a set of + five fields in a line, indicating when the job should be executed. The format must follow + the following order: `minute` `hour` `day-of-month` `month` `day-of-week` + (e.g. `0 0 * * *` for daily predictions at 00:00). Default is `None`. - `embeddingsField`: The name of the field to store the generated embeddings. This is only used for Graph Embedding tasks. Default is `None`. 
```py diff --git a/site/content/3.13/data-science/graphml/quickstart.md b/site/content/3.13/data-science/graphml/quickstart.md new file mode 100644 index 0000000000..97b9fabeb3 --- /dev/null +++ b/site/content/3.13/data-science/graphml/quickstart.md @@ -0,0 +1,57 @@ +--- +title: How to get started with GraphML +menuTitle: Quickstart +weight: 5 +description: >- + You can use GraphML straight within the ArangoDB Platform, via the web interface + or via Notebooks +aliases: + - ../arangographml/deploy +--- + +## Web interface versus Jupyter Notebooks + +The ArangoDB Platform provides enterprise-ready Graph Machine Learning in two options, +tailored to suit diverse requirements and preferences: +- Using the web interface +- In a scriptable manner, using the integrated Jupyter Notebooks and the HTTP API for GraphML + +## Setup + +{{< tabs "graphml-setup" >}} + +{{< tab "Web Interface" >}} +The web interface of the ArangoDB Platform allows you to create, configure, and +run a full machine learning workflow for GraphML. To get started, see the +[Web interface for GraphML](ui.md) page. +{{< /tab >}} + +{{< tab "Notebooks" >}} +The ArangoDB Notebooks service runs on the +[ArangoGraph Insights Platform](https://dashboard.arangodb.cloud/home?utm_source=docs&utm_medium=cluster_pages&utm_campaign=docs_traffic). +It offers a pre-configured environment where everything, +including necessary components and configurations, comes preloaded. You don't +need to set up or configure the infrastructure, and can immediately start using the +GraphML functionalities in a scriptable manner. To get started, see the +[GraphML Notebooks & API](notebooks-api.md) reference documentation. + +{{< tip >}} +To get access to GraphML services and packages in ArangoGraph Insights Platform, +[get in touch](https://www.arangodb.com/contact/) +with the ArangoDB team. 
+{{< /tip >}} + +- **Accessible at all levels** + - Low code UI + - Notebooks + - APIs +- **Full usability** + - MLOps lifecycle + - Metrics + - Metadata capture + - Model management + +![ArangoGraphML Pipeline](../../../images/ArangoGraphML_Pipeline.png) +{{< /tab >}} + +{{< /tabs >}} \ No newline at end of file diff --git a/site/content/3.13/data-science/graphml/ui.md b/site/content/3.13/data-science/graphml/ui.md new file mode 100644 index 0000000000..45ef75dd6a --- /dev/null +++ b/site/content/3.13/data-science/graphml/ui.md @@ -0,0 +1,244 @@ +--- +title: How to use GraphML in the ArangoDB Platform web interface +menuTitle: Web Interface +weight: 10 +description: >- + Learn how to create, configure, and run a full machine learning workflow for + GraphML in four steps using the features in the ArangoDB web interface +--- +{{< tag "ArangoDB Platform" >}} + +## The GraphML workflow in the web interface + +The entire process is organized into sequential steps within a **Project**, +giving you a clear path from data to prediction: + +1. **Featurization**: Select your data and convert it into numerical representations. +2. **Training**: Train a GraphSAGE model on the features and graph structure. +3. **Model Selection**: Evaluate the trained models and choose the best one. +4. **Prediction**: Use the selected model to generate predictions on your data. + You can also automate the prediction process to run at regular intervals. + +## Create a GraphML project + +To create a new GraphML project using the ArangoDB Platform web interface, follow these steps: + +1. From the left-hand sidebar, select the database where you want to create the project. +2. In the left-hand sidebar, click **GenAI** to open the GraphML project management interface, then click **Run GraphML**. + ![Navigate to Data Science](../../../images/datascience-intro.jpg) +3. In the **GraphML projects** view, click **Add new project**. +4. The **Create ML project** modal opens. 
Enter a **Name** for your machine learning project. +5. Click the **Create project** button to finalize the creation. +6. After creation, the new project appears in the list under **GraphML projects**. + Click the project name to begin with a Featurization job. + +## Featurization phase + +After clicking on a project name, you are taken to a screen where you can +configure and start a new Featurization job. Follow these steps: +1. **Select a Graph**: In the **Features** section, choose your target graph from the **Select a graph** dropdown menu. +2. **Select Vertex Collection(s)**: Pick the vertex collection(s) that you want to include for feature extraction. +3. **Select Attributes**: Choose the attributes from your vertex collection to + convert into machine-understandable features. Attributes cannot be used if their values are lists or arrays. + +{{< info >}} +A metagraph is basically just the configured subset of a graph (the vertex and +edge collections and the specified attributes). This is what you see represented +in the metagraph object in the JSON specification on the right. +{{< /info >}} + +### Configuration options + +The featurization process has several configurable options, grouped into +**Configuration** and **Advanced** settings. These are also shown in a JSON +format on the right side of the screen for transparency. + +In the **Configuration** tab, you can control the overall featurization job and +how features are stored. +- **Batch size**: The number of documents to process in a single batch. +- **Run analysis checks**: Whether to run analysis checks to perform a high-level + analysis of the data quality before proceeding. The default value is `true`. +- **Skip labels**: Skip the featurization process for attributes marked as labels. + The default value is `false`. +- **Overwrite FS graph**: Whether to overwrite the Feature Store graph if features + were previously generated. 
The default value is `false`, therefore features are + written to an existing Feature Store graph. +- **Write to source graph**: Whether to store the generated features on the Source + Graph. The default value is `true`. +- **Use feature store**: Enable the use of the Feature Store database, which + allows you to store features separately from your Source Database. The default + value is `false`, therefore features are written to the source graph. + +### Handling imperfect data + +Real-world datasets often contain missing values or mismatched data types. Use +the strategies below to control how each feature type (**Text**, **Numeric**, +**Category**, **Label**) handles these issues during featurization. + +| **Strategy type** | **Option** | **Description** | **When to use** | +|-------------------|-----------------------|-----------------------------------------------------------------------------------------------------|---------------------------------------------------------------| +| Missing | **Raise** | Stops the job and reports an error when a value is missing. | When missing data indicates a critical issue. | +| | **Replace** | Substitutes missing values with a default you provide (e.g., `0` for numbers, `"unknown"` for text). | When missing values are expected. | +| Mismatch | **Raise** | The strictest option. Stops the job on any data type mismatch. | When any data type mismatch indicates a critical error. | +| | **Replace** | Replaces mismatched values with a default you provide, without trying to convert them first. | When mismatched values are unreliable, and you prefer to substitute them directly. | +| | **Coerce and Raise** | Attempts to convert (coerce) the value to the correct type (e.g. string "123" to number `123`). If the conversion is successful, it uses the new value. If it fails, the job stops. | A balanced approach, often the best default strategy. | +| | **Coerce and Replace**| The most forgiving option. 
The system first tries to convert the value. If it fails, it replaces the value with the specified default and continues the job. | For very dirty datasets where completing the job is the highest priority. | + +Once you’ve set your strategies, click **Begin featurization** to start the node +embedding-compatible featurization job. When the job status updates to +**Ready for training**, proceed to the **Training** step. + +![Navigate to Featurization](../../../images/graph-ml-ui-featurization.png) + +## Training phase + +The training is the second step in the ML workflow after featurization. +In the training phase, you configure and launch a machine learning training +job on your graph data. + +From the **Select a type of training job** dropdown menu, choose the type of +model you want to train (**Node Classification** or **Node Embeddings**). + +#### Node classification + +Node Classification is used to categorize the nodes in your graph based on their +features and structural connections within the graph. + +**Use cases include:** +- Entity categorization (e.g. movies into genres, users into segments) +- Fraud detection in transaction networks + +**Configuration parameters:** +- **Type of Training Job**: Node classification +- **Target Vertex Collection**: Choose the collection to classify (e.g. `movie`) +- **Batch Size**: The number of documents processed in a single training iteration. (e.g. `256`) +- **Data Load Batch Size**: The number of documents loaded from ArangoDB into memory in a single batch during the data loading phase (e.g. `50000`). +- **Data Load Parallelism**: The number of parallel processes used when loading data from ArangoDB into memory for training (e.g. `10`). + +After setting these values, click the **Begin training** button to start the job. 
+ +![Node Classification](../../../images/ml-nodeclassification.png) + +#### Node embeddings + +Node Embeddings are used to generate vector embeddings (dense numerical representations) +of graph nodes that capture structural and feature-based information. + +**Use cases include:** +- Similarity search (e.g. finding similar products, users, or documents) +- Link prediction (e.g. suggesting new connections) + +**Configuration parameters:** +- **Type of Training Job:** Node embeddings +- **Target Vertex Collection:** Select the collection to generate embeddings for (e.g. `movie` or `person`) +- No label is required for training in this mode + +**Understanding Vertex Collections:** + +- **Vertex Collection**: These are the source nodes used during training. + They represent the full set of nodes on which features were computed (e.g. `person`, `movie`). +- **Vertex Collection**: These are the target nodes that contain labeled data. + The labels in this collection are used to supervise the training process and + are the basis for evaluating prediction quality. + +The target collection is where the model's predictions are stored when running a prediction job. + +Once the configuration is complete, click **Begin training** to start the embedding job. + +![Node Embeddings](../../../images/ml-node-embedding.png) + +## Model selection phase + +Once the training is finished, the job status updates to **READY FOR MODEL SELECTION**. +This means the model has been trained using the provided vertex and edge data +and is now ready for evaluation. + +A list of trained models is displayed, along with performance metrics +(**Accuracy**, **Precision**, **Recall**, **F1 score**, **Loss**). Review the results of different +model runs and configurations. + +![GraphML Model Selection](../../../images/graph-ml-model.png) + +Select the best performing model suitable for your prediction task. You can also +open the **Confusion Matrix** to compare predicted values versus actual values. 
+ +![GraphML Confusion Matrix](../../../images/graphml-ui-confusion-matrix.png) + +## Prediction phase + +After selecting a model, you can create a Prediction Job. The Prediction Job +generates predictions and persists them to the source graph, either in a new +collection or within the source documents. + +The Prediction interface allows inference to be run using the selected model. +It enables configuration of how predictions are executed, which collections are +involved, and whether new or outdated documents should be automatically featurized +before prediction. + +You have two important options: + +- **Featurize new documents:** Enable this option to generate features for + documents that have been added since the model was trained. This is useful + for getting predictions on new data without having to retrain the model. +- **Featurize outdated documents:** Enable this option to re-generate features + for documents that have been modified. Outdated documents are those whose + attributes (used during featurization) have changed since the last feature + computation. This ensures your predictions reflect the latest changes to your data. + +In addition to these settings, you can also define the target data, where to store +results, and whether to run the job on a recurring schedule. + +These options provide flexibility in handling dynamic graph data and keeping +predictions relevant without repeating the entire ML workflow. + +- **Data load batch size**: Specifies the number of documents to load in a + single batch (e.g. `500000`). +- **Data load parallelism**: The number of parallel threads used to process + the prediction workload (e.g. `10`). +- **Prediction field**: The field in the documents where the predicted values are stored. 
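When scripting the same job, the two featurization options appear to correspond to the `featurizeNewDocuments` and `featurizeOutdatedDocuments` fields of the Prediction Specification described on the Notebooks & API page, alongside `modelID`, `schedule`, and `embeddingsField`. The following is a minimal sketch that covers only those five fields. The `prediction_options` helper is hypothetical, the model ID is a placeholder, and a complete specification requires further fields that are not shown here.

```py
def prediction_options(
    model_id,
    featurize_new=False,
    featurize_outdated=False,
    schedule=None,
    embeddings_field=None,
):
    """Build the optional part of a Prediction Specification (illustrative only)."""
    options = {
        "modelID": model_id,
        # Both flags default to false according to the documentation
        "featurizeNewDocuments": featurize_new,
        "featurizeOutdatedDocuments": featurize_outdated,
    }
    # Only include optional fields when they are set
    if schedule is not None:
        options["schedule"] = schedule
    if embeddings_field is not None:
        options["embeddingsField"] = embeddings_field
    return options


# Placeholder model ID; featurize new documents and run daily at 00:00
options = prediction_options(
    "<model-id>",
    featurize_new=True,
    schedule="0 0 * * *",
)
print(options)
```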
+ +![GraphML prediction phase](../../../images/graph-prediction.png) + +### Configuration options + +The Prediction screen displays the following configuration options: +- **Select Model**: Displays the model selected during the Model Selection phase. This model will be used to perform inference. +- **Target Vertex Collection**: This is the vertex collection on which predictions are applied. +- **Prediction Type**: Depending on the training job (for example, classification or embedding), the prediction outputs class labels or updated embeddings. + +### Enable scheduling + +You can configure automatic predictions using the **Enable scheduling** checkbox. +When scheduling is enabled, predictions run automatically based on a set CRON +expression. This helps keep prediction results up-to-date as new data is added to the system. + +You can define a cron expression that sets when the prediction job should run. +The cron syntax is a set of five fields in a line, indicating when the job should +be executed. The format must follow the following order: `minute` `hour` `day-of-month` `month` `day-of-week` +(e.g. `0 0 * * *` for daily predictions at 00:00, or `0 0 1 1 *` to execute the prediction +on January 1st at 00:00). + +When a field is set to an asterisk `*`, it means that any value is allowed for that field, +whenever the other field conditions are met. + +Below the CRON field, a user-friendly scheduling interface helps translate it: +- **Period**: Options include **Hourly**, **Daily**, **Weekly**, **Monthly**, or **Yearly**. +- **Month**: Indicates the month. For example, `1` for January. +- **Day of Month**: Indicates the day of the month. For example, `1` for + the first day of the month. +- **Day of Week** (optional): Indicates the day of the week. For example, + Monday is `1` and Tuesday is `2`. +- **Hours and Minutes**: Set the exact time for execution. For example, + if the hour is set to `8` and the minute to `0`, then the job runs at 8:00 AM. 
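The five-field cron format described above can be sketched in a few lines of Python. This is a simplified illustration, not the scheduler GraphML uses: the `parse_cron` and `matches` helpers are hypothetical, only `*` and single integers are supported (no ranges, lists, or step values), and all fields must match, which differs from standard cron when both day-of-month and day-of-week are restricted.

```py
from datetime import datetime

CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def parse_cron(expression):
    """Split a five-field cron expression into named fields."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(
            f"expected {len(CRON_FIELDS)} fields, got {len(parts)}: {expression!r}"
        )
    return dict(zip(CRON_FIELDS, parts))


def matches(expression, moment):
    """Check whether a datetime satisfies a simplified cron expression."""
    fields = parse_cron(expression)
    actual = {
        "minute": moment.minute,
        "hour": moment.hour,
        "day_of_month": moment.day,
        "month": moment.month,
        # Map to cron numbering: Sunday is 0, Monday is 1, Tuesday is 2, ...
        "day_of_week": moment.isoweekday() % 7,
    }
    # An asterisk allows any value; otherwise the field must equal the number
    return all(
        value == "*" or int(value) == actual[name]
        for name, value in fields.items()
    )


# Daily at 00:00
print(matches("0 0 * * *", datetime(2025, 3, 5, 0, 0)))
# Only on January 1st at 00:00
print(matches("0 0 1 1 *", datetime(2026, 1, 1, 0, 0)))
# Mondays at 8:00 AM
print(matches("0 8 * * 1", datetime(2025, 3, 3, 8, 0)))
```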
+ +### Execute prediction + +After reviewing the configuration, click the **Run Prediction** button. +Once prediction is complete, you can analyze the results directly in +the Web Interface or export them for downstream use. + +## Limitations + +- **Edge Attributes**: The current version of GraphML does not support the use of edge attributes as features. +- **Dangling Edges**: Edges that point to non-existent vertices ("dangling edges") are not caught during the featurization analysis. They may cause errors later, during the Training phase. +- **Memory Usage**: Both featurization and training can be memory-intensive. Out-of-memory errors can occur on large graphs with insufficient system resources. \ No newline at end of file diff --git a/site/content/3.13/index-and-search/indexing/working-with-indexes/vector-indexes.md b/site/content/3.13/index-and-search/indexing/working-with-indexes/vector-indexes.md index 90b82edbf9..236093878b 100644 --- a/site/content/3.13/index-and-search/indexing/working-with-indexes/vector-indexes.md +++ b/site/content/3.13/index-and-search/indexing/working-with-indexes/vector-indexes.md @@ -33,7 +33,7 @@ startup option needs to be enabled on the deployment you want to restore to. {{< /warning >}} 1. Enable the experimental vector index feature. -2. Calculate vector embeddings using [ArangoDB's GraphML](../../../data-science/arangographml/_index.md) +2. Calculate vector embeddings using [ArangoDB's GraphML](../../../data-science/graphml/_index.md) capabilities (available in ArangoGraph) or using external tools. Store each vector as an attribute in the respective document. 3. Create a vector index over this attribute. You need to choose which diff --git a/site/content/images/graphml-ui-confusion-matrix.png b/site/content/images/graphml-ui-confusion-matrix.png new file mode 100644 index 0000000000..1ddfa5f9b8 Binary files /dev/null and b/site/content/images/graphml-ui-confusion-matrix.png differ