
[FEATURE] Single Node cluster #411

Closed
kamakay opened this issue Nov 14, 2020 · 13 comments · Fixed by #454
Labels
azure Occurring on Azure cloud

Comments

@kamakay

kamakay commented Nov 14, 2020

Dear all,

I would appreciate any help adding a single node cluster on Azure Databricks using

  • terraform: v0.13.4
  • azurerm: 2.32.0
  • databrickslab tf provider: 0.2.7

Here is my code:

resource "databricks_cluster" "single_node_etl" {
  cluster_name            = "SINGLENODE-ETL"
  spark_version           = "6.6.x-scala2.11"
  node_type_id            = "Standard_DS3_v2"
  driver_node_type_id    = "Standard_DS3_v2"
  autotermination_minutes = 15
  num_workers = 0
  
  spark_conf = {
    "spark.master" = "local[*]"        # Single Node
    "spark.databricks.cluster.profile" = "singleNode",     # Single Node
    "spark.databricks.passthrough.enabled" = true,             # AAD Passthrough
  }
}

but every time I run terraform apply I receive the error:

Error: Missing required field: size

I have tried different configurations but nothing appears to help me.

Any help will be really appreciated.

Thx

@sudeepgupta90

We would also like to use a single node cluster. Has this been solved, or is this still an issue?

@kamakay
Author

kamakay commented Dec 3, 2020

Hi, as far as I can tell, no one has provided a solution yet.

@nfx
Contributor

nfx commented Dec 3, 2020

@sudeepgupta90 @kamakay I'm currently busy preparing the next release. If you want to speed things up, pull requests are more than welcome :)

@nfx nfx changed the title [ISSUE] Single Node cluster [FEATURE] Single Node cluster Dec 7, 2020
@lyogev

lyogev commented Dec 29, 2020

I see this was committed but it's still not working:
c28b209

Any idea why?

@kamakay
Author

kamakay commented Dec 29, 2020

What kind of problem have you found? Do you get the issue when executing the Terraform script?

@lyogev

lyogev commented Dec 29, 2020

When I set num_workers to 0 I get Error: Missing required field: size with 0.2.9.

@nfx
Contributor

nfx commented Dec 31, 2020

@lyogev @kamakay c28b209 was breaking the databricks_job resource, as the change was not well tested. Starting from 0.2.7, both cluster and job resources rely on the same internal struct to keep configuration behavior consistent.

It was breaking the release, so I've reverted that commit. It will be fixed somewhere in 0.3.x.

@lyogev @kamakay @sudeepgupta90 @kung-foo @tkasu what use cases do you foresee for this feature? Would it drive DBU usage up?

@kamakay
Author

kamakay commented Dec 31, 2020

Hi, thanks for the feedback. I think the first thing to understand is whether the underlying Databricks API supports num_workers = 0 for single node clusters.
A single node cluster is a supported Databricks cluster type, used for small workloads or ingestion activities that don't need a fully distributed processing cluster but still benefit from the data management features provided by Spark.

alexott added a commit that referenced this issue Jan 4, 2021
@nfx nfx added this to the v0.3.0 milestone Jan 12, 2021
@nfx nfx closed this as completed in #454 Jan 22, 2021
nfx pushed a commit that referenced this issue Jan 22, 2021
@nfx nfx added the azure Occurring on Azure cloud label Feb 23, 2022
@JeremyDemlow

Plan: 2 to add, 0 to change, 0 to destroy.
databricks_cluster.modeling_cluster: Creating...

Error: cannot create cluster: NumWorkers could be 0 only for SingleNode clusters. See https://docs.databricks.com/clusters/single-node.html for more details

with databricks_cluster.modeling_cluster,
on main.tf line 57, in resource "databricks_cluster" "modeling_cluster":
57: resource "databricks_cluster" "modeling_cluster" {}}

This error seems to be coming up again; this was working last weekend and is now failing.


For now I am doing small-scale testing, so I will bump it to 1, but num_workers was 0 before and was working great. I will come back to update if it works.
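
For anyone else hitting the NumWorkers error: the shape the provider docs describe for single node clusters is roughly the following, a minimal sketch where the cluster name, runtime version, and node type are placeholders. The singleNode profile in spark_conf plus the ResourceClass custom tag are what mark the 0-worker cluster as intentional.

resource "databricks_cluster" "single_node" {
  cluster_name            = "single-node"       # placeholder name
  spark_version           = "7.3.x-scala2.12"   # placeholder runtime
  node_type_id            = "Standard_DS3_v2"
  autotermination_minutes = 20
  num_workers             = 0

  spark_conf = {
    "spark.databricks.cluster.profile" = "singleNode"
    "spark.master"                     = "local[*]"
  }

  custom_tags = {
    "ResourceClass" = "SingleNode"
  }
}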

@JeremyDemlow

What I am actually trying to do here is install libraries, but it doesn't seem clear how to install libraries on a databricks_job cluster.

I had tried

library {
  pypi {
    ...
  }
}

and it said that wasn't allowed, which seems a little odd.

job_cluster - (Optional) A list of job databricks_cluster specifications that can be shared and reused by tasks of this job. Libraries cannot be declared in a shared job cluster. You must declare dependent libraries in task settings. Multi-task syntax

This is fine, but there isn't really a good way to see how this is done effectively, and my DB technical rep said that a Docker image base would lead to too much complexity.

@alexott
Contributor

alexott commented May 31, 2023

Libraries are installed for jobs differently than on clusters - look into the specific task definition. That's how Workflows are defined.
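
A minimal sketch of what that looks like in the provider, assuming a multi-task job; the job name, notebook path, and package are placeholders:

resource "databricks_job" "this" {
  name = "example-job"   # placeholder name

  job_cluster {
    job_cluster_key = "shared"
    new_cluster {
      spark_version = "11.3.x-scala2.12"   # placeholder runtime
      node_type_id  = "Standard_DS3_v2"
      num_workers   = 1
    }
  }

  task {
    task_key        = "main"
    job_cluster_key = "shared"

    notebook_task {
      notebook_path = "/Repos/Project/Notebook"   # placeholder path
    }

    # libraries are declared per task, not on the shared job cluster
    library {
      pypi {
        package = "mlflow"   # placeholder package
      }
    }
  }
}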

@JeremyDemlow

JeremyDemlow commented May 31, 2023

@alexott

Here is a simple way to do it on a cluster (see the resource block further down). If you have an example that supports this in a job, that would be amazing, because none of the documentation I have seen shows that process happening in the job_cluster / new_cluster flow, even though it would be essentially the same thing on the back end as creating a standalone interactive cluster. That makes working with Databricks jobs more complicated. Additionally, the docs say the following:

spark_submit_task Configuration Block
You can invoke Spark submit tasks only on new clusters. In the new_cluster specification, libraries and spark_conf are not supported. Instead, use --jars and --py-files to add Java and Python libraries and --conf to set the Spark configuration. By default, the Spark submit job uses all available memory (excluding reserved memory for Databricks services). You can set --driver-memory, and --executor-memory to a smaller value to leave some room for off-heap usage. Please use spark_jar_task, spark_python_task or notebook_task wherever possible.

This seems like a lot of overkill to get an install working on a cluster when it is supported. Yes, passing a Docker image to the cluster seems like the easier way to do this overall and keep things stable, but that takes a little more setup than I have time for currently; I will eventually move in that direction.

Ideally I would have a Docker image in my repo that gets used when setting up the job cluster, but I have not seen that done anywhere yet.

resource "databricks_cluster" "shared_autoscaling" {
  cluster_name            = "Shared Autoscaling"
  spark_version           = data.databricks_spark_version.latest_lts.id
  node_type_id            = data.databricks_node_type.smallest.id
  autotermination_minutes = 20
  autoscale {
    min_workers = 1
    max_workers = 2
  }
  dynamic "library" {
    for_each = toset(var.listOfMavenPackages)
    content {
      maven {
        coordinates = library.value
      }
    }
  }
}

I know how to do this when it's not a job cluster. The docs don't show how to approach it.
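
For the Docker route mentioned above, the databricks_cluster resource does accept a docker_image block; a rough sketch, assuming Databricks Container Services is enabled on the workspace and using a placeholder registry URL and credential variables:

resource "databricks_cluster" "custom_image" {
  cluster_name            = "custom-image"   # placeholder name
  spark_version           = data.databricks_spark_version.latest_lts.id
  node_type_id            = data.databricks_node_type.smallest.id
  autotermination_minutes = 20
  num_workers             = 1

  docker_image {
    url = "myregistry.azurecr.io/databricks-runtime:latest"   # placeholder image
    basic_auth {
      username = var.registry_username   # placeholder variables
      password = var.registry_password
    }
  }
}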

@JeremyDemlow

@everyone

For any of you wondering what to do here with jobs, as this isn't extremely obvious:

 dynamic "task" {
    for_each = var.task_lists

    content {
      task_key = task.value.task_resource_name

      dynamic "pipeline_task" {
        for_each = task.value.task_type == "data" ? [1]: []
        content {
          pipeline_id = task.value.task_resource_id
        }
      }

      dynamic "notebook_task" {
        for_each = task.value.task_type == "notebook" ? [1] : []
        content {
          notebook_path = task.value.task_resource_id
        }
      }

      dynamic "depends_on" {
        for_each = var.task_dependency[task.value.task_resource_name] == "" ? [] : [var.task_dependency[task.value.task_resource_name]]
        content {
          task_key = var.task_dependency[task.value.task_resource_name]
        }
      }
      dynamic "library" {
        for_each = toset(var.list_of_R_packages)
        content {
          cran {
            package = library.value
          }
        }
      }
      
      job_cluster_key = task.value.task_type == "notebook" ? var.job_cluster_key : null
    }
  }

This is how I eventually figured it out. The variable is just a list of R packages.

Ideally, if I remember, I'll come back and show what we did for the Docker image, but I'm moving forward with the project for now.

This is what it looks like in the Terraform plan output:

task {
          + job_cluster_key  = "Cluster"
          + retry_on_timeout = (known after apply)
          + task_key         = "EXECUTE"

          + depends_on {
              + task_key = "00_SayHello"
            }

          + library {
              + cran {
                  + package = "Robyn"
                }
            }
          + library {
              + cran {
                  + package = "patchwork"
                }
            }

          + notebook_task {
              + notebook_path = "/Repos/GITHUB/TestProject/FakePath/Fake_File"
            }
        }

Happy coding everyone - it seems like single node is working again today as well.
