
docs: update conf about hadoop and create index #3846

Merged · 5 commits · Apr 15, 2024
20 changes: 8 additions & 12 deletions docs/en/deploy/conf.md
@@ -24,11 +24,11 @@
#--request_max_retry=3
# Configure the request timeout in milliseconds, the default is 12 seconds
#--request_timeout_ms=12000
# Configure the retry interval when the request is unreachable, generally do not need to be modified, in milliseconds
# Configure the retry interval when the request is unreachable, generally does not need to be modified, in milliseconds
#--request_sleep_time=1000
# Configure the zookeeper session timeout in milliseconds
--zk_session_timeout=10000
# Configure the zookeeper health check interval, the unit is milliseconds, generally do not need to be modified
# Configure the zookeeper health check interval, the unit is milliseconds, generally does not need to be modified
#--zk_keep_alive_check_interval=15000
# Configure the timeout period for tablet heartbeat detection in milliseconds, the default is 1 minute. If the tablet is still unreachable after this time, the nameserver considers the tablet unavailable and will take the node offline
--tablet_heartbeat_timeout=60000
@@ -265,7 +265,6 @@ spark.default.conf=
spark.eventLog.dir=
spark.yarn.maxAppAttempts=1
batchjob.jar.path=
namenode.uri=
offline.data.prefix=file:///tmp/openmldb_offline_storage/
hadoop.conf.dir=
#enable.hive.support=false
@@ -305,10 +304,7 @@ Spark tasks are executed locally on TaskManager deployment machine. Please note
- `offline.data.prefix` is set by default as `file:///tmp/openmldb_offline_storage/`, which is on TaskManager deployment machine. This can be set to other locations as required.
- **Before starting TaskManager**, HDFS path can be configured by setting environment variable `HADOOP_CONF_DIR` to Hadoop configuration directory (Note: it is the environment variable, not the configuration item). The directory needs to include `core-site.xml`, `hdfs-site.xml` configuration files. For more information, refer to [Spark documentation](https://spark.apache.org/docs/3.2.1/configuration.html#inheriting-hadoop-cluster-configuration).

```{note}
Currently, `namenode.uri` needs to be configured. When deleting an offline table, HDFS FileSystem `namenode.uri` will be connected, and offline table path will be deleted. This item will be discarded in future updates.
```
- batchjob path `batchjob.jar.path` can be set automativally. It can be configured to other paths.
- batchjob path `batchjob.jar.path` can be set automatically. It can be configured to other paths.
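
For example, a minimal sketch of the environment-variable approach described in this list (the bracketed values are placeholders for your own paths):

```bash
# Run on the TaskManager host, in the same session that starts TaskManager
cd <openmldb_deploy_root>                 # placeholder: OpenMLDB deployment root
export HADOOP_CONF_DIR=<hadoop_conf_dir>  # placeholder: directory containing core-site.xml, hdfs-site.xml
bash bin/start.sh start taskmanager
```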

```{seealso}
If Hadoop/Yarn requires Kerberos authentication, refer to [FAQ](../faq.md).
@@ -321,21 +317,21 @@

- `spark.yarn.jars` Configures the Spark jars for Yarn. It has to be an `hdfs://` path. You can upload the `jars` directory from the [OpenMLDB Spark distribution](../../tutorial/openmldbspark_distribution.md) to HDFS, and set it as `hdfs://<hdfs_path>/jars/*`. If not set, Yarn will package and distribute `$SPARK_HOME/jars`, and will do so [for each offline task](https://spark.apache.org/docs/3.2.1/running-on-yarn.html#preparations). Therefore, we recommend setting it.

- `batchjob.jar.path` It has to be a HDFS path. Upload batchjob jar to HDFS and configure the respective path. Make sure that all Workers in Yarn cluster have access to the batchjob jar.
- `batchjob.jar.path` It has to be an HDFS path. Upload batchjob jar to HDFS and configure the respective path. Make sure that all Workers in Yarn cluster have access to the batchjob jar.

- `offline.data.prefix` It has to be a HDFS path. Make sure that all Workers in Yarn cluster have access. Use the environment variable `HADOOP_CONF_DIR`.
- `offline.data.prefix` It has to be an HDFS path. Make sure that all Workers in Yarn cluster have access. Use the environment variable `HADOOP_CONF_DIR`.
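
Putting these items together, a hypothetical `conf/taskmanager.properties` fragment for `yarn` mode might look like the sketch below (every `hdfs://` path and the jar name are placeholders, not defaults):

```
spark.master=yarn
spark.yarn.jars=hdfs://<hdfs_path>/jars/*
batchjob.jar.path=hdfs://<hdfs_path>/<batchjob_jar_name>.jar
offline.data.prefix=hdfs://<hdfs_path>/openmldb_offline_storage/
```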


##### `yarn-client` Mode

[Driver executes locally](https://spark.apache.org/docs/3.2.1/running-on-yarn.html#launching-spark-on-yarn)and executor execute on the Yarn cluster. Configurations are the same as `yarn-cluster`.
[Driver executes locally](https://spark.apache.org/docs/3.2.1/running-on-yarn.html#launching-spark-on-yarn), and the executor executes on the Yarn cluster. Configurations are the same as `yarn-cluster`.

#### `spark.default.conf`

Format is `key=value`, use `;` to separate. For example:
The format is `key=value`, use `;` to separate. For example:

```
spark.default.conf=spark.executor.instances=2;spark.executor.memory=2g;spark.executor.cores=2
```
Same affect as `--conf`. For more configurations, refer to [Spark documentation](https://spark.apache.org/docs/3.1.2/configuration.html).
Same effect as `--conf`. For more configurations, refer to [Spark documentation](https://spark.apache.org/docs/3.1.2/configuration.html).

2 changes: 1 addition & 1 deletion docs/en/openmldb_sql/ddl/CREATE_INDEX_STATEMENT.md
@@ -1,6 +1,6 @@
# CREATE INDEX

The `CREATE INDEX` statement is used to create a new index on existing table. Running `CREATE INDEX` will initiates an asynchronous job, and you can check the status of the job by executing `SHOW JOBS FROM NAMESERVER`.
The `CREATE INDEX` statement is used to create a new index on an existing table. Running `CREATE INDEX` initiates an asynchronous job, and you can check the status of the job by executing `SHOW JOBS FROM NAMESERVER`. Please note that the index is not available until the asynchronous task is completed, and any scenarios that require the new index will fail.

## Syntax

71 changes: 34 additions & 37 deletions docs/zh/deploy/conf.md
@@ -270,7 +270,6 @@ spark.default.conf=
spark.eventLog.dir=
spark.yarn.maxAppAttempts=1
batchjob.jar.path=
namenode.uri=
offline.data.prefix=file:///tmp/openmldb_offline_storage/
hadoop.conf.dir=
hadoop.user.name=
@@ -308,52 +307,24 @@ TaskManager only accepts `local` and its variants, `yarn`, `yarn-cluster`, and `yarn-client`
##### local Mode

In local mode, Spark tasks run locally (on the host where TaskManager is deployed). This mode does not require much configuration; only two points need attention:
- The offline table storage path `offline.data.prefix` defaults to `file:///tmp/openmldb_offline_storage/`, i.e. the `/tmp` directory of the host where TaskManager is deployed; you can change this configuration to another directory.
- It can be configured as an HDFS path. If configured as an HDFS path, the variables `hadoop.conf.dir` and `hadoop.user.name` need to be configured correctly, where `hadoop.conf.dir` is the directory containing the Hadoop configuration files (note that this directory is on the TaskManager node; it should contain Hadoop configuration files such as `core-site.xml` and `hdfs-site.xml`; for more, see the [Spark documentation](https://spark.apache.org/docs/3.2.1/configuration.html#inheriting-hadoop-cluster-configuration)), and `hadoop.user.name` is the user Hadoop runs as. These two variables can be configured in one of the following three ways:
1. Configure `hadoop.conf.dir` and `hadoop.user.name` in the `conf/taskmanager.properties` configuration file
2. Configure the environment variables `HADOOP_CONF_DIR` and `HADOOP_USER_NAME` (on the TaskManager node) **before starting TaskManager**
3. Copy the Hadoop configuration files (`core-site.xml`, `hdfs-site.xml`, etc.) into the `{spark.home}/conf` directory
> An sbin deployment cannot pass through unspecified variables; currently TaskManager only passes the environment variables `SPARK_HOME` and `RUNNER_JAVA_HOME`. So for sbin deployments, prefer the first method.
>
> If you use the second method, it is best to make the environment variables `HADOOP_CONF_DIR` and `HADOOP_USER_NAME` permanent. If you do not want them to be permanent, you can set them temporarily in a session and then start TaskManager, for example:
> ```bash
> cd <OpenMLDB deployment root directory>
> export HADOOP_CONF_DIR=<replace with the Hadoop configuration directory>
> export HADOOP_USER_NAME=<replace with the Hadoop user name>
> bash bin/start.sh start taskmanager
> ```
>
> For the scope in which environment variables take effect, see <a href="#about-config-env">Understanding the relationship between configuration items and environment variables</a>
```{note}
Currently, `namenode.uri` needs to be configured for HDFS paths. When an offline table is deleted, the HDFS FileSystem at `namenode.uri` is connected to and the offline table's storage directory (Offline Table Path) is deleted. This configuration item will be deprecated in the future.
```
- The offline table storage path `offline.data.prefix` defaults to `file:///tmp/openmldb_offline_storage/`, i.e. the `/tmp` directory of the host where TaskManager is deployed. If TaskManager is moved to another machine, the data cannot be migrated automatically, so `file://` is not recommended when deploying TaskManager on multiple machines. It can be configured as an HDFS path, which requires configuring the variables `hadoop.conf.dir` and `hadoop.user.name`; see [Hadoop-related configuration](#hadoop相关配置) for details.

- The batchjob path `batchjob.jar.path` can be obtained automatically and does not need to be configured. If you want to use a batchjob located elsewhere, you can configure this parameter.
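
As a rough sketch of the HDFS setup described above, the relevant `conf/taskmanager.properties` items could look like the following (all values are placeholders for your cluster's own):

```
offline.data.prefix=hdfs://<hdfs_path>/openmldb_offline_storage/
hadoop.conf.dir=<hadoop_conf_dir>
hadoop.user.name=<hadoop_user>
```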

```{seealso}
If Hadoop/Yarn requires Kerberos authentication, refer to the [FAQ](../faq/client_faq.md#如何配置taskmanager来访问开启kerberos的yarn集群).
```


##### yarn/yarn-cluster Mode
"yarn" and "yarn-cluster" are the same mode: Spark tasks run on the Yarn cluster. This mode requires more configuration, mainly including:
- Correctly configure the variables `hadoop.conf.dir` and `hadoop.user.name`, where `hadoop.conf.dir` is the directory containing the Hadoop and Yarn configuration files (note that this directory is on the TaskManager node; it should contain Hadoop configuration files such as `core-site.xml`, `hdfs-site.xml`, and `yarn-site.xml`; refer to the [Spark documentation](https://spark.apache.org/docs/3.2.1/running-on-yarn.html#launching-spark-on-yarn)), and `hadoop.user.name` is the user Hadoop runs as. These two variables can be configured in one of the following three ways:
1. Configure `hadoop.conf.dir` and `hadoop.user.name` in the `conf/taskmanager.properties` configuration file
2. Configure the environment variables `HADOOP_CONF_DIR` and `HADOOP_USER_NAME` (on the TaskManager node) **before starting TaskManager**
3. Copy the Hadoop and Yarn configuration files (`core-site.xml`, `hdfs-site.xml`, etc.) into the `{spark.home}/conf` directory
> An sbin deployment cannot pass through unspecified variables; currently TaskManager only passes the environment variables `SPARK_HOME` and `RUNNER_JAVA_HOME`. So for sbin deployments, prefer the first method.
>
> If you use the second method, it is best to make the environment variables `HADOOP_CONF_DIR` and `HADOOP_USER_NAME` permanent. If you do not want them to be permanent, you can set them temporarily in a session and then start TaskManager, for example:
> ```bash
> cd <OpenMLDB deployment root directory>
> export HADOOP_CONF_DIR=<replace with the Hadoop configuration directory>
> export HADOOP_USER_NAME=<replace with the Hadoop user name>
> bash bin/start.sh start taskmanager
> ```
>
> For the scope in which environment variables take effect, see <a href="#about-config-env">Understanding the relationship between configuration items and environment variables</a>

- yarn mode must connect to a Hadoop cluster; the Hadoop-related variables `hadoop.conf.dir` and `hadoop.user.name` need to be configured properly. See [Hadoop-related configuration](#hadoop相关配置) for details.

The HDFS referenced by the following configurations usually belongs to the same Hadoop cluster as Yarn; otherwise, only `hdfs://` addresses that can be reached directly can be used.

- `spark.yarn.jars` configures the location of the Spark runtime jars that Yarn needs to read; it must be an `hdfs://` address. You can upload the `jars` directory of the extracted [OpenMLDB Spark distribution](../../tutorial/openmldbspark_distribution.md) to HDFS and configure it as `hdfs://<hdfs_path>/jars/*` (note the wildcard). [If this parameter is not configured, Yarn packages and distributes `$SPARK_HOME/jars`, and does so for every offline task](https://spark.apache.org/docs/3.2.1/running-on-yarn.html#preparations), which is inefficient, so configuring it is recommended.
- `batchjob.jar.path` must be an HDFS path (including the jar file name). Upload the batchjob jar to HDFS and configure the corresponding address, making sure that all Workers in the Yarn cluster can access the batchjob package.
- `offline.data.prefix` must be an HDFS path, making sure that all Workers in the Yarn cluster can read and write data. The Hadoop cluster address from the previously configured environment variable `HADOOP_CONF_DIR` should be used.
- `offline.data.prefix` must be an HDFS path, making sure that all Workers in the Yarn cluster can read and write data.
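
For reference, a hypothetical yarn-mode fragment of `conf/taskmanager.properties` combining the items above (all values are placeholders):

```
spark.master=yarn
hadoop.conf.dir=<hadoop_conf_dir>
hadoop.user.name=<hadoop_user>
spark.yarn.jars=hdfs://<hdfs_path>/jars/*
batchjob.jar.path=hdfs://<hdfs_path>/<batchjob_jar_name>.jar
offline.data.prefix=hdfs://<hdfs_path>/openmldb_offline_storage/
```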

##### yarn-client Mode

@@ -366,3 +337,29 @@ In local mode, Spark tasks run locally (on the host where TaskManager is deployed), this mo…
spark.default.conf=spark.executor.instances=2;spark.executor.memory=2g;spark.executor.cores=2
```
Equivalent to Spark's `--conf` parameter. If you are prompted to modify advanced Spark parameters, add them to this item. For more parameters, refer to the [Spark configuration](https://spark.apache.org/docs/3.1.2/configuration.html).

#### Hadoop-related Configuration

`hadoop.conf.dir` and `hadoop.user.name` are TaskManager configurations. They are passed to the Spark Job when TaskManager submits it, which is equivalent to setting the environment variables `HADOOP_CONF_DIR` and `HADOOP_USER_NAME` before the Spark Job is created.

Configuration item details:

- `hadoop.conf.dir` is the directory containing the Hadoop and Yarn configuration files (note that this directory is on the TaskManager node; it should contain Hadoop configuration files such as `core-site.xml`, `hdfs-site.xml`, and `yarn-site.xml`; refer to the [Spark documentation](https://spark.apache.org/docs/3.2.1/running-on-yarn.html#launching-spark-on-yarn)).
- `hadoop.user.name` is the user Hadoop runs as.
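
A minimal sketch of the `conf/taskmanager.properties` form (values are placeholders):

```
# Equivalent to exporting HADOOP_CONF_DIR / HADOOP_USER_NAME before the Spark job is created
hadoop.conf.dir=<hadoop_conf_dir>
hadoop.user.name=<hadoop_user>
```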

Essentially this is configuring environment variables; for the scope in which they take effect, see <a href="#about-config-env">Understanding the relationship between configuration items and environment variables</a>. If you have special requirements, you can bypass the TaskManager configuration and set the environment variables in another way. However, it is recommended not to mix approaches; using only one approach is easier to debug.

Note that an sbin deployment cannot pass through unspecified variables; currently TaskManager only passes the environment variables `SPARK_HOME` and `RUNNER_JAVA_HOME`. So for sbin deployments, prefer the TaskManager configuration method.

Other configuration methods:
- Copy the Hadoop and Yarn configuration files (`core-site.xml`, `hdfs-site.xml`, etc.) into the `{spark.home}/conf` directory.

- The environment variables already exist on the TaskManager node, or configure the environment variables `HADOOP_CONF_DIR` and `HADOOP_USER_NAME` **before manually starting TaskManager**.
> Similar to the following steps:
> ```bash
> cd <OpenMLDB deployment root directory>
> export HADOOP_CONF_DIR=<replace with the Hadoop configuration directory>
> export HADOOP_USER_NAME=<replace with the Hadoop user name>
> bash bin/start.sh start taskmanager
> ```
> Note that starting remotely via ssh may lose environment variables; it is recommended to export them right before starting to make sure they are set correctly.
2 changes: 1 addition & 1 deletion docs/zh/openmldb_sql/ddl/CREATE_INDEX_STATEMENT.md
@@ -1,6 +1,6 @@
# CREATE INDEX

The `CREATE INDEX` statement is used to create an index. Adding an index starts an asynchronous task to load data; the task status can be checked by executing `SHOW JOBS FROM NAMESERVER`
The `CREATE INDEX` statement is used to create an index. Adding an index starts an asynchronous task to load data; the task status can be checked by executing `SHOW JOBS FROM NAMESERVER`. Please note that the index is not available until the asynchronous task is completed, and any scenario that requires the new index will fail before then.

## Syntax

16 changes: 13 additions & 3 deletions docs/zh/quickstart/function_boundary.md
@@ -10,10 +10,18 @@

Configuring TaskManager determines, among other things, the offline storage path `offline.data.prefix` and the Spark mode `spark.master` used for offline job computation.

`offline.data.prefix`: can be configured as a file path or an HDFS path. An HDFS path is recommended for production environments; a local file path can be configured for test environments (specifically onebox setups, e.g. started inside a Docker container). Using a file path as offline storage cannot support distributed deployment with multiple TaskManagers (TaskManagers do not transfer data between each other). If you want to deploy TaskManager on multiple hosts, use storage media that multiple machines can access simultaneously, such as HDFS. If you want to test multiple TaskManagers working together, you can deploy multiple TaskManagers on one host, in which case a file path can be used as offline storage.
`offline.data.prefix`: can be configured as a file path or an HDFS path. An HDFS path is recommended for production environments; a local file path can be configured for test environments (specifically onebox setups, e.g. started inside a Docker container, or with all components on one machine). This is because TaskManager submits a local Spark Job, and that Job can access the local file path. However, this cannot support distributed deployment with multiple TaskManagers (TaskManagers do not transfer data between each other).

- If you want to deploy TaskManager on multiple hosts, use storage media that multiple machines can access simultaneously, such as HDFS.

- If you want to test multiple TaskManagers working together, you can deploy multiple TaskManagers on one host, in which case a file path can be used as offline storage.

`spark.master=local[*]`: Spark is configured to `local[*]` mode by default (automatically binds to the number of CPU cores). If offline tasks are slow, yarn mode is recommended; restart TaskManager after changing the configuration for it to take effect. For more configurations, refer to [master-urls](https://spark.apache.org/docs/3.1.2/submitting-applications.html#master-urls).
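
For instance, switching the offline engine to yarn mode is a one-line change in `conf/taskmanager.properties` (a sketch; it takes effect only after restarting TaskManager and assumes the Hadoop/Yarn-related items are configured as well):

```
# Default is local[*]; yarn mode also needs the Hadoop/Yarn configuration described in the deployment docs
spark.master=yarn
```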

### Configuration Updates

For TaskManager, all configurations other than `spark.default.conf` require a restart to take effect. TaskManager is stateless, so as long as it is not being accessed, restarting it has no side effects. If you only need to temporarily change configurations related to offline commands, you do not need to update TaskManager's configuration; you can use the [temporary Spark configuration](#临时spark配置) approach to adjust the configuration for a single offline task only.

### spark.default.conf

More optional configurations can be written in the `spark.default.conf` parameter, in the format `k1=v1;k2=v2`. For example:
@@ -26,7 +34,7 @@ spark.default.conf=spark.port.maxRetries=32;foo=bar

### Temporary Spark Configuration

See the [client Spark configuration file](../reference/client_config/client_spark_config.md); the CLI supports temporarily changing the Spark configuration without restarting TaskManager. However, this configuration method cannot change configurations such as spark.master.
See the [client Spark configuration file](../reference/client_config/client_spark_config.md); the CLI supports temporarily changing the Spark configuration without restarting TaskManager. However, this configuration method cannot change configurations such as spark.master; it can only change configuration items in `spark.default.conf`.

## DDL Boundary: the DEPLOY Statement

@@ -71,7 +79,9 @@

`LOAD DATA` is an offline job regardless of whether it imports into online or offline storage. The format rules for source data are the same for offline and online imports.

HDFS files are recommended as the source data; they can be imported regardless of whether TaskManager is in local/yarn mode or running on another host. If the source data is a local file, whether it can be imported smoothly depends on the TaskManager mode and the host it runs on.
HDFS files are recommended as the source data; they can be imported regardless of whether TaskManager is in local/yarn mode or running on another host.

If the source data is a local file, whether it can be imported smoothly depends on the TaskManager mode and the host it runs on:

- When TaskManager is in local mode, the import can succeed only if the source data is placed on the host of the TaskManager process.
- When TaskManager is in yarn (client and cluster) mode, since it is not known which host the running container is on, a file path cannot be used as the source data address.