docs: update conf about hadoop and create index (#3846)
* docs: update conf about hadoop and create index

* update en and fix

---------

Co-authored-by: Siqi Wang <sqwang9999@gmail.com>
vagetablechicken and Elliezza authored Apr 15, 2024
1 parent 737f294 commit d447e46
Showing 5 changed files with 88 additions and 68 deletions.
65 changes: 39 additions & 26 deletions docs/en/deploy/conf.md
@@ -24,11 +24,11 @@
#--request_max_retry=3
# Configure the request timeout in milliseconds, the default is 12 seconds
#--request_timeout_ms=12000
# Configure the retry interval when the request is unreachable, generally do not need to be modified, in milliseconds
# Configure the retry interval when the request is unreachable, generally does not need to be modified, in milliseconds
#--request_sleep_time=1000
# Configure the zookeeper session timeout in milliseconds
--zk_session_timeout=10000
# Configure the zookeeper health check interval, the unit is milliseconds, generally do not need to be modified
# Configure the zookeeper health check interval, the unit is milliseconds, generally does not need to be modified
#--zk_keep_alive_check_interval=15000
# Configure the timeout period for tablet heartbeat detection in milliseconds, the default is 1 minute. If the tablet is still unreachable after this time, the nameserver considers the tablet unavailable and will take the node offline
--tablet_heartbeat_timeout=60000
@@ -265,7 +265,6 @@ spark.default.conf=
spark.eventLog.dir=
spark.yarn.maxAppAttempts=1
batchjob.jar.path=
namenode.uri=
offline.data.prefix=file:///tmp/openmldb_offline_storage/
hadoop.conf.dir=
#enable.hive.support=false
@@ -275,6 +274,7 @@ hadoop.conf.dir=

Some of the important configurations for Spark Config are as follows:

<a id="about-config-env"></a>
```{note}
Understand the relationships between configurations and environment variables.
@@ -295,47 +295,60 @@ With one-click deployment, SPARK_HOME will be set as `<package_home>/spark`. For

`spark.master` configures Spark modes, more information can be found at [Spark Master URL](https://spark.apache.org/docs/latest/submitting-applications.html#master-urls).



TaskManager only allows `local` and its variants, `yarn`, `yarn-cluster` and `yarn-client` modes. The default mode is `local[*]`, which is multi-process local mode (the number of threads equals the number of CPU cores). Spark cluster `spark://`, Mesos cluster `mesos://` and Kubernetes `k8s://` cluster modes are currently not supported.
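
For orientation, a minimal sketch of how the mode is selected in `conf/taskmanager.properties` (the commented alternatives are illustrative, not values shipped in any particular release):

```
# conf/taskmanager.properties -- illustrative sketch
# default: multi-process local mode, one worker thread per CPU core
spark.master=local[*]
# alternatives (require the Hadoop/Yarn configuration described below):
#spark.master=yarn
#spark.master=yarn-client
```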

##### `local` Mode

Spark tasks are executed locally on the TaskManager deployment machine. Please note the following:
- `offline.data.prefix` is set by default as `file:///tmp/openmldb_offline_storage/`, which is on the TaskManager deployment machine. This can be set to other locations as required.
- **Before starting TaskManager**, an HDFS path can be configured by setting the environment variable `HADOOP_CONF_DIR` to the Hadoop configuration directory (note: it is the environment variable, not the configuration item). The directory needs to include the `core-site.xml` and `hdfs-site.xml` configuration files. For more information, refer to the [Spark documentation](https://spark.apache.org/docs/3.2.1/configuration.html#inheriting-hadoop-cluster-configuration).
The local mode means that the Spark task runs on the local machine (where the TaskManager is located). In this mode, not many configurations are required, but two points should be noted:
- The storage location of offline tables `offline.data.prefix` is set to `file:///tmp/openmldb_offline_storage/` by default, which refers to the `/tmp` directory on the TaskManager's machine. If the TaskManager is moved to another machine, the data cannot be automatically migrated. It is not recommended to use `file://` when deploying multiple TaskManagers on different machines. You can configure it as an HDFS path, and you need to configure the variables `hadoop.conf.dir` and `hadoop.user.name`. For more details, see [Hadoop-related configurations](#hadoop-related-configurations).

```{note}
Currently, `namenode.uri` needs to be configured. When deleting an offline table, HDFS FileSystem `namenode.uri` will be connected, and offline table path will be deleted. This item will be discarded in future updates.
```
- The batchjob path `batchjob.jar.path` can be set automatically. It can be configured to other paths.
- The path of the batchjob `batchjob.jar.path` can be automatically obtained and does not need to be configured. If you want to use a batchjob from elsewhere, you can configure this parameter.

```{seealso}
If Hadoop/Yarn requires Kerberos authentication, refer to [FAQ](../faq.md).
If Hadoop/Yarn requires Kerberos authentication, refer to the [Client FAQ](../faq/client_faq.md).
```
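
As a hedged illustration of the local-mode options above, a `conf/taskmanager.properties` fragment that moves offline storage to HDFS might look like this (the HDFS address, directories, and user name are placeholders, not values from this repository):

```
# conf/taskmanager.properties -- local mode with HDFS-backed offline storage (illustrative)
spark.master=local[*]
# placeholder HDFS location; keep file:///tmp/openmldb_offline_storage/ for purely local storage
offline.data.prefix=hdfs://namenode.example.com:9000/openmldb_offline_storage/
# directory on the TaskManager node containing core-site.xml and hdfs-site.xml (placeholder path)
hadoop.conf.dir=/etc/hadoop/conf
# Hadoop user used to read and write the offline path (placeholder)
hadoop.user.name=openmldb
# left empty so TaskManager locates the bundled batchjob jar automatically
batchjob.jar.path=
```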

##### `yarn/yarn-cluster` Mode
"yarn" and "yarn-cluster" are the same mode, where Spark tasks run on a Yarn cluster. This mode requires several configurations, including:

- The yarn mode must connect to a Hadoop cluster and requires the proper configuration of Hadoop variables `hadoop.conf.dir` and `hadoop.user.name`. For more details, refer to [Hadoop-related configurations](#hadoop-related-configurations).

`yarn` and `yarn-cluster` are the same mode, where Spark tasks execute on the Yarn cluster. The configuration items are:
The following configurations usually require an HDFS that belongs to the same Hadoop cluster as Yarn, unless a direct `hdfs://` address can be used.

- **Before starting TaskManager**, configure the environment variable `HADOOP_CONF_DIR` to the Hadoop and Yarn configuration directory. The directory should include `core-site.xml` and `hdfs-site.xml` for Hadoop, and `yarn-site.xml` for Yarn. For more details, refer to the [Spark documentation](https://spark.apache.org/docs/3.2.1/running-on-yarn.html#launching-spark-on-yarn).
- The `spark.yarn.jars` configuration specifies the location of Spark runtime JAR files that Yarn needs to read. It must be an `hdfs://` address. You can upload the `jars` directory from the [OpenMLDB Spark distribution](../../tutorial/openmldbspark_distribution.md) to HDFS and configure it as `hdfs://<hdfs_path>/jars/*` (note the wildcard). [If this parameter is not configured, Yarn will package and distribute `$SPARK_HOME/jars` for each offline task, which is inefficient](https://spark.apache.org/docs/3.2.1/running-on-yarn.html#preparations). Therefore, it is recommended to configure this parameter.
- `batchjob.jar.path` must be an HDFS path (specific to the package name). Upload the batch job JAR file to HDFS and configure it with the corresponding address to ensure that all workers in the Yarn cluster can access the batch job package.
- `offline.data.prefix` must be an HDFS path to ensure that all workers in the Yarn cluster can read and write data.

- `spark.yarn.jars` configures the Spark jars for Yarn. It has to be an `hdfs://` path. You can upload the `jars` directory from the [OpenMLDB Spark distribution](../../tutorial/openmldbspark_distribution.md) to HDFS, and set it as `hdfs://<hdfs_path>/jars/*`. If not set, Yarn will package and distribute `$SPARK_HOME/jars`, and will do so [for each offline task](https://spark.apache.org/docs/3.2.1/running-on-yarn.html#preparations). Therefore, we recommend setting it.
##### `yarn-client` Mode

- `batchjob.jar.path` has to be an HDFS path. Upload the batchjob jar to HDFS and configure the respective path. Make sure that all Workers in the Yarn cluster have access to the batchjob jar.
[The driver executes locally](https://spark.apache.org/docs/3.2.1/running-on-yarn.html#launching-spark-on-yarn), and the executors execute on the Yarn cluster. Configurations are the same as `yarn-cluster`.

- `offline.data.prefix` has to be an HDFS path. Make sure that all Workers in the Yarn cluster have access. Use the environment variable `HADOOP_CONF_DIR`.
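
Putting the yarn-mode items above together, a hedged sketch of the relevant `conf/taskmanager.properties` entries could look as follows (every `hdfs://` address, path, and user name is a placeholder):

```
# conf/taskmanager.properties -- yarn(-cluster) mode (illustrative values only)
spark.master=yarn
# Hadoop/Yarn client configuration available on the TaskManager node (placeholder path)
hadoop.conf.dir=/etc/hadoop/conf
hadoop.user.name=openmldb
# Spark runtime jars uploaded to HDFS once; note the trailing wildcard
spark.yarn.jars=hdfs://namenode.example.com:9000/openmldb/spark-jars/*
# batchjob jar uploaded to HDFS so every Yarn worker can fetch it
batchjob.jar.path=hdfs://namenode.example.com:9000/openmldb/openmldb-batchjob.jar
# offline data must be readable and writable by all Yarn workers
offline.data.prefix=hdfs://namenode.example.com:9000/openmldb_offline_storage/
```
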
#### spark.default.conf

`spark.default.conf` configures Spark parameters in the format of `key=value`. Multiple configurations are separated by `;`, for example:

##### `yarn-client` Mode
#### Hadoop-related configurations

[The driver executes locally](https://spark.apache.org/docs/3.2.1/running-on-yarn.html#launching-spark-on-yarn), and the executors execute on the Yarn cluster. Configurations are the same as `yarn-cluster`.
`hadoop.conf.dir` and `hadoop.user.name` are configurations for TaskManager. They will be passed to the Spark Job when TaskManager submits the job, which is equivalent to configuring the environment variables `HADOOP_CONF_DIR` and `HADOOP_USER_NAME` before creating the Spark Job.

#### `spark.default.conf`
Details of the configurations:

Format is `key=value`, use `;` to separate. For example:
- `hadoop.conf.dir` represents the directory where Hadoop and Yarn configuration files are located (note that this directory is on the TaskManager node; the file directory should include Hadoop's `core-site.xml`, `hdfs-site.xml`, `yarn-site.xml`, and other configuration files, refer to the [Spark official documentation](https://spark.apache.org/docs/3.2.1/running-on-yarn.html#launching-spark-on-yarn)).
- `hadoop.user.name` represents the Hadoop user.

```
spark.default.conf=spark.executor.instances=2;spark.executor.memory=2g;spark.executor.cores=2
```
Same effect as `--conf`. For more configurations, refer to the [Spark documentation](https://spark.apache.org/docs/3.1.2/configuration.html).
Essentially, it configures environment variables, and the scope of their effect is explained in <a href="#about-config-env">Understanding the Relationship Between Configurations and Environment Variables</a>. If there are special requirements, it is possible to bypass the configuration in TaskManager and configure the environment variables in other ways. However, it is recommended not to mix the two methods and use only one method for easier debugging.

Please note that unspecified variables cannot be passed in sbin deployment. Currently, TaskManager only receives the environment variables `SPARK_HOME` and `RUNNER_JAVA_HOME`. Therefore, if you are using sbin deployment, it is recommended to use the TaskManager configuration file.

Other configuration methods:
- Copy Hadoop and Yarn configuration files (`core-site.xml`, `hdfs-site.xml`, etc.) to the `{spark.home}/conf` directory.

- If there are existing environment variables on the TaskManager node, or **before manually starting TaskManager**, configure the environment variables `HADOOP_CONF_DIR` and `HADOOP_USER_NAME`.
> Similar to the following steps:
> ```bash
> cd <openmldb installation directory>
> export HADOOP_CONF_DIR=<replace with the Hadoop configuration directory>
> export HADOOP_USER_NAME=<replace with the Hadoop username>
> bash bin/start.sh start taskmanager
> ```
> Note that SSH remote startup may lose environment variables, so it is recommended to export them correctly before starting.
2 changes: 1 addition & 1 deletion docs/en/openmldb_sql/ddl/CREATE_INDEX_STATEMENT.md
@@ -1,6 +1,6 @@
# CREATE INDEX

The `CREATE INDEX` statement is used to create a new index on an existing table. Running `CREATE INDEX` will initiate an asynchronous job, and you can check the status of the job by executing `SHOW JOBS FROM NAMESERVER`.
The `CREATE INDEX` statement is used to create a new index on an existing table. Running `CREATE INDEX` initiates an asynchronous job, and you can check the status of the job by executing `SHOW JOBS FROM NAMESERVER`. Please note that the index is not available until the asynchronous task is completed, and any scenarios that require the new index will fail.
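
For example, a hedged sketch of the typical workflow (the table, columns, and option values below are hypothetical; see the Syntax section for the authoritative grammar):

```sql
-- create a new index on an existing table (names and options are illustrative)
CREATE INDEX idx_c2 ON demo_table (c2) OPTIONS (ts=c4, ttl_type=absolute, ttl=30d);
-- the statement returns immediately; check the asynchronous job status here until it completes
SHOW JOBS FROM NAMESERVER;
```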

## Syntax

71 changes: 34 additions & 37 deletions docs/zh/deploy/conf.md
@@ -270,7 +270,6 @@ spark.default.conf=
spark.eventLog.dir=
spark.yarn.maxAppAttempts=1
batchjob.jar.path=
namenode.uri=
offline.data.prefix=file:///tmp/openmldb_offline_storage/
hadoop.conf.dir=
hadoop.user.name=
@@ -308,52 +307,24 @@ TaskManager only accepts `local` and its variants, `yarn`, `yarn-cluster`, `yarn-client
##### local Mode

In local mode, the Spark task runs locally (on the host where the TaskManager is located). This mode does not require much configuration; only two points need attention:
- The offline table storage location `offline.data.prefix` defaults to `file:///tmp/openmldb_offline_storage/`, i.e. the `/tmp` directory on the TaskManager host. You can change this configuration to another directory.
- It can be configured as an HDFS path. If so, the variables `hadoop.conf.dir` and `hadoop.user.name` must be configured correctly, where `hadoop.conf.dir` is the directory containing the Hadoop configuration files (note that this directory is on the TaskManager node; it should contain Hadoop configuration files such as `core-site.xml` and `hdfs-site.xml`; see the [Spark documentation](https://spark.apache.org/docs/3.2.1/configuration.html#inheriting-hadoop-cluster-configuration) for more), and `hadoop.user.name` is the user Hadoop runs as. These two variables can be configured in one of the following three ways:
  1. Configure the variables `hadoop.conf.dir` and `hadoop.user.name` in the `conf/taskmanager.properties` configuration file.
  2. Configure the environment variables `HADOOP_CONF_DIR` and `HADOOP_USER_NAME` (on the TaskManager node) **before starting TaskManager**.
  3. Copy the Hadoop configuration files (`core-site.xml`, `hdfs-site.xml`, etc.) into the `{spark.home}/conf` directory.
  > An sbin deployment cannot pass through unspecified variables; currently TaskManager only passes the environment variables `SPARK_HOME` and `RUNNER_JAVA_HOME`. So for an sbin deployment, prefer the first method.
  >
  > If you use the second method, the environment variables `HADOOP_CONF_DIR` and `HADOOP_USER_NAME` should preferably be made permanent. If you do not want them to be permanent, you can set them temporarily in a session and then start TaskManager, for example:
  > ```bash
  > cd <openmldb deployment root directory>
  > export HADOOP_CONF_DIR=<replace with the Hadoop configuration directory>
  > export HADOOP_USER_NAME=<replace with the Hadoop username>
  > bash bin/start.sh start taskmanager
  > ```
  >
  > For the scope in which the environment variables take effect, refer to <a href="#about-config-env">Understanding the Relationship Between Configurations and Environment Variables</a>
```{note}
HDFS paths currently require `namenode.uri` to be configured. When an offline table is dropped, the HDFS FileSystem `namenode.uri` is connected and the offline table's storage directory (Offline Table Path) is deleted. This configuration item will be deprecated in the future.
```
- The offline table storage location `offline.data.prefix` defaults to `file:///tmp/openmldb_offline_storage/`, i.e. the `/tmp` directory on the TaskManager host. If the TaskManager is moved to another machine, the data cannot be migrated automatically, so using `file://` is not recommended when deploying TaskManagers on multiple machines. It can be configured as an HDFS path, which requires configuring the variables `hadoop.conf.dir` and `hadoop.user.name`; see [Hadoop-related configurations](#hadoop相关配置) for details.

- The batchjob path `batchjob.jar.path` can be obtained automatically and does not need to be configured. If you want to use a batchjob from somewhere else, you can configure this parameter.

```{seealso}
If Hadoop/Yarn requires Kerberos authentication, refer to the [FAQ](../faq/client_faq.md#如何配置taskmanager来访问开启kerberos的yarn集群).
```

##### yarn/yarn-cluster Mode
"yarn" and "yarn-cluster" are the same mode, i.e. the Spark task runs on a Yarn cluster. This mode requires more configuration, mainly including:
- Correctly configure the variables `hadoop.conf.dir` and `hadoop.user.name`, where `hadoop.conf.dir` is the directory containing the Hadoop and Yarn configuration files (note that this directory is on the TaskManager node; it should contain Hadoop configuration files such as `core-site.xml`, `hdfs-site.xml`, and `yarn-site.xml`; refer to the [Spark documentation](https://spark.apache.org/docs/3.2.1/running-on-yarn.html#launching-spark-on-yarn)), and `hadoop.user.name` is the user Hadoop runs as. These two variables can be configured in one of the following three ways:
  1. Configure the variables `hadoop.conf.dir` and `hadoop.user.name` in the `conf/taskmanager.properties` configuration file.
  2. Configure the environment variables `HADOOP_CONF_DIR` and `HADOOP_USER_NAME` (on the TaskManager node) **before starting TaskManager**.
  3. Copy the Hadoop and Yarn configuration files (`core-site.xml`, `hdfs-site.xml`, etc.) into the `{spark.home}/conf` directory.
  > An sbin deployment cannot pass through unspecified variables; currently TaskManager only passes the environment variables `SPARK_HOME` and `RUNNER_JAVA_HOME`. So for an sbin deployment, prefer the first method.
  >
  > If you use the second method, the environment variables `HADOOP_CONF_DIR` and `HADOOP_USER_NAME` should preferably be made permanent. If you do not want them to be permanent, you can set them temporarily in a session and then start TaskManager, for example:
  > ```bash
  > cd <openmldb deployment root directory>
  > export HADOOP_CONF_DIR=<replace with the Hadoop configuration directory>
  > export HADOOP_USER_NAME=<replace with the Hadoop username>
  > bash bin/start.sh start taskmanager
  > ```
  >
  > For the scope in which the environment variables take effect, refer to <a href="#about-config-env">Understanding the Relationship Between Configurations and Environment Variables</a>

- The yarn mode must connect to a Hadoop cluster, and the Hadoop-related variables `hadoop.conf.dir` and `hadoop.user.name` must be configured properly; see [Hadoop-related configurations](#hadoop相关配置) for details.

The HDFS in the following configurations usually belongs to the same Hadoop cluster as Yarn; otherwise, only a directly reachable `hdfs://` address can be used.

- `spark.yarn.jars` configures the location of the Spark runtime jars that Yarn needs to read; it must be an `hdfs://` address. You can upload the `jars` directory of the unpacked [OpenMLDB Spark distribution](../../tutorial/openmldbspark_distribution.md) to HDFS and configure it as `hdfs://<hdfs_path>/jars/*` (note the wildcard). [If this parameter is not configured, Yarn will package and upload `$SPARK_HOME/jars`, and will do so for every offline task](https://spark.apache.org/docs/3.2.1/running-on-yarn.html#preparations), which is inefficient, so configuring it is recommended.
- `batchjob.jar.path` must be an HDFS path (down to the jar file name). Upload the batchjob jar to HDFS and configure the corresponding address, so that all Workers in the Yarn cluster can obtain the batchjob package.
- `offline.data.prefix` must be an HDFS path, so that all Workers in the Yarn cluster can read and write data. It should use the Hadoop cluster address from the previously configured environment variable `HADOOP_CONF_DIR`.
- `offline.data.prefix` must be an HDFS path, so that all Workers in the Yarn cluster can read and write data.

##### yarn-client Mode

@@ -366,3 +337,29 @@ In local mode, the Spark task runs locally (on the host where the TaskManager is located)
spark.default.conf=spark.executor.instances=2;spark.executor.memory=2g;spark.executor.cores=2
```
Equivalent to Spark's `--conf` parameter. If you are prompted to modify advanced Spark parameters, add them to this item. For more parameters, refer to the [Spark configuration](https://spark.apache.org/docs/3.1.2/configuration.html).

#### Hadoop-related Configurations

`hadoop.conf.dir` and `hadoop.user.name` are TaskManager configurations. They are passed to the Spark Job when TaskManager submits the job, which is equivalent to configuring the environment variables `HADOOP_CONF_DIR` and `HADOOP_USER_NAME` before creating the Spark Job.

Configuration details:

- `hadoop.conf.dir` is the directory containing the Hadoop and Yarn configuration files (note that this directory is on the TaskManager node; it should contain Hadoop configuration files such as `core-site.xml`, `hdfs-site.xml`, and `yarn-site.xml`; refer to the [Spark documentation](https://spark.apache.org/docs/3.2.1/running-on-yarn.html#launching-spark-on-yarn)).
- `hadoop.user.name` is the user Hadoop runs as.

Essentially this configures environment variables; for the scope in which they take effect, refer to <a href="#about-config-env">Understanding the Relationship Between Configurations and Environment Variables</a>. If you have special requirements, you can bypass the TaskManager configuration and set the environment variables in other ways. However, it is recommended not to mix the two approaches; using only one method makes debugging easier.

Note that an sbin deployment cannot pass through unspecified variables; currently TaskManager only receives the environment variables `SPARK_HOME` and `RUNNER_JAVA_HOME`. So for an sbin deployment, prefer the TaskManager configuration file.

Other configuration methods:
- Copy the Hadoop and Yarn configuration files (`core-site.xml`, `hdfs-site.xml`, etc.) into the `{spark.home}/conf` directory.

- Use environment variables that already exist on the TaskManager node, or configure the environment variables `HADOOP_CONF_DIR` and `HADOOP_USER_NAME` **before manually starting TaskManager**.
  > Similar to the following steps:
  > ```bash
  > cd <openmldb deployment root directory>
  > export HADOOP_CONF_DIR=<replace with the Hadoop configuration directory>
  > export HADOOP_USER_NAME=<replace with the Hadoop username>
  > bash bin/start.sh start taskmanager
  > ```
  > Note that starting remotely over ssh may lose environment variables, so it is recommended to export them before starting to make sure they are set.
2 changes: 1 addition & 1 deletion docs/zh/openmldb_sql/ddl/CREATE_INDEX_STATEMENT.md
@@ -1,6 +1,6 @@
# CREATE INDEX

The `CREATE INDEX` statement is used to create an index. Adding an index initiates an asynchronous job to load data; you can check the job status by executing `SHOW JOBS FROM NAMESERVER`.
The `CREATE INDEX` statement is used to create an index. Adding an index initiates an asynchronous job to load data; you can check the job status by executing `SHOW JOBS FROM NAMESERVER`. Please note that the index is not available until the asynchronous job completes, and scenarios that require the new index will fail.

## Syntax
