resolve conflict

qianyun210603 committed Nov 27, 2023
2 parents f90db1f + 98f569e commit 0397d2f
Showing 17 changed files with 764 additions and 359 deletions.
4 changes: 2 additions & 2 deletions docs/component/data.rst
@@ -52,7 +52,7 @@ Also, ``Qlib`` provides a high-frequency dataset. Users can run a high-frequency
Qlib Format Dataset
-------------------
``Qlib`` provides an off-the-shelf dataset in `.bin` format; users can use the script ``scripts/get_data.py`` to download the China-Stock dataset as follows. Users can also load the `.bin` files with numpy to validate the data.
- The price volume data look different from the actual dealling price because of they are **adjusted** (`adjusted price <https://www.investopedia.com/terms/a/adjusted_closing_price.asp>`_). And then you may find that the adjusted price may be different from different data sources. This is because different data sources may vary in the way of adjusting prices. Qlib normalize the price on first trading day of each stock to 1 when adjusting them.
+ The price volume data look different from the actual dealing price because they are **adjusted** (`adjusted price <https://www.investopedia.com/terms/a/adjusted_closing_price.asp>`_). You may also notice that the adjusted price differs across data sources, because sources vary in how they adjust prices. When adjusting prices, Qlib normalizes each stock's price on its first trading day to 1.
Users can leverage `$factor` to get the original trading price (e.g. `$close / $factor` to get the original close price).
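The `.bin` files can be inspected with numpy as mentioned above. The sketch below is an illustration only: it assumes the commonly described Qlib layout (little-endian float32 values, with the first value being the start index into the trading calendar) and uses made-up prices and `$factor` values rather than real downloaded data.

```python
import io

import numpy as np

# Build a tiny in-memory ".bin" payload in the assumed Qlib layout:
# [start_index, value_0, value_1, ...] as little-endian float32.
start_index = 0.0
adjusted_close = np.array([1.0, 1.02, 0.99, 1.05], dtype="<f4")
buf = io.BytesIO()
buf.write(np.hstack([[start_index], adjusted_close]).astype("<f4").tobytes())

# Read it back the way one would read a real features/<symbol>/close.*.bin file.
raw = np.frombuffer(buf.getvalue(), dtype="<f4")
first_day, values = int(raw[0]), raw[1:]

# Recover the original (un-adjusted) close via $factor, as described above
# (`$close / $factor`); this factor series is an invented example.
factor = np.array([0.5, 0.5, 0.5, 0.5], dtype="<f4")
original_close = values / factor
```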

Here are some discussions about the price adjusting of Qlib.
@@ -146,7 +146,7 @@ Users can also provide their own data in CSV format. However, the CSV data **mus
| SH600000 | 120 |
+-----------+-------+

- - CSV file **must** includes a column for the date, and when dumping the data, user must specify the date column name. Here is an example:
+ - CSV file **must** include a column for the date, and when dumping the data, the user must specify the date column name. Here is an example:

.. code-block:: bash
81 changes: 81 additions & 0 deletions scripts/data_collector/baostock_5min/README.md
@@ -0,0 +1,81 @@
## Collector Data

### Get Qlib data(`bin file`)

- get data: `python scripts/get_data.py qlib_data`
- parameters:
  - `target_dir`: save dir, by default *~/.qlib/qlib_data/cn_data_5min*
  - `version`: dataset version, value from [`v2`], by default `v2`
    - `v2` end date is *2022-12*
  - `interval`: `5min`
  - `region`: `hs300`
  - `delete_old`: delete existing data from `target_dir` (*features, calendars, instruments, dataset_cache, features_cache*), value from [`True`, `False`], by default `True`
  - `exists_skip`: if `target_dir` data already exists, skip `get_data`, value from [`True`, `False`], by default `False`
- examples:
```bash
# hs300 5min
python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/hs300_data_5min --region hs300 --interval 5min
```

### Collect *Baostock high frequency* data to qlib
> Collect *Baostock high frequency* data and *dump* it into `qlib` format.
> If the above ready-made data can't meet users' requirements, users can follow this section to crawl the latest data and convert it to qlib-data.
1. download data to csv: `python scripts/data_collector/baostock_5min/collector.py download_data`

This will download the raw data (date, symbol, open, high, low, close, volume, amount, adjustflag) from Baostock to a local directory, one file per symbol.
- parameters:
  - `source_dir`: directory in which to save the downloaded data
  - `interval`: `5min`
  - `region`: `HS300`
  - `start`: start datetime, by default *None*
  - `end`: end datetime, by default *None*
- examples:
```bash
# cn 5min data
python collector.py download_data --source_dir ~/.qlib/stock_data/source/hs300_5min_original --start 2022-01-01 --end 2022-01-30 --interval 5min --region HS300
```
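As an illustration of the per-symbol output described above, a downloaded CSV could be read back with pandas like this (the rows are invented sample values, not real Baostock data):

```python
from io import StringIO

import pandas as pd

# Invented two-row example of one per-symbol file; columns follow the
# description above (date, symbol, open, high, low, close, volume, amount, adjustflag).
csv_text = """date,symbol,open,high,low,close,volume,amount,adjustflag
2022-01-04 09:35:00,sh.600000,8.61,8.63,8.60,8.62,1200300,10345000,3
2022-01-04 09:40:00,sh.600000,8.62,8.66,8.62,8.65,980100,8472000,3
"""
df = pd.read_csv(StringIO(csv_text), parse_dates=["date"])
```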
2. normalize data: `python scripts/data_collector/baostock_5min/collector.py normalize_data`

This will:
1. Normalize the high, low, close and open prices using adjclose.
2. Rescale the high, low, close and open prices so that the first valid trading date's close price is 1.
- parameters:
  - `source_dir`: csv directory
  - `normalize_dir`: result directory
  - `interval`: `5min`
    > if **`interval == 5min`**, `qlib_data_1d_dir` cannot be `None`
  - `region`: `HS300`
  - `date_field_name`: column *name* identifying time in csv files, by default `date`
  - `symbol_field_name`: column *name* identifying symbol in csv files, by default `symbol`
  - `end_date`: if not `None`, normalize data up to and including *end_date*; if `None`, this parameter is ignored; by default `None`
  - `qlib_data_1d_dir`: qlib directory (1d data). If `interval==5min`, `qlib_data_1d_dir` cannot be `None`: normalizing 5min data requires the 1d data.
    ```bash
    # qlib_data_1d can be obtained like this:
    python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/cn_data --interval 1d --region cn --version v3
    ```
- examples:
```bash
# normalize 5min cn
python collector.py normalize_data --qlib_data_1d_dir ~/.qlib/qlib_data/cn_data --source_dir ~/.qlib/stock_data/source/hs300_5min_original --normalize_dir ~/.qlib/stock_data/source/hs300_5min_nor --region HS300 --interval 5min
```
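The two normalization steps above can be sketched in a few lines of pandas. This is a minimal illustration with made-up numbers, not the collector's actual implementation:

```python
import pandas as pd

# Invented price data for one symbol; adjclose is an example
# split/dividend-adjusted close series.
df = pd.DataFrame({
    "open":     [10.0, 10.2, 10.1],
    "high":     [10.3, 10.4, 10.2],
    "low":      [ 9.9, 10.0, 10.0],
    "close":    [10.1, 10.3, 10.1],
    "adjclose": [20.2, 20.6, 20.2],
})
price_cols = ["open", "high", "low", "close"]

# Step 1: adjust raw prices by the adjclose/close ratio.
ratio = df["adjclose"] / df["close"]
df[price_cols] = df[price_cols].mul(ratio, axis=0)

# Step 2: rescale so the first valid close price equals 1.
first_close = df["close"].dropna().iloc[0]
df[price_cols] = df[price_cols] / first_close
```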
3. dump data: `python scripts/dump_bin.py dump_all`
This will convert the normalized CSV files into numpy arrays and store them in the `feature` directory, one file per column and one subdirectory per symbol.
- parameters:
  - `csv_path`: stock data path or directory, i.e. the **normalize result (`normalize_dir`)**
  - `qlib_dir`: qlib (dump) data directory
  - `freq`: transaction frequency, by default `day`
    > `freq_map = {1d: day, 5min: 5min}`
  - `max_workers`: number of threads, by default *16*
  - `include_fields`: fields to dump, by default `""`
  - `exclude_fields`: fields not to dump, by default `""`
    > dump_fields = `include_fields if include_fields else set(symbol_df.columns) - set(exclude_fields) if exclude_fields else symbol_df.columns`
  - `symbol_field_name`: column *name* identifying symbol in csv files, by default `symbol`
  - `date_field_name`: column *name* identifying time in csv files, by default `date`
- examples:
```bash
# dump 5min cn
python dump_bin.py dump_all --csv_path ~/.qlib/stock_data/source/hs300_5min_nor --qlib_dir ~/.qlib/qlib_data/hs300_5min_bin --freq 5min --exclude_fields date,symbol
```
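Conceptually, the dump step stores one directory per symbol and one float32 `.bin` file per dumped column. The hand-rolled sketch below mirrors that layout and the field-selection rule quoted above; all paths, file names and data are illustrative assumptions, not the real `dump_bin.py` code.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Invented normalized data for one symbol.
symbol_df = pd.DataFrame({
    "date": pd.date_range("2022-01-04 09:35", periods=3, freq="5min"),
    "symbol": "sh600000",
    "open": [1.0, 1.01, 1.02],
    "close": [1.01, 1.02, 1.00],
})

# Field selection following the dump_fields rule quoted above
# (here include_fields is empty and exclude_fields drops date/symbol).
include_fields, exclude_fields = set(), {"date", "symbol"}
dump_fields = (
    include_fields
    if include_fields
    else set(symbol_df.columns) - exclude_fields
    if exclude_fields
    else set(symbol_df.columns)
)

# One subdirectory per symbol, one little-endian float32 file per column.
qlib_dir = Path(tempfile.mkdtemp())
symbol_dir = qlib_dir / "features" / "sh600000"
symbol_dir.mkdir(parents=True)
for field in dump_fields:
    (symbol_dir / f"{field}.5min.bin").write_bytes(
        symbol_df[field].to_numpy(dtype="<f4").tobytes()
    )
```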