Cell Processing Workflow
Workflow for Processing Worldclim Variables to CouchDB and AppEngine Datastores
This document provides a broad view of the process to prepare Worldclim for access via the API described in https://github.com/eightysteele/Spatial-Data-Library/wiki/REST-API.
Python 2.6 or later.
GEOS - Geometry Engine - Open Source 3.1.1 or later. Note that starspan up to version 1.2.06 does not compile with GEOS 3.3.0 or later.
GDAL - Geospatial Data Abstraction Library 1.2.6 or later. Includes ogr2ogr for geometry clipping and is required to compile starspan.
starspan 1.0.08f is needed to extract the statistics from Worldclim rasters clipped by cell polygons.
CouchDB Python Package 0.8 or later.
Google AppEngine 1.5.3 or later.
##Processing Script: sdl.py
The workhorse for bulkloading is https://github.com/eightysteele/Spatial-Data-Library/blob/master/sdl/sdl.py. This script is set up to process one Worldclim tile (see http://www.worldclim.org/tiles.php) or any part of a Worldclim tile with one command-line call. Run the following to see the command-line option help:
./sdl.py --help
The basic command to execute the entire bulkloading process for a single tile is `-c full`. All other commands are subsets of the process that can be used for convenience or testing. Here's what a full processing call looks like:
./sdl.py -n 120 -u http://localhost:5984 -d worldclim -g ~/Spatial-Data-Library/data/worldclim/Terrestrial-10min-unbuffered-dissolved.shp -b 50000 -k 411 -v ~/Data/SDL/worldclim/411 -w ~/Data/SDL/workspace/Tile411-6 -f 178.66,-49.5 -t 178.84,-49.84 -c full -l none > Tile411-6starspan.log &
- `-k` or `--key` - number of the tile to load. Tile numbering follows that given on the Worldclim download page (http://www.worldclim.org/tiles.php), with values from 00 to 411.
- `-v` or `--vardir` - directory where the Worldclim variables will be downloaded and processed, in this example, ~/Data/SDL/worldclim/411. The script uses curl to get the Worldclim files for all of the variables at 30-second resolution and extracts them into the given directory.
- `-w` or `--workspace` - workspace directory where temporary files used in processing are stored, including the optional log file. If a tile is being split into sections to avoid processing large areas of ocean where there are no data, choose a distinct workspace directory for each section.
- `-g` or `--gadm` - path to the file to use for clipping. We constructed the file `Terrestrial-10min-unbuffered-dissolved.shp` by first polygonizing, then dissolving the Worldclim tmean6 layer (http://biogeo.ucdavis.edu/data/climate/worldclim/1_4/grid/cur/tmean_10m_bil.zip) at 10 minute resolution. The resulting clipping file is one multipolygon including all 10-minute cells having data. Clipping with this file ensures that all cells having data are processed while keeping the complexity of the clipping layer to a minimum to speed processing. Cells around the borders of this low-resolution clipping layer can still have no values, but these are discarded in the `starspancsvdir2couchcsvs()` function before loading to the data store.
- `-u` or `--couchurl` - URL for the CouchDB data repository where cell data will be stored for retrieval on demand by App Engine.
- `-d` or `--database` - database name on CouchDB.
- `-n` or `--cells-per-degree` - number of cells per degree of longitude at the equator, which determines the resolution of the overall grid pattern. This option sets the latitudinal dimension of the cell grid in degrees, while the longitudinal dimension of cells varies with latitude to maintain constant area across all cells on the globe. The value to create cells with a resolution comparable to Worldclim (30 seconds) at the equator is 120. The default geodetic model for the grid is the WGS84 ellipsoid.
- `-b` or `--batchsize` - batch size: the number of cells to process before sending the results to CouchDB in a batch. The higher the number, the more efficient the loading process. In practice we use batch sizes of 50000.
- `-f` or `--nwcorner` - coordinate (lng,lat) of the northwest corner of the bounding box to process. To process an entire tile, use the coordinates of the bounding box of the tile (e.g., -120,0 -90,-30 for tile 32). Processing some tiles can be greatly optimized by running `sdl.py` one or more times with bounding boxes inside the tile that include only terrestrial area. For example, Tile 32 can be processed quickly by running `sdl.py` twice, once for each small bounding box within the tile including only the area of the islands in that region.
- `-t` or `--secorner` - coordinate (lng,lat) of the southeast corner of the bounding box, used in tandem with the `-f` option above.
- `-l` or `--logfile` - log file in the workspace directory (given by the `-w` option) in which to store processing messages. Some commands within the `sdl.py` script (such as starspan) are executed as subprocess calls whose output goes to stdout rather than to the log file. Redirect the output to a file if you want to store and review this output after processing. The value `none` can be supplied to log to the console instead of to a file.
- `--config_file` - YAML file containing the App Engine Entity configuration for bulkloading.
- `--url` - URL endpoint of the App Engine /remote_api to bulkload to.
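To make the effect of the `-n` option concrete, here is a minimal sketch of the cell dimensions it implies. The latitudinal size follows directly from the description above. The longitudinal width uses a simple spherical equal-area approximation, which is an assumption for illustration only: `sdl.py` uses the WGS84 ellipsoid, and all function names here are hypothetical.

```python
import math


def cell_height_degrees(cells_per_degree):
    """Latitudinal size of one cell, in degrees."""
    return 1.0 / cells_per_degree


def cell_height_arcseconds(cells_per_degree):
    """Latitudinal size of one cell, in arc-seconds."""
    return 3600.0 / cells_per_degree


def cell_width_degrees_spherical(cells_per_degree, south_lat):
    """Longitudinal width (degrees) of a cell whose southern edge is at
    south_lat, chosen so its area equals that of a cell at the equator.

    Spherical approximation only (not the WGS84 ellipsoid): the area of a
    cell is proportional to width * (sin(north_lat) - sin(south_lat)).
    """
    height = math.radians(cell_height_degrees(cells_per_degree))
    south = math.radians(south_lat)
    equatorial_band = math.sin(height) - math.sin(0.0)
    band = math.sin(south + height) - math.sin(south)
    return cell_height_degrees(cells_per_degree) * equatorial_band / band


if __name__ == "__main__":
    print(cell_height_arcseconds(120))
    print(cell_width_degrees_spherical(120, 0.0))
    print(cell_width_degrees_spherical(120, 60.0))
```

With `-n 120` the cell height is 30 arc-seconds, matching the Worldclim resolution, and under this approximation cells at 60 degrees latitude are roughly twice as wide in longitude as cells at the equator.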
##sdl.py commands
As mentioned above under the `-c` command line argument, part or all of the cell processing workflow can be executed for a given Tile or section of a Tile. Following are summaries of the different commands available:
- `prepareworkspace` - Checks the workspace directory provided in the `-w` argument to see that it is empty. Aborts if it is not, to avoid overwriting previously processed data. Otherwise creates and checks that the workspace directory exists.
- `getworldclimtile` - Downloads the Worldclim 30-second zipped generic grid files for the Tile given by the `-k` argument into the directory specified by the `-v` argument, unzips them, and removes the zip file.
- `cliptileonly` - Creates a new clipping shapefile called `[k]-clipped`, where `k` is the value provided in the `-k` argument, in the workspace given by the `-w` argument. The shapefile is the intersection of the bounding box provided in the `-f` and `-t` arguments and the shapefile representing the area having data given by the `-g` argument. The resulting clip file is used to reduce the area processed to only those areas having data at the 10-minute resolution of Worldclim.
- `batchcells2shapes` - Prepares the workspace directory as in the `prepareworkspace` command, clips the bounding box to the area having data as in the `cliptileonly` command, creates shapefiles in batches of cells given by the `-n` argument, and clips the batches to the area in the bounding box having data. Resulting shapefiles of clipped batches of cells are stored in the `/batches` subdirectory of the workspace directory given by the `-w` argument.
- `starspan` - Extracts statistics on variables in the Worldclim tile for the batches of cells in the `/batches` subdirectory of the workspace directory given in the `-w` argument. Creates one csv file containing extracted statistics for every batch shapefile in the `/batches` subdirectory.
- `starspan2couch` - Processes the avg statistic from starspan-produced csv files in the `/batches` subdirectory of the workspace directory into csv files containing rows with a cell key and a document with all variables for each cell. These csv files are stored in a `/forcouch` subdirectory of the given workspace, cover the range of cells in the batch, and contain only cells having data, determined by checking that the values of alt (Altitude), bio12 (Annual Precipitation), and tmax1 (Maximum Temperature, January) are not all equal to 0. The cell, document format of the csv files is needed for processing to load to either CouchDB or App Engine.
- `tilesection2couchcsvs` - Does everything that the command `batchcells2shapes` does, followed by the processing achieved by executing the `starspan` and `starspan2couch` commands. Use this command after the `getworldclimtile` command to completely process a Tile or section of a Tile up to the point of having csv files ready for loading to CouchDB and App Engine from the `/forcouch` subdirectory of the given workspace directory.
- `couchfromcsvs` - Loads the cells from all of the csv files in the `/forcouch` subdirectory of the given workspace directory into the CouchDB database given by the `-d` argument on the server given by the `-u` argument.
- `tilecsvs2appengine` - Bulkloads all CSV files for a Tile to an App Engine datastore. Assumes all files for loading are in locations matching `Tile[nn-i]/forcouch/*.csv`, where `nn` is the tile key and `i` is a tile section run number. The workspace argument `-w` should be the directory containing these Tile section subdirectories. Parameters for the App Engine upload must be provided through a YAML config file pointed to by the `--config_file` argument. The location of the App Engine instance to load must be provided in the `--url` argument.
- `full` - Does everything that the command `tilesection2couchcsvs` does, preceded by getting the given Worldclim tile data, and followed by the processing achieved by executing the `couchfromcsvs` command. This command should be used only for processing a complete tile all the way to CouchDB. It does not load to App Engine. To do so, run the `tilecsvs2appengine` command after running this command.
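The "cells having data" rule used by `starspan2couch` can be sketched as follows. This is an illustration under assumptions, not the code in `sdl.py`: the function names, the cell key values, and the dictionary layout of a cell's variables are hypothetical. Only the rule itself (alt, bio12, and tmax1 not all equal to 0) comes from the description above.

```python
import json

# Variables checked to decide whether a cell has data (from the text above).
KEY_VARIABLES = ("alt", "bio12", "tmax1")


def has_data(variables):
    """Return True unless alt, bio12, and tmax1 are all equal to 0."""
    return not all(float(variables[name]) == 0 for name in KEY_VARIABLES)


def to_couch_row(cell_key, variables):
    """Build a (cell key, JSON document) pair for one cell.

    The document layout is a hypothetical example, not the sdl.py format.
    """
    return cell_key, json.dumps(variables, sort_keys=True)


def cells_with_data(cells):
    """Yield (cell key, document) rows only for cells that have data."""
    for cell_key, variables in cells:
        if has_data(variables):
            yield to_couch_row(cell_key, variables)


if __name__ == "__main__":
    sample = [
        ("cell-a", {"alt": "0", "bio12": "0", "tmax1": "0"}),
        ("cell-b", {"alt": "12", "bio12": "0", "tmax1": "0"}),
    ]
    for row in cells_with_data(sample):
        print(row)
```

Note that a cell is kept as soon as any one of the three variables is non-zero, so a sea-level cell with real precipitation or temperature values is not discarded.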
##Typical workflow example - Tile01
Tile 01 of Worldclim covers the area within the bounding box defined by (-150, 90), (-120, 60). Tile 01 is typical in having large marine areas where Worldclim data have no values.
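The bounding box of a whole tile can be derived from its key, which is handy when choosing `-f` and `-t` values. The sketch below assumes 30 by 30 degree tiles whose key is a row digit (counting south from 90 degrees north) followed by a column number (counting east from 180 degrees west). That scheme is an assumption inferred from the Tile 01 bounding box above and the Tile 411 example earlier, and `tile_bounding_box` is a hypothetical helper, not part of `sdl.py`.

```python
TILE_SIZE_DEGREES = 30  # assumed tile size in both directions


def tile_bounding_box(key):
    """Return ((west, north), (east, south)) for a tile key such as '01'.

    Assumes the first character of the key is the row and the remaining
    characters are the column.
    """
    row = int(key[0])
    col = int(key[1:])
    west = -180 + col * TILE_SIZE_DEGREES
    north = 90 - row * TILE_SIZE_DEGREES
    nw_corner = (west, north)
    se_corner = (west + TILE_SIZE_DEGREES, north - TILE_SIZE_DEGREES)
    return nw_corner, se_corner


if __name__ == "__main__":
    nw, se = tile_bounding_box("01")
    # Print the corners in the lng,lat form used by the -f and -t options.
    print("-f %s,%s -t %s,%s" % (nw[0], nw[1], se[0], se[1]))
```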
Since the bottleneck in data processing is the speed of the variable statistics extraction via starspan (it can take several days to process an entire equatorial tile at a rate of roughly 50k cells per half hour), it is best to minimize the area processed by starspan. For Tile 01, this can be done by dividing up the Tile into three distinct sections around the three separate land masses in the Tile. First, get the Worldclim data with a command such as:
./sdl.py -c getworldclimtile -k 01 -v ~/Data/SDL/worldclim/01
This places the generic Worldclim raster layers for all variables in Tile 01 into the directory `Data/SDL/worldclim/01` in the current user's home directory.
Then process each of the sections of the Tile separately with commands such as:
./sdl.py -n 120 -g ~/Spatial-Data-Library/data/worldclim/Terrestrial-10min-unbuffered-dissolved.shp -b 50000 -k 01 -v ~/Data/SDL/worldclim/01 -w ~/Data/SDL/workspace/Tile01-1 -f -123.17,77.17 -t -120,75.66 -c tilesection2couchcsvs -l Tile01-1.log
./sdl.py -n 120 -g ~/Spatial-Data-Library/data/worldclim/Terrestrial-10min-unbuffered-dissolved.shp -b 50000 -k 01 -v ~/Data/SDL/worldclim/01 -w ~/Data/SDL/workspace/Tile01-2 -f -126.01,74.67 -t -120,71 -c tilesection2couchcsvs -l Tile01-2.log
./sdl.py -n 120 -g ~/Spatial-Data-Library/data/worldclim/Terrestrial-10min-unbuffered-dissolved.shp -b 50000 -k 01 -v ~/Data/SDL/worldclim/01 -w ~/Data/SDL/workspace/Tile01-3 -f -150,70.67 -t -120,60 -c tilesection2couchcsvs -l Tile01-3.log
Load the resulting cells into CouchDB:
./sdl.py -u http://localhost:5984 -d worldclim -w ~/Data/SDL/workspace/Tile01-1 -c couchfromcsvs
./sdl.py -u http://localhost:5984 -d worldclim -w ~/Data/SDL/workspace/Tile01-2 -c couchfromcsvs
./sdl.py -u http://localhost:5984 -d worldclim -w ~/Data/SDL/workspace/Tile01-3 -c couchfromcsvs
Then load all three tile section `/forcouch` directories for Tile01 into App Engine with one call:
./sdl.py -c tilecsvs2appengine -k 01 -w ~/Data/SDL/workspace -l none --config_file ~/Spatial-Data-Library/bulkloader/bulkloader.yaml --url http://localhost:8080/_ah/remote_api
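Before bulkloading, it can be useful to check which files the `Tile[nn-i]/forcouch/*.csv` layout will pick up. The following is a minimal sketch using only the standard library. `tile_csv_files` is a hypothetical helper, and it assumes section directories are named like `Tile01-1`, as in the workspace paths used above.

```python
import glob
import os


def tile_csv_files(workspace, tile_key):
    """List csv files under <workspace>/Tile<key>-<i>/forcouch/ for one tile."""
    pattern = os.path.join(
        os.path.expanduser(workspace),
        "Tile%s-*" % tile_key,   # one directory per tile section
        "forcouch",
        "*.csv",
    )
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    for path in tile_csv_files("~/Data/SDL/workspace", "01"):
        print(path)
```

An empty result suggests that the `-w` argument points at a single section directory rather than at the directory containing all of the sections.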
##CouchDB Utility Script: couchutils.py
This is an additional script to facilitate interactions with CouchDB, located at https://github.com/eightysteele/Spatial-Data-Library/blob/master/sdl/couchutils.py. In particular, the `bulkdeletedocs` command deletes all of the docs in the given view from the given CouchDB database. For example, here is a call to remove all of the documents in a view called temp/tile32:
python couchutils.py -c bulkdeletedocs -d worldclim -v temp/tile32 -u http://spatial.iriscouch.com:5984 -b 100000 &
Be sure that the view is fully constructed in CouchDB before running the `bulkdeletedocs` command.
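For readers curious how a batched delete of this kind can work, here is a minimal sketch of the batching step. It is not the code in `couchutils.py`. It relies on the fact that CouchDB's bulk document API treats a document carrying `_deleted: true` together with its current `_id` and `_rev` as a deletion. The input row shape (plain dictionaries with `id` and `rev` entries) and the function name are assumptions for illustration.

```python
def deletion_batches(rows, batch_size):
    """Group view rows into lists of deletion stubs of at most batch_size.

    Each row is assumed to be a dict with 'id' and 'rev' entries
    (a hypothetical shape chosen for this sketch).
    """
    batch = []
    for row in rows:
        batch.append({"_id": row["id"], "_rev": row["rev"], "_deleted": True})
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


if __name__ == "__main__":
    rows = [{"id": "cell-%d" % i, "rev": "1-abc"} for i in range(5)]
    for batch in deletion_batches(rows, 2):
        print(len(batch), batch[0]["_id"])
```

Each batch could then be sent to the server in one bulk request, for example with the `Database.update` method of the CouchDB Python package. Larger batches mean fewer requests at the cost of larger request bodies.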