---
description: "What is ASDF, why it is awesome, why you should and how to use it."
categories: ["technical"]
---

There is a project we are developing at source{d} named [Modelforge](https://github.com/src-d/modelforge).
Its goal is to abstract the serialization and retrieval of machine learning
models away from the user. That is, it solves the following two major problems
every ML engineer hits:

1. How to save a trained model and load it back.
2. How to reference and distribute trained models.

Let's not go off topic discussing (2) - there are many entertaining subproblems there
which deserve a dedicated blog post about Modelforge. Instead, the focus will be on
serialization. The Python crowd is used to solving (1) with
[`pickle`](https://docs.python.org/3/library/pickle.html) - the built-in
serialization module. It works nicely as long as these three conditions hold:

1. The origin of the files is trusted and secure.
2. The interoperability with other languages is not needed.
3. The serialized objects are not big.

As some know very well, `pickle` contains a full-featured virtual machine inside,
which is interpreted to recreate the serialized Python objects. The advantage
is that we can serialize literally any object, except for a few rare types
like mutexes or executable code. The implied drawbacks are substantial though:
we should not load a pickle from an unreliable source because it is vulnerable
to remote code execution attacks; the loader must implement a virtual machine,
which is hard to do in other programming languages; and the memory consumption
during serialization and deserialization can grow far beyond the original object size.
The latter is especially important for machine learning and data science:
the objects can grow very big, say, beyond a few hundred megabytes, and
pickling them is very slow and very demanding on free operating memory.
We witnessed 2x and even 3x memory consumption during
`pickle`-ing, which led to absurd situations where one has successfully computed
the data but cannot dump it to disk, so the whole effort becomes useless.

So when we started Modelforge, we searched for the best serialization format
which is **not** `pickle`.

## Life beyond `pickle`

Our best-shot requirements were:

1. Binary format. It is impossible to efficiently save a huge dense tensor in JSON or YAML.
2. At the same time, it would be nice to be able to save any JSON-like metadata without pain.
3. The schema is optional. We did not want to force people to write schema definitions - that is asking too much from ML researchers. Schema-ful formats keep the schema separate from the data and do not include
it in the file, which makes the file hard to introspect. Pythonistas take introspection for granted, to say the least.
4. The typical data size may span tens of gigabytes. Compared to gigabytes,
the overhead of including a schema is irrelevant indeed, so the files
become self-descriptive and the schema turns into a validation perk.
This requirement rules out Protocol Buffers completely, because
performance and memory problems can appear after 50MB already. It also excludes other
popular binary serialization formats because they typically target messaging and RPC.
5. Python should be a first-class citizen. numpy arrays should be serializable
without any additional code.
6. Yet the format should not be nailed to Python. The user-facing applications at source{d}
are written in Go, and otherwise integration would be hard.
7. On-the-fly compression. NLP models often contain strings which can take much
space while being perfectly compressible. Integers can be compressed too, since
we sometimes don't know their range beforehand and use 32 bits while only 16 are
really needed.

[HDF5](https://support.hdfgroup.org/HDF5/) was the closest match. It is binary,
the schema is optional, it supports big tensors, the Python bindings are mature and
well-integrated, and there are bindings for other languages. HDF5 is used in e.g. [Keras](https://github.com/keras-team/keras).
However, it is not ideal in terms of performance and its design is dated.
Read ["Moving away from HDF5"](https://cyrille.rossant.net/moving-away-hdf5/).

We previously had some positive experience with SQLite with SQLAlchemy on top,
but of course that option fails requirement (4): big data blobs.
There is a common workaround for (4): we can store huge tensors as external files.
This implies concatenating all the blocks together before uploading and splitting
them back after downloading, e.g. as a TAR or a ZIP without compression.
In turn, this means higher usage complexity, a doubled free disk space requirement,
vulnerability to data corruption, and an exploded number of open file descriptors.
TensorFlow follows that workaround, since Protocol Buffers are completely
unsuitable for storing hundreds of megabytes.

No more intrigue: we discovered ASDF.

## ASDF

[Advanced Scientific Data Format](https://github.com/spacetelescope/asdf) (ASDF)
is a next generation serialization format for scientific data. This means that
it focuses on storing sparse and dense tensors in an efficient way.
ASDF was created at the Space Telescope Science Institute in 2014. It is not implementation-driven: it is based on
the well-defined [standard](https://asdf-standard.readthedocs.io/en/latest/).
However, there is only one maintained software library, written in Python.
While historically ASDF targeted astronomers, it is actually abstracted away
from any scientific domain and is completely versatile. ASDF features:

* Transparent, automatic, on-the-fly compression and decompression with zlib, bzip2 or lz4.
* YAML header with binary blocks appended to the tail. All the perks of using YAML are preserved.
* Uncompressed tensors can be [memory mapped](https://en.wikipedia.org/wiki/Memory-mapped_file) so that the operating memory consumption is very low with tiny performance penalty for sequential read and write. A killer feature if your tensors are big.
* Data structure can be validated with YAML schemas.
* Python and numpy arrays are first-class citizens.
* The tag mechanism makes it easy to extend the format with new binary data types, though this is rarely needed in practice.
* The schema is there but is completely optional.

## Code examples

By default, ASDF reads tensors from disk lazily, upon the first reference.
Thus opening an ASDF file is very fast, and it is easy to quickly introspect.
The contents are always a Python `dict`, stored in the `tree` attribute.

```python
import asdf

with asdf.open("file.asdf") as f:
    print(f.tree)
```

Let's create a new ASDF file and see what's inside.

```python
import io

import asdf
import numpy

tensor = numpy.ones(10, dtype="int32")
buffer = io.BytesIO()
asdf.AsdfFile(tree={
    "tensor1": tensor,
    "tensor2": tensor,
    "meta": "data",
    "int": 100,
}).write_to(buffer)
print(buffer.getvalue().decode("utf-8", errors="backslashreplace"))
```

We should see something like this (binary bytes escaped):

```
#ASDF 1.0.0
#ASDF_STANDARD 1.2.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: Space Telescope Science Institute, homepage: 'http://github.com/spacetelescope/asdf',
name: asdf, version: 2.1.0}
history:
extensions:
- !core/extension_metadata-1.0.0
extension_class: asdf.extension.BuiltinExtension
software: {name: asdf, version: 2.1.0}
int: 100
meta: data
tensor1: !core/ndarray-1.0.0
source: 0
datatype: int32
byteorder: little
shape: [10]
tensor2: !core/ndarray-1.0.0
source: 0
datatype: int32
byteorder: little
shape: [10]
...
\xd3BLK0(((\x9dZ\x82\xde\xf2_\x8f\x83<B\xaa\xa4g\xde#ASDF BLOCK INDEX
%YAML 1.1
--- [615]
...
```

Tree items which are not binary - "meta" and "int" - have been written inline in YAML.
ASDF was intelligent enough to serialize only one copy of the array.
It is placed at the end of the file and forms a "block". `!core/ndarray-1.0.0`
is a tag name which identifies the built-in tensor data type, and the `source`
field references the block containing the array elements.

It's also very cool that the block offset index table is placed at the bottom
of the file, which makes it possible to append without rewriting the whole file.

Let's try with compression now.

```python
import io

import asdf
import numpy

tensor = numpy.zeros(1000000, dtype="int32")
buffer = io.BytesIO()
asdf.AsdfFile(tree={
    "tensor": tensor,
}).write_to(buffer, all_array_compression="zlib")
print(buffer.getvalue().decode("utf-8", errors="backslashreplace"))
```

The zlib algorithm is very efficient at compressing zeros, so the result is expected and awesome:

```
#ASDF 1.0.0
#ASDF_STANDARD 1.2.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: Space Telescope Science Institute, homepage: 'http://github.com/spacetelescope/asdf',
name: asdf, version: 2.1.0}
history:
extensions:
- !core/extension_metadata-1.0.0
extension_class: asdf.extension.BuiltinExtension
software: {name: asdf, version: 2.1.0}
tensor: !core/ndarray-1.0.0
source: 0
datatype: int32
byteorder: little
shape: [1000000]
...
 \xf7Om\xf0n::= \x90\xb7A\xf9\x84\xe3\xbb@\x8d\x92\xfaI\xf9\xc46x\x9c\xed\xc1
\x93#ASDF BLOCK INDEX
%YAML 1.1
--- [506]
...
```

## Summary

[ASDF](https://github.com/spacetelescope/asdf) is a relatively new hybrid YAML + binary blocks format.
It is implemented for Python only at the moment, and I really hope that
the community will help with covering other programming languages (Go, please!).
ASDF is very nice for storing scientific data such as tensors and suits
machine learning model serialization well. source{d} successfully uses it in
[Modelforge](https://github.com/src-d/modelforge) -
a framework to serialize and distribute MLonCode models.

Don't want to miss the next blog post about how source{d} ML team does R&D?
Subscribe to [our newsletter](http://go.sourced.tech/newsletter), follow
[@sourcedtech](https://twitter.com/sourcedtech) on Twitter and don't forget
about our [Paper Reading Club](https://github.com/src-d/reading-club).
