This repository offers tools and utilities to run a benchmark of a set of Java-based triple stores, namely Corese, RDF4J and Jena. Its aim is to
- compare Corese with the other triplestores
- compare different versions of Corese
Its principles are
- Focusing on performance measurements, such as loading time, memory usage, query time, number of threads/CPUs, etc.
- Using the native core Java libraries instead of the server versions of the triplestores. The code is written in Groovy, one of the scripting languages available on the JVM.
- Producing reusable CSV exports of the performance measurements that can be used in other contexts.
- Building upon existing RDF or SPARQL benchmarks such as
  - Bowlogna SPARQL Benchmark
  - BSBM Berlin SPARQL Benchmark
  - DBPedia datasets
  - etc.
- The minisite with dynamic versions of the plots is available at ...
- You can also have a look at the image versions of the plots in the `dashboard` folder. See the HOW TO run it section below to run the benchmark locally and generate a new version of the plots.
There are 2 main parts of the code:
- The Groovy/Java code
  - is versioned in the `src` folder
  - processes the input data using the 3 triplestores: loading, and querying (WIP)
  - saves the CSV files containing the measurements in the `out` folder. Examples of previous runs are already given.
- The workflow automation code, written in Python and versioned in the `python-utils` folder. The main steps automated are:
  - creating the `input` folder, downloading and saving the input data in it
  - launching the `benchmark.groovy` script
  - launching the `plot-compare.py` script, which saves the plot files in the `public` folder

2 versions of the workflow are available:
- `workflow.py` to compare given versions of the 3 triplestores
- `workflow-corese-versions.py` to compare 2 or more given versions of Corese.
The latest results that we version in this repo are visible in the `dashboard` folder. If you run the benchmark yourself, updated plots will be saved in this folder.
## HOW TO run it

- First install the dependencies defined in `python-utils/environment.yml` using conda (see `python-utils`)
- Activate the Python environment:
```
conda activate benchmark_env
```
- Launch the script:
```
(benchmark_env) cd python-utils
(benchmark_env) python workflow.py
# or
(benchmark_env) python workflow-corese-versions.py
```
For `workflow-corese-versions.py`:
- modify the versions as required in the script file, by editing the following line:
```
coreseVersions = ["4.0.1","4.6.3","local"]
```
- if you want to test with a local version:
  - add `"local"` to the `coreseVersions` list
  - put the jar of the corese-core version in the `libs` directory
To run the Groovy benchmark directly:
- First build the execution environment:
```
./gradlew clean build
```
- then run it, not forgetting to pass the path to the input directory, the path to the output directory, and the list of triplestore names, e.g.:
```
./gradlew runGroovyScript --args="/path/to/directory /path/to/outdirectory rdf4j.5.1.2,jena.4.10.0,corese.4.6.3"
```
To generate the plots, assuming the Python environment `benchmark_env` has been activated:
```
(benchmark_env) cd python-utils
(benchmark_env) python plot-compare.py
# or, optionally indicating the folder to read the CSV files from
(benchmark_env) python plot-compare.py outputdirectory
```
It loops through the content of the given directory, plots the loading time and memory usage, and generates:
- a PNG and an HTML version of each plot
- an index.html file to be used as the dashboard
## Input datasets

- Bowlogna benchmark dataset (from this link)
  - A synthetic dataset built according to a model describing relations between students, universities, and course programs.
  - It is made of 10 files, formally equivalent, but each containing different data. Each file loaded adds ~1.2 million triples.
  - Total size: ~12 million triples
  - Reference article: SIMPDA2011 paper
- DBPedia dataset
  - DBPedia is an RDF translation effort of Wikipedia
  - We sampled 10 files from the dump folder available online: https://downloads.dbpedia.org/3.5.1/en/
    - redirects_en.nt
    - disambiguations_en.nt
    - homepages_en.nt
    - geo_coordinates_en.nt
    - instance_types_en.nt
    - category_labels_en.nt
    - skos_categories_en.nt
    - images_en.nt
    - specific_mappingbased_properties_en.nt
    - persondata_en.nt
  - Total size: ~20 million triples
## Measurements

- Memory consumption:
  - heap: the maximum memory available to the JVM
  - used: the used memory (or "used heap"), measured after a gc (garbage collector) call
  - calculate the delta of used memory before and after the processes to be tested, as sketched below:
    - after startup
    - before loading data
    - after data is loaded
    - after query execution
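A minimal Groovy sketch of this used-heap measurement (the `usedMemory` helper and the pause after the gc call are illustrative choices, not the repository's actual code):

```groovy
// Approximate used heap after suggesting a GC run.
// Note: gc() is only a hint to the JVM, so values are estimates.
long usedMemory() {
    Runtime rt = Runtime.getRuntime()
    rt.gc()
    Thread.sleep(100) // heuristic pause to let the collector finish
    return rt.totalMemory() - rt.freeMemory()
}

long before = usedMemory()   // e.g. before loading data
// ... load the dataset into the triplestore under test ...
long after = usedMemory()    // after data is loaded

println "max heap:        ${Runtime.getRuntime().maxMemory() / (1024 * 1024)} MB"
println "delta used heap: ${(after - before) / (1024 * 1024)} MB"
```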
- Time of execution
  - function to use: `System.currentTimeMillis()`, called before and after the process (see the sketch after this list)
  - what times to measure?
    - loading data
    - SPARQL queries
      - select
      - count
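A minimal sketch of this timing pattern (the `timeIt` helper and the closure bodies are placeholders, not the repository's actual code):

```groovy
// Wall-clock timing of an arbitrary step with System.currentTimeMillis().
long timeIt(String label, Closure step) {
    long start = System.currentTimeMillis()
    step.call()
    long elapsed = System.currentTimeMillis() - start
    println "${label}: ${elapsed} ms"
    return elapsed
}

timeIt("loading data") {
    // e.g. read the input files into the triplestore under test
}
timeIt("SPARQL select") {
    // e.g. execute a SELECT query and iterate over the results
}
```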
- Threads: bound to a series of JVM parameters (max, etc.)
  - number of threads actually used => `Thread.activeCount()` (see here)
  - other introspection methods (see the sketch below for one option from `java.lang.management`)
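For instance, a sketch combining `Thread.activeCount()` with the standard `ThreadMXBean` introspection API (one possible option, not necessarily the one the benchmark will use):

```groovy
import java.lang.management.ManagementFactory
import java.lang.management.ThreadMXBean

// Cheap, approximate: threads in the current thread group
println "active threads (current group): ${Thread.activeCount()}"

// JVM-wide view through the management API
ThreadMXBean threads = ManagementFactory.getThreadMXBean()
println "live threads:   ${threads.threadCount}"
println "peak threads:   ${threads.peakThreadCount}"
println "daemon threads: ${threads.daemonThreadCount}"
```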
- CPU usage (NTH-WIP), see the sketch below
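Since the CPU measurement is still to be defined, here is one possible approach, assuming the HotSpot-specific `com.sun.management.OperatingSystemMXBean` is available (an assumption, not the repository's current choice):

```groovy
import java.lang.management.ManagementFactory
import com.sun.management.OperatingSystemMXBean

// Assumes a HotSpot-compatible JVM exposing the com.sun.management extension
OperatingSystemMXBean os =
        (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean()

// Recent CPU load of this JVM process, between 0.0 and 1.0 (negative if unavailable)
println "process CPU load: ${os.processCpuLoad}"
// Total CPU time consumed by this process so far, in nanoseconds
println "process CPU time: ${os.processCpuTime} ns"
```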
- Number of inferred triples (NTH-WIP), see the sketch after this list:
  - look at the named graph used for the inferred triples
  - inMemoryStore
  - inference level (check whether levels are comparable across stores):
    - no inference
    - RDFS
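One way to count inferred triples, sketched here with Jena's built-in RDFS reasoner (an illustration of the idea only, with a placeholder input path; RDF4J and Corese expose inference through their own APIs):

```groovy
import org.apache.jena.rdf.model.InfModel
import org.apache.jena.rdf.model.Model
import org.apache.jena.rdf.model.ModelFactory
import org.apache.jena.riot.RDFDataMgr

// Base model, no inference
Model base = ModelFactory.createDefaultModel()
RDFDataMgr.read(base, "input/data.nt") // placeholder path

// Same data wrapped with the RDFS reasoner
InfModel rdfs = ModelFactory.createRDFSModel(base)

// Materialise all statements to count the entailed ones
// (fine for small models, expensive for large ones)
long withInference = rdfs.listStatements().toList().size()
long inferred = withInference - base.size()
println "inferred triples (RDFS): ${inferred}"
```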
- formats to be parsed (see the parsing sketch below):
  - nt
  - turtle
  - trig
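For reference, a minimal sketch of reading the three formats with Jena's RIOT reader (the file paths are placeholders; RDF4J and Corese ship their own parsers):

```groovy
import org.apache.jena.query.Dataset
import org.apache.jena.query.DatasetFactory
import org.apache.jena.rdf.model.Model
import org.apache.jena.rdf.model.ModelFactory
import org.apache.jena.riot.Lang
import org.apache.jena.riot.RDFDataMgr

Model m = ModelFactory.createDefaultModel()
RDFDataMgr.read(m, "input/data.nt", Lang.NTRIPLES) // N-Triples
RDFDataMgr.read(m, "input/data.ttl", Lang.TURTLE)  // Turtle

// TriG carries named graphs, so it is read into a dataset rather than a model
Dataset ds = DatasetFactory.create()
RDFDataMgr.read(ds, "input/data.trig", Lang.TRIG)

println "default model size: ${m.size()}, named graphs: ${ds.listNames().toList().size()}"
```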