Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate code snippets in datasets documentation - Part 1 #1962

Merged
merged 43 commits into from
Nov 9, 2022

Conversation

AhdraMeraliQB
Copy link
Contributor

@AhdraMeraliQB AhdraMeraliQB commented Oct 21, 2022

Description

This follows on from #2287. In this PR I have tested all the datasets that can be instantiated locally. The following datasets remain to be tested in part 2:

  • dask.ParquetDataSet
  • holoviews.HoloviewsWriter
  • pandas.GBQQueryDataSet
  • pandas.GBQTableDataSet
  • pandas.SQLQueryDataSet
  • pandas.SQLTableDataSet
  • redis.PickleDataSet
  • all Spark datasets
  • tensorflow.TensorFlowModelDataset

The main changes in this PR are as follows:

  • Some code blocks had duplicate lines for providing filepath arguments. I believe this was to illustrate both local and external (S3/GCS) filesytems, but after discussion with @MerelTheisenQB we have agreed this is a bit redundant.
  • Some comments rendered with weird line breaks. This was fixed.
  • There was a comment for a dependency to be installed within an example code block. This has been moved out to extras_require in setup.py
  • Added missing imports

Also warranting a flag 🚩: GraphMLDataSet and GMLDataSet appear to be identical in purpose and in implementation

Development notes

Both the YAML and Python examples have been tested under the following setup:

  • New conda environment
  • pip install kedro
  • pip install kedro[all]
For Python:

kedro jupyter lab
Code snippets were copied and pasted over
An additional cell was added to check <dataset name>.exists() returned True

For YAML:

kedro new -s spaceflights
The dataset was added to the data catalog in conf/base/catalog.yml and added to the pipeline arbitrarily.

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

@@ -28,17 +28,13 @@ class JSONDataSet(AbstractVersionedDataSet[Any, Any]):
>>> json_dataset:
>>> type: json.JSONDataSet
>>> filepath: data/01_raw/location.json
>>> load_args:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch!

noklam and others added 26 commits October 21, 2022 13:19
* Update release version and release notes

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* Update missing release notes

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* update vresion

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* update release notes

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
* Fix code snippets

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>

* Separate code blocks

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>

* Lint

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: jstammers <jimmy.stammers@cgastrategy.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
* Update RELEASE.md

* fix broken link

* Update RELEASE.md

Co-authored-by: Merel Theisen <49397448+MerelTheisenQB@users.noreply.github.com>

Co-authored-by: Merel Theisen <49397448+MerelTheisenQB@users.noreply.github.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
* Deprecating `kedro test` and `kedro lint`

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* Deprecate commands

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* Make kedro looks prettier

* Update Linting

Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Nok <nok_lam_chan@mckinsey.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Florian Gaudin-Delrieu <florian.gaudindelrieu@gmail.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
…related issues better (#1881)

* Update message for VersionNotFoundError

Signed-off-by: Ankita Katiyar <110245118+ankatiyar@users.noreply.github.com>

* Add test for VersionNotFoundError for cloud protocols

* Update test_data_catalog.py

Update NoVersionFoundError test

* minor linting update

* update docs link + styling changes

* Revert "update docs link + styling changes"

This reverts commit 6088e00.

* Update test with styling changes

* Update RELEASE.md

Signed-off-by: ankatiyar <ankitakatiyar2401@gmail.com>

Signed-off-by: Ankita Katiyar <110245118+ankatiyar@users.noreply.github.com>
Signed-off-by: ankatiyar <ankitakatiyar2401@gmail.com>
Co-authored-by: Ahdra Merali <90615669+AhdraMeraliQB@users.noreply.github.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
* Add best_practices.md with introductory sections

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add pytest and pytest-cov sections

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add pytest-cov coverage report

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add sections on pytest-cov

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add automated_testing to index.rst

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Reformat third-party library names and clean grammar.

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add link to virtual environment docs

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add example of good test naming

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Improve link accessibility

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Improve pytest docs link accessibility

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add reminder link to virtual environment docs

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Fix formatting in link to coverage docs

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove reference to /src under 'Run your tests'

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify references to <project_name> to <package_name>

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Fix sentence structure

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Fix broken databricks doc link

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
* Capitalised kedro-viz

Signed-off-by: yash6318 <yash.agrawal.cse21@iitbhu.ac.in>

* capitalised Kedro viz

Signed-off-by: yash6318 <yash.agrawal.cse21@iitbhu.ac.in>

* Updated set_up_experiment_tracking.md

Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: yash6318 <yash.agrawal.cse21@iitbhu.ac.in>

Signed-off-by: yash6318 <yash.agrawal.cse21@iitbhu.ac.in>
Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Carla Vieira <carlaprv@hotmail.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
* Change a file extention to match the previous article

Signed-off-by: dinotuku <kuan.tung@epfl.ch>

* Add a missing import

Signed-off-by: dinotuku <kuan.tung@epfl.ch>

* Change both preprocessed datasets to parquet files

Signed-off-by: dinotuku <kuan.tung@epfl.ch>

* Change data type to ParquetDataSet for parquet files

Signed-off-by: dinotuku <kuan.tung@epfl.ch>

* Add a note for installing seaborn if it is not installed

Signed-off-by: dinotuku <kuan.tung@epfl.ch>

Signed-off-by: dinotuku <kuan.tung@epfl.ch>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
* Add documentation for linting tools

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Revert changes to commands_reference.md

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Update linting docs with suggestions

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Update linting doc

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
* Create dependabot.yml configuration file

* Update dependabot.yml

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

* add target-branch

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

* Update dependabot.yml

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

* limit dependabot to just dependency folder

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

* Update test_requirements.txt

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

* Update MANIFEST.in

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

* fix e2e

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

* Update continue_config.yml

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

* Update requirements.txt

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

* Update requirements.txt

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

* fix link

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

* revert

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

* Delete requirements.txt

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
SajidAlamQB and others added 9 commits October 21, 2022 13:19
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
* Update dependabot.yml

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

* pin jupyterlab_services to requirments

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

* lint

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Updates the requirements on [pip-tools](https://github.com/jazzband/pip-tools) to permit the latest version.
- [Release notes](https://github.com/jazzband/pip-tools/releases)
- [Changelog](https://github.com/jazzband/pip-tools/blob/master/CHANGELOG.md)
- [Commits](jazzband/pip-tools@6.5.0...6.9.0)

---
updated-dependencies:
- dependency-name: pip-tools
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Updates the requirements on [toposort]() to permit the latest version.

---
updated-dependencies:
- dependency-name: toposort
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sajid Alam <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
…1953)

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
* remove a redundant function call

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* Remove redundant resolove_load_version & fix test

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* Fix HoloviewWriter tests with more specific error message pattern & Lint

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* Rename tests

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
@merelcht
Copy link
Member

I found the PR where the GraphMLDataSet and GMLDataSet were added: #881 It looks like both of them were added because "Included GraphML in addition to GML. Some tools prefer GraphML as it's able to perfectly preserve internal property maps (ref)."

So let's just keep them both.

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work on going through all these examples and testing them! 👍 ⭐

@@ -28,17 +28,13 @@ class JSONDataSet(AbstractVersionedDataSet[Any, Any]):
>>> json_dataset:
>>> type: json.JSONDataSet
>>> filepath: data/01_raw/location.json
>>> load_args:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch!

setup.py Outdated
@@ -78,6 +78,7 @@ def _collect_requirements(requires):
"pandas.XMLDataSet": [PANDAS, "lxml~=4.6"],
"pandas.GenericDataSet": [PANDAS],
}
pickle_require = {"pickle.PickleDataSet": ["compress-pickle~=2.1.0"]}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original comment said: "# Add "compress_pickle[lz4]" to requirements.txt" Do we need to specify [lz4] for this requirement?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch - this is what the pip list showed after manually installing compress_pickle[lz4], I'll make the change to follow the dask example further up

@AhdraMeraliQB
Copy link
Contributor Author

@MerelTheisenQB

I found the PR where the GraphMLDataSet and GMLDataSet were added: #881 It looks like both of them were added because "Included GraphML in addition to GML. Some tools prefer GraphML as it's able to perfectly preserve internal property maps (ref)."

So let's just keep them both.

I took a look at the ref but couldn't see the preference; is it the name that they prefer? As the implementation is carbon copies 😬

Copy link
Contributor

@ankatiyar ankatiyar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@merelcht
Copy link
Member

I took a look at the ref but couldn't see the preference; is it the name that they prefer? As the implementation is carbon copies 😬

The difference between GraphML and GML (apparently) is that GraphML preserves internal property maps, but GML doesn't. See this comment in the docs: "The only file formats which are capable of perfectly preserving the internal property maps are “gt” and “graphml”. Because of this, they should be preferred over the other formats whenever possible." https://graph-tool.skewed.de/static/doc/graph_tool.html#graph_tool.load_graph

Perhaps the question is whether our users make use of both the datasets or not. It's not an extreme maintenance burden so I'd just keep them both.

This was referenced Nov 7, 2022
@merelcht merelcht mentioned this pull request Nov 7, 2022
5 tasks
@AhdraMeraliQB AhdraMeraliQB self-assigned this Nov 8, 2022
@AhdraMeraliQB AhdraMeraliQB merged commit f52249b into main Nov 9, 2022
@AhdraMeraliQB AhdraMeraliQB deleted the fix/check-code-snippets branch November 9, 2022 08:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.