Enhance README badges #59

Merged · 4 commits · Mar 19, 2024
132 changes: 65 additions & 67 deletions README.md
@@ -4,13 +4,13 @@
[![Stable Version](https://poser.pugx.org/jbzoo/csv-blueprint/version)](https://packagist.org/packages/jbzoo/csv-blueprint/) [![Total Downloads](https://poser.pugx.org/jbzoo/csv-blueprint/downloads)](https://packagist.org/packages/jbzoo/csv-blueprint/stats) [![Docker Pulls](https://img.shields.io/docker/pulls/jbzoo/csv-blueprint.svg)](https://hub.docker.com/r/jbzoo/csv-blueprint) [![Dependents](https://poser.pugx.org/jbzoo/csv-blueprint/dependents)](https://packagist.org/packages/jbzoo/csv-blueprint/dependents?order_by=downloads) [![GitHub License](https://img.shields.io/github/license/jbzoo/csv-blueprint)](https://github.com/JBZoo/Csv-Blueprint/blob/master/LICENSE)

<!-- rules-counter -->
![Static Badge](https://img.shields.io/badge/Rules-100-green?label=Total%20Number%20of%20Rules&labelColor=blue&color=gray) ![Static Badge](https://img.shields.io/badge/Rules-55-green?label=Cell%20Rules&labelColor=blue&color=gray) ![Static Badge](https://img.shields.io/badge/Rules-45-green?label=Aggregate%20Rules&labelColor=blue&color=gray)
[![Static Badge](https://img.shields.io/badge/Rules-102-green?label=Total%20Number%20of%20Rules&labelColor=blue&color=gray)](schema-examples/full.yml) [![Static Badge](https://img.shields.io/badge/Rules-55-green?label=Cell%20Rules&labelColor=blue&color=gray)](src/Rules/Cell) [![Static Badge](https://img.shields.io/badge/Rules-45-green?label=Aggregate%20Rules&labelColor=blue&color=gray)](src/Rules/Aggregate) [![Static Badge](https://img.shields.io/badge/Rules-2-green?label=Extra%20Checks&labelColor=blue&color=gray)](schema-examples/full.yml)
<!-- /rules-counter -->

## Introduction

The CSV Blueprint tool is a powerful and flexible utility designed for validating CSV files against
a predefined schema specified in YAML format. With the capability to run both locally and in Docker environments,
a pre-defined schema specified in YAML format. With the capability to run both locally and in Docker environments,
CSV Blueprint is an ideal choice for integrating into CI/CD pipelines, such as GitHub Actions,
to ensure the integrity of CSV data in your projects.

@@ -35,7 +35,7 @@ Integrating CSV validation into CI processes promotes higher data integrity, rel
* **Comprehensive Rule Set**: Includes a broad set of validation rules, such as non-empty fields, exact values, regular expressions, numeric constraints, date formats, and more, catering to various data validation needs.
* **Docker Support**: Easily integrate into any workflow with Docker, providing a seamless experience for development, testing, and production environments.
* **GitHub Actions Integration**: Automate CSV validation in your CI/CD pipeline, enhancing the quality control of your data in pull requests and deployments.
* **Various ways to report:** issues that can be easily integrated with GithHub, Gitlab, TeamCity, etc. The default output is a human-readable table. [See Live Demo](https://github.com/JBZoo/Csv-Blueprint-Demo).
* **Various ways to report:** issues that can be easily integrated with GitHub, Gitlab, TeamCity, etc. The default output is a human-readable table. [See Live Demo](https://github.com/JBZoo/Csv-Blueprint-Demo).


## Live Demo
@@ -51,8 +51,8 @@ Integrating CSV validation into CI processes promotes higher data integrity, rel
* [demo.csv](tests/fixtures/demo.csv)


### Schema Definition
Define your CSV validation schema in a [YAML](schema-examples/full.yml). Other formats are also available: , [JSON](schema-examples/full.json), [PHP](schema-examples/full.php).
### Schema definition
Define your CSV validation schema in a [YAML](schema-examples/full.yml). Other formats are also available: [JSON](schema-examples/full.json), [PHP](schema-examples/full.php).

This example defines a simple schema for a CSV file with a header row, specifying that the `id` column must not be empty and must contain integer values.
Also, it checks that the `name` column has a minimum length of 3 characters.
@@ -74,7 +74,7 @@ columns:
```
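Since the schema block itself is collapsed in the diff view, here is an illustrative sketch of what such a schema might look like. The rule keys below are assumptions based on the prose above — see [schema-examples/full.yml](schema-examples/full.yml) for the canonical list of rules:

```yml
# Hypothetical schema sketch — rule keys assumed from the description above
columns:
  - name: id
    rules:
      not_empty: true
      is_int: true

  - name: name
    rules:
      length_min: 3
```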


### Full description of the scheme
### Full description of the schema

In the [example Yml file](schema-examples/full.yml) you can find a detailed description of all features.
It's also covered by tests, so it's always up-to-date.
@@ -351,7 +351,7 @@ columns:
You can find launch examples in the [workflow demo](https://github.com/JBZoo/Csv-Blueprint/actions/workflows/demo.yml).


### As GitHub Action
### GitHub Action

<!-- github-actions-yml -->
```yml
# @@ -383,7 +383,7 @@
```
<!-- /github-actions-yml -->
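Since the workflow block above is collapsed in the diff view, here is a rough sketch of what a step using this action might look like. This is illustrative only — the input names `csv` and `schema` are assumptions; check the action's documentation for the exact interface:

```yml
# Hypothetical GitHub Actions step — input names are assumptions
- name: Validate CSV
  uses: jbzoo/csv-blueprint@master
  with:
    csv: ./tests/fixtures/demo.csv
    schema: ./tests/schemas/demo_valid.yml
```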

**Note**. Report format for GitHub Actions is `github` by default. See [GitHub Actions friendly](https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions#setting-a-warning-message) and [PR as a live demo](https://github.com/JBZoo/Csv-Blueprint-Demo/pull/1/files).
**Note**. GitHub Actions report format is `github` by default. See [GitHub Actions friendly](https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions#setting-a-warning-message) and [PR as a live demo](https://github.com/JBZoo/Csv-Blueprint-Demo/pull/1/files).

This allows you to see bugs in the GitHub interface at the PR level.
That is, the error will be shown in a specific place in the CSV file right in the diff of your Pull Requests! [See example](https://github.com/JBZoo/Csv-Blueprint-Demo/pull/1/files).
@@ -398,7 +398,7 @@ That is, the error will be shown in a specific place in the CSV file right in di
</details>


### As Docker container
### Docker container
Ensure you have Docker installed on your machine.

```sh
# @@ -408,15 +408,15 @@
docker pull jbzoo/csv-blueprint
# Run the tool inside Docker
docker run --rm \
--workdir=/parent-host \
-v .:/parent-host \
-v $(pwd):/parent-host \
jbzoo/csv-blueprint \
validate:csv \
--csv=./tests/fixtures/demo.csv \
--schema=./tests/schemas/demo_invalid.yml
```


### As PHP binary
### PHP binary
Ensure you have PHP installed on your machine.

**Status: WIP**. It's not released yet, but you can build it from source. See the manual above and the `./build/csv-blueprint.phar` file.
@@ -430,7 +430,7 @@ chmod +x ./csv-blueprint.phar
```


### As PHP project
### PHP project
Ensure you have PHP installed on your machine.
Then, you can use the following commands to build from source and run the tool.

@@ -444,7 +444,7 @@ make build
```


### CLI Help Message
### Complete CLI Help Message

Here you can see all available options and commands. The tool uses the [JBZoo/Cli](https://github.com/JBZoo/Cli) package for the CLI interface,
so there are options for every occasion.
@@ -559,55 +559,54 @@ Optional format `text` with highlighted keywords:

These are random ideas and plans, with no ordering or deadlines. <u>But batch processing is priority #1</u>.

**Batch processing**
* If option `--csv` is not specified, then the STDIN is used. To build a pipeline in Unix-like systems.
* Discovering CSV files by `filename_pattern` in the schema file. In case you have a lot of schemas and a lot of CSV files and want to automate the process as one command.
* Flag to ignore file name pattern. It's useful when you have a lot of files, and you don't want to validate the file name.

**Validation**
* [More aggregate rules](https://github.com/markrogoyski/math-php#statistics---descriptive).
* [More cell rules](https://github.com/Respect/Validation).
* `required` flag for the column.
* Custom cell rule as a callback. It's useful when you have a complex rule that can't be described in the schema file.
* Custom aggregate rule as a callback. It's useful when you have a complex rule that can't be described in the schema file.
* Configurable keyword for null/empty values. By default, it's an empty string, but you may want to use `null`, `nil`, `none`, `empty`, etc. Overridable at the column level.
* Handle empty files and files with only a header row, or only with one line of data. One column without header is also possible.
* Using multiple schemas for one CSV file.
* Inheritance of schemas, rules and columns. Define parent schema and override some rules in the child schemas. Make it DRY and easy to maintain.
* Validate syntax and options in the schema file. It's important to know if the schema file is valid and can be used for validation.
* If option `--schema` is not specified, then validate only super base level things (like "is it a CSV file?").
* Complex rules (like "if field `A` is not empty, then field `B` should be not empty too").
* Extending with custom rules and custom report formats. Plugins?
* Input encoding detection + `BOM` (right now it's experimental). It works but not so accurate... UTF-8/16/32 is the best choice for now.

**Release workflow**
* Build and release Docker image [via GitHub Actions, tags and labels](https://docs.docker.com/build/ci/github-actions/manage-tags-labels/). Review it.
* Build phar file and release via GitHub Actions.
* Auto insert tool version into the Docker image and phar file. It's important to know the version of the tool you are using.
* Show version as part of output.

**Performance and optimization**
* Benchmarks as part of the CI(?) and Readme. It's important to know how much time the validation process takes.
* Optimization on `php.ini` level to start it faster. JIT, opcache, preloading, etc.
* Parallel validation of really-really large files (1GB+ ?). I know you have them and not so much memory.
* Parallel validation of multiple files at once.

**Mock data generation**
* Create CSV files based on the schema (like "create 1000 rows with random data based on schema and rules").
* Use [Faker](https://github.com/FakerPHP/Faker) for random data generation.

**Reporting**
* More report formats (like JSON, XML, etc). Any ideas?
* Gitlab and JUnit reports must be as one structure. It's not so easy to implement. But it's a good idea.
* Merge reports from multiple CSV files into one report. It's useful when you have a lot of files and you want to see all errors in one place. Especially for GitLab and JUnit reports.

**Misc**
* Use it as PHP SDK. Examples in Readme.
* Warnings about deprecated options and features.
* Warnings about invalid schema files.
* Move const:HELP to PHP annotations. Canonic way to describe the command.
* S3 Storage support. Validate files in the S3 bucket?
* More examples and documentation.
* **Batch processing**
* If option `--csv` is not specified, then the STDIN is used. To build a pipeline in Unix-like systems.
* Discovering CSV files by `filename_pattern` in the schema file. In case you have a lot of schemas and a lot of CSV files and want to automate the process as one command.
* Flag to ignore file name pattern. It's useful when you have a lot of files, and you don't want to validate the file name.

* **Validation**
* [More aggregate rules](https://github.com/markrogoyski/math-php#statistics---descriptive).
* [More cell rules](https://github.com/Respect/Validation).
* `required` flag for the column.
* Custom cell rule as a callback. It's useful when you have a complex rule that can't be described in the schema file.
* Custom aggregate rule as a callback. It's useful when you have a complex rule that can't be described in the schema file.
* Configurable keyword for null/empty values. By default, it's an empty string, but you may want to use `null`, `nil`, `none`, `empty`, etc. Overridable at the column level.
* Handle empty files and files with only a header row, or only with one line of data. One column without header is also possible.
* Using multiple schemas for one CSV file.
* Inheritance of schemas, rules and columns. Define parent schema and override some rules in the child schemas. Make it DRY and easy to maintain.
* If option `--schema` is not specified, then validate only super base level things (like "is it a CSV file?").
* Complex rules (like "if field `A` is not empty, then field `B` should be not empty too").
* Extending with custom rules and custom report formats. Plugins?
* Input encoding detection + `BOM` (right now it's experimental). It works but not so accurate... UTF-8/16/32 is the best choice for now.

* **Release workflow**
* Build and release Docker image [via GitHub Actions, tags and labels](https://docs.docker.com/build/ci/github-actions/manage-tags-labels/). Review it.
* Build phar file and release via GitHub Actions.
* Auto insert tool version into the Docker image and phar file. It's important to know the version of the tool you are using.
* Show version as part of output.

* **Performance and optimization**
* Benchmarks as part of the CI(?) and Readme. It's important to know how much time the validation process takes.
* Optimization on `php.ini` level to start it faster. JIT, opcache, preloading, etc.
* Parallel validation of really-really large files (1GB+ ?). I know you have them and not so much memory.
* Parallel validation of multiple files at once.

* **Mock data generation**
* Create CSV files based on the schema (like "create 1000 rows with random data based on schema and rules").
* Use [Faker](https://github.com/FakerPHP/Faker) for random data generation.

* **Reporting**
* More report formats (like JSON, XML, etc). Any ideas?
* Gitlab and JUnit reports must be as one structure. It's not so easy to implement. But it's a good idea.
* Merge reports from multiple CSV files into one report. It's useful when you have a lot of files and you want to see all errors in one place. Especially for GitLab and JUnit reports.

* **Misc**
* Use it as PHP SDK. Examples in Readme.
* Warnings about deprecated options and features.
* Warnings about invalid schema files.
* Move const:HELP to PHP annotations. Canonic way to describe the command.
* S3 Storage support. Validate files in the S3 bucket?
* More examples and documentation.


PS. [There is a file](tests/schemas/todo.yml) with my ideas and imagination. It's not a valid schema file, just a draft.
@@ -616,12 +615,12 @@ I'm not sure if I will implement all of them. But I will try to do my best.

## Disadvantages?

There is a perception that PHP is a slow language. I don't agree with that. You just need to know how to prepare it.
It is perceived that PHP is a slow language. I don't agree with that. You just need to know how to prepare it.
See [Processing One Billion CSV rows in PHP!](https://dev.to/realflowcontrol/processing-one-billion-rows-in-php-3eg0).
That is, if you do everything right, you can read, aggregate and calculate data from CSV at **~15 million lines per second**!

* Yeah-yeah. I know it's not the fastest tool in the world. But it's not the slowest either. See link above.
* Yeah-yeah. I know it's PHP (not Python, Go, Pyspark...). PHP is not the best language for such tasks.
* Yeah-yeah. I know it's PHP (not Python, Go, PySpark...). PHP is not the best language for such tasks.
* Yeah-yeah. It looks like a standalone binary. Right. Just use it, don't think about how it works.
* Yeah-yeah. I know you can't use it as a Python SDK as part of a pipeline.

@@ -637,10 +636,9 @@ So... as strictly as possible in today's PHP world. I think it works as expected

## Interesting fact

I think I've set a personal record.
The first version was written from scratch in about 3 days (with really frequent breaks to take care of 4 month baby).
I've set a personal record. The first version was written from scratch in about 3 days (with really frequent breaks to take care of a 4-month-old baby).
Looking at the first commit and the very first git tag, I'd say it took a weekend, in my spare time on my personal laptop.
Well... AI I only used for this Readme file because I'm not very good at English. 🤔
Well... AI was only used for this Readme file because I'm not very good at English. 🤔

I seem to be typing fast and I had really great inspiration. I hope my wife doesn't divorce me. 😅

21 changes: 14 additions & 7 deletions tests/ReadmeTest.php
@@ -64,19 +64,26 @@ public function testBadgeOfRules(): void
{
$cellRules = \count(yml(Tools::SCHEMA_FULL_YML)->findArray('columns.0.rules'));
$aggRules = \count(yml(Tools::SCHEMA_FULL_YML)->findArray('columns.0.aggregate_rules'));
$totalRules = $cellRules + $aggRules;
$extraRules = 1 + 1; // filename_pattern, schema validation
$totalRules = $cellRules + $aggRules + $extraRules;

$badge = static function (string $label, int $count): string {
$badge = static function (string $label, int $count, string $url = ''): string {
$label = \str_replace(' ', '%20', $label);

return "![Static Badge](https://img.shields.io/badge/Rules-{$count}-green" .
$badge = "![Static Badge](https://img.shields.io/badge/Rules-{$count}-green" .
"?label={$label}&labelColor=blue&color=gray)";

if ($url) {
return "[{$badge}]({$url})";
}

return $badge;
};

$text = \implode(' ', [
$badge('Total Number of Rules', $totalRules),
$badge('Cell Rules', $cellRules),
$badge('Aggregate Rules', $aggRules),
$badge('Total Number of Rules', $totalRules, 'schema-examples/full.yml'),
$badge('Cell Rules', $cellRules, 'src/Rules/Cell'),
$badge('Aggregate Rules', $aggRules, 'src/Rules/Aggregate'),
$badge('Extra Checks', $extraRules, 'schema-examples/full.yml'),
]);

Tools::insertInReadme('rules-counter', $text);
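For readers who don't write PHP, the badge-building helper above can be sketched in Python. The shields.io URL format is taken from the README badges themselves; the function name is mine:

```python
def badge(label: str, count: int, url: str = "") -> str:
    """Build a shields.io static-badge Markdown snippet, optionally wrapped in a link."""
    encoded = label.replace(" ", "%20")  # shields.io expects URL-encoded spaces
    md = (
        f"![Static Badge](https://img.shields.io/badge/Rules-{count}-green"
        f"?label={encoded}&labelColor=blue&color=gray)"
    )
    # If a target URL is given, wrap the image in a Markdown link
    return f"[{md}]({url})" if url else md


print(badge("Cell Rules", 55, "src/Rules/Cell"))
```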
11 changes: 11 additions & 0 deletions tests/SchemaTest.php
@@ -325,4 +325,15 @@ public function testMatchTypes(): void

isSame([], $invalidPairs);
}

public function testTodoList(): void
{
isSame(
[],
Tools::findKeysToRemove(
yml(Tools::SCHEMA_FULL_YML)->getArrayCopy(),
yml(Tools::SCHEMA_TODO)->getArrayCopy(),
),
);
}
}
24 changes: 24 additions & 0 deletions tests/Tools.php
@@ -41,6 +41,8 @@ final class Tools
public const SCHEMA_FULL_PHP = './schema-examples/full.php';
public const SCHEMA_INVALID = './tests/schemas/invalid_schema.yml';

public const SCHEMA_TODO = './tests/schemas/todo.yml';

public const DEMO_YML_VALID = './tests/schemas/demo_valid.yml';
public const DEMO_YML_INVALID = './tests/schemas/demo_invalid.yml';
public const DEMO_CSV = './tests/fixtures/demo.csv';
@@ -117,4 +119,26 @@ public static function insertInReadme(string $code, string $content): void

isFileContains($result, self::README);
}

public static function findKeysToRemove(array $current, array $todo, string $path = ''): array
{
$keysToRemove = [];

foreach ($todo as $key => $value) {
$currentPath = $path === '' ? $key : $path . '.' . $key;

if (\array_key_exists($key, $current)) {
if (\is_array($value) && \is_array($current[$key])) {
$keysToRemove = \array_merge(
$keysToRemove,
self::findKeysToRemove($current[$key], $value, $currentPath),
);
} else {
$keysToRemove[] = $currentPath;
}
}
}

return $keysToRemove;
}
}
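The `findKeysToRemove` helper above walks the TODO schema and reports dotted paths of keys that already exist in the full schema (so they can be removed from the draft). A rough Python equivalent — a dict-only sketch, whereas the PHP version works on arbitrary nested arrays:

```python
def find_keys_to_remove(current: dict, todo: dict, path: str = "") -> list[str]:
    """Return dotted paths of leaf keys in `todo` that also exist in `current`."""
    keys: list[str] = []
    for key, value in todo.items():
        current_path = key if path == "" else f"{path}.{key}"
        if key in current:
            if isinstance(value, dict) and isinstance(current[key], dict):
                # Recurse into nested mappings instead of reporting the whole branch
                keys += find_keys_to_remove(current[key], value, current_path)
            else:
                keys.append(current_path)
    return keys


# A key kept in the TODO draft that already exists in the full schema is flagged:
print(find_keys_to_remove({"a": {"b": 1}, "c": 2}, {"a": {"b": 9}, "d": 3}))  # → ['a.b']
```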