
feat(splitter,pipeline): robust, context-aware JSON splitting and pipeline integration (#7) #120


Open · wants to merge 2 commits into main

Conversation

@arabold (Owner) commented May 18, 2025

Overview

This PR implements robust, context-aware splitting for large JSON files and JSON code blocks, and fully integrates this logic into the document processing pipeline. It addresses #7 and related requirements for handling large JSON content from both Markdown and standalone files (local or HTTP sources).

Key Features

  • Recursive, Valid JSON Splitting:
    • Introduces JsonContentSplitter for recursive, context-aware splitting of large JSON arrays and objects, always producing valid JSON chunks.
    • Ensures chunk size limits are respected and splitting is robust for deeply nested or large structures.
  • Markdown JSON Code Block Support:
    • Integrates JSON splitting into SemanticMarkdownSplitter for code blocks, with chunk size enforcement and comprehensive tests.
  • Dedicated JSON Pipeline:
    • Adds JsonPipeline for processing standalone JSON files, with robust MIME type detection and chunking.
    • Supports both local file and HTTP/remote JSON sources.
  • Pipeline Integration:
    • Updates FileFetcher, LocalFileStrategy, WebScraperStrategy, and FetchUrlTool to detect and process JSON files using the new pipeline.
    • Ensures OpenAPI/Swagger JSON specs are supported (YAML not yet supported).
  • Testing:
    • Adds/updates unit, integration, and end-to-end tests for all entry points (Markdown, local files, HTTP, tool-based).
    • All JSON-related tests use structure-based assertions for robustness against formatting changes.
  • Code Quality:
    • Adheres to DRY, KISS, and SOLID principles.
    • All new/modified code is fully covered by TSDoc.
    • No use of the `any` type; robust type safety throughout.
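The core splitting idea can be sketched as below — a minimal illustration of the array case only (the real JsonContentSplitter also handles objects and nested structures); the function name and signature here are assumptions, not the actual API:

```typescript
// Minimal sketch (assumed API): greedily pack array items into chunks,
// each of which serializes to valid JSON within the size budget.
function splitJsonArray(items: unknown[], maxChunkSize: number): string[] {
  const chunks: string[] = [];
  let current: unknown[] = [];
  for (const item of items) {
    current.push(item);
    // If adding this item overflows the budget, emit the previous chunk
    // (unless the item stands alone — an oversized single item passes through).
    if (JSON.stringify(current).length > maxChunkSize && current.length > 1) {
      current.pop();
      chunks.push(JSON.stringify(current));
      current = [item];
    }
  }
  if (current.length > 0) chunks.push(JSON.stringify(current));
  return chunks;
}
```

For example, `splitJsonArray([1, 2, 3, 4, 5], 8)` yields `["[1,2,3]", "[4,5]"]` — every chunk round-trips through `JSON.parse`, which is the invariant the PR enforces.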

Files Changed

  • src/splitter/splitters/JsonContentSplitter.ts (new, implemented)
  • src/splitter/splitters/JsonContentSplitter.test.ts (new, tests)
  • src/splitter/SemanticMarkdownSplitter.ts (modified for JSON code block support)
  • src/splitter/SemanticMarkdownSplitter.test.ts (tests for JSON code block splitting)
  • src/utils/mimeTypeUtils.ts (added isJson helper, updated usage)
  • src/scraper/pipelines/JsonPipeline.ts (new, for JSON file processing)
  • src/scraper/pipelines/JsonPipeline.test.ts (new, tests)
  • src/scraper/fetcher/FileFetcher.ts (updated for .json MIME type)
  • src/scraper/strategies/LocalFileStrategy.ts (integrated JsonPipeline)
  • src/scraper/strategies/LocalFileStrategy.test.ts (integration test for JSON)
  • src/scraper/strategies/WebScraperStrategy.ts (integrated JsonPipeline)
  • src/scraper/strategies/WebScraperStrategy.test.ts (integration test for JSON)
  • src/tools/FetchUrlTool.ts (integrated JsonPipeline)
  • src/tools/FetchUrlTool.test.ts (integration test for JSON)

How It Works

  • When a large JSON file or JSON code block is encountered, the new splitter recursively divides it into valid, context-aware chunks.
  • The pipeline detects JSON files via MIME type or extension and routes them through the new JsonPipeline.
  • All entry points (Markdown, local files, HTTP, tool) are covered and tested.
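The routing decision can be sketched like this (the isJson name appears in mimeTypeUtils.ts above, but its exact signature here is an assumption):

```typescript
// Sketch (assumed signature): treat content as JSON if either the MIME type
// mentions json (application/json, application/ld+json, ...) or the file
// extension is .json.
function isJson(mimeType: string | undefined, path: string): boolean {
  if (mimeType && /\bjson\b/i.test(mimeType)) return true;
  return path.toLowerCase().endsWith(".json");
}
```

A strategy would then route the content through JsonPipeline when this returns true, and fall back to the HTML/Markdown pipeline otherwise.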

Testing & Verification

  • All new and existing tests pass (npx vitest run).
  • JSON-related tests are robust to formatting and focus on structure.
  • Manual and automated tests confirm correct chunking, pipeline selection, and end-to-end integration.
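A structure-based assertion compares parsed data rather than raw strings, so reformatting (whitespace, key order) cannot break a test. A hypothetical helper illustrating the idea:

```typescript
// Hypothetical helper: serialize a value with sorted object keys so that two
// structurally equal values always produce the same string.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalize).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => a.localeCompare(b))
      .map(([key, v]) => JSON.stringify(key) + ":" + canonicalize(v));
    return "{" + entries.join(",") + "}";
  }
  return JSON.stringify(value); // primitives: string, number, boolean, null
}
```

A test then checks `canonicalize(JSON.parse(chunk)) === canonicalize(expected)`, which passes regardless of how the chunk was pretty-printed.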

Closes #7

Please review the implementation and integration. Feedback welcome!

arabold added 2 commits May 17, 2025 17:33
…ize for #7

- Add JsonContentSplitter for recursive, size-aware JSON chunking
- Integrate with SemanticMarkdownSplitter for JSON code blocks
- Ensure code block wrapper is included in chunk size calculation
- Add and update tests for JSON splitting

Closes #7

- Implement JsonPipeline with recursive, valid-chunk splitting for large JSON files and code blocks
- Integrate JsonPipeline into local and remote (HTTP) file processing strategies
- Update FileFetcher to detect .json files and set correct MIME type
- Add/extend tests for end-to-end scraping and tool usage of JSON files (local and remote)
- Make all JSON-related tests robust to formatting (structure-based assertions)
- Ensure MIME type detection and pipeline selection are robust for JSON
- Update FetchUrlTool and strategies to support JSON pipeline

Closes #7
@arabold (Owner, Author) commented May 26, 2025

The reason I'm holding this back is that the new splitting logic makes it impossible to concatenate chunks again while retaining the original JSON structure. This might not be the best idea after all...
