feat(splitter,pipeline): robust, context-aware JSON splitting and pipeline integration (#7) #120
+454
−6
Overview
This PR implements robust, context-aware splitting for large JSON files and JSON code blocks, and fully integrates this logic into the document processing pipeline. It addresses #7 and related requirements for handling large JSON content from both Markdown and standalone files (local or HTTP sources).
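To make the idea of recursive splitting into valid JSON chunks concrete, here is a minimal sketch. It is not the PR's `JsonContentSplitter`: the `splitJson` function, its `maxChars` parameter, and the `Json` type are hypothetical names chosen for illustration. The sketch also simplifies in one respect: when it recurses into an oversized object property it drops the parent key, which a context-aware splitter would presumably keep.

```typescript
// Hypothetical illustration only; names and signature are assumptions,
// not the API introduced by this PR.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

interface Item {
  key: string; // property name (unused for array elements)
  val: Json;
}

function splitJson(value: Json, maxChars: number): string[] {
  const whole = JSON.stringify(value);
  // Values that fit, and primitives (which cannot be divided without
  // producing invalid JSON), are emitted as a single chunk.
  if (whole.length <= maxChars || value === null || typeof value !== "object") {
    return [whole];
  }

  const isArray = Array.isArray(value);
  const obj = value as { [key: string]: Json };
  const items: Item[] = isArray
    ? (value as Json[]).map((v) => ({ key: "", val: v }))
    : Object.keys(obj).map((k) => ({ key: k, val: obj[k] }));

  // Serialize a group of items back into an array or object of the same kind.
  const render = (group: Item[]): string => {
    if (isArray) {
      return JSON.stringify(group.map((i) => i.val));
    }
    const out: { [key: string]: Json } = {};
    for (const i of group) {
      out[i.key] = i.val;
    }
    return JSON.stringify(out);
  };

  const chunks: string[] = [];
  let batch: Item[] = [];
  const flush = (): void => {
    if (batch.length > 0) {
      chunks.push(render(batch));
      batch = [];
    }
  };

  for (const item of items) {
    if (render([item]).length > maxChars) {
      // Too large even on its own: emit what we have, then recurse.
      flush();
      for (const sub of splitJson(item.val, maxChars)) {
        chunks.push(sub);
      }
      continue;
    }
    // Start a new chunk when adding this item would exceed the budget.
    if (render(batch.concat([item])).length > maxChars) {
      flush();
    }
    batch.push(item);
  }
  flush();
  return chunks;
}
```

For example, `splitJson([1, 2, 3, 4, 5, 6], 7)` returns `["[1,2,3]", "[4,5,6]"]`, and every chunk can be passed to `JSON.parse` on its own. Grouping siblings greedily keeps related elements together while staying under the size limit.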
Key Features
- `JsonContentSplitter` for recursive, context-aware splitting of large JSON arrays and objects, always producing valid JSON chunks.
- `SemanticMarkdownSplitter` support for JSON code blocks, with chunk size enforcement and comprehensive tests.
- `JsonPipeline` for processing standalone JSON files, with robust MIME type detection and chunking.
- `FileFetcher`, `LocalFileStrategy`, `WebScraperStrategy`, and `FetchUrlTool` updated to detect and process JSON files using the new pipeline.
- No `any` type; robust type safety throughout.
Files Changed
- `src/splitter/splitters/JsonContentSplitter.ts` (new, implemented)
- `src/splitter/splitters/JsonContentSplitter.test.ts` (new, tests)
- `src/splitter/SemanticMarkdownSplitter.ts` (modified for JSON code block support)
- `src/splitter/SemanticMarkdownSplitter.test.ts` (tests for JSON code block splitting)
- `src/utils/mimeTypeUtils.ts` (added `isJson` helper, updated usage)
- `src/scraper/pipelines/JsonPipeline.ts` (new, for JSON file processing)
- `src/scraper/pipelines/JsonPipeline.test.ts` (new, tests)
- `src/scraper/fetcher/FileFetcher.ts` (updated for `.json` MIME type)
- `src/scraper/strategies/LocalFileStrategy.ts` (integrated `JsonPipeline`)
- `src/scraper/strategies/LocalFileStrategy.test.ts` (integration test for JSON)
- `src/scraper/strategies/WebScraperStrategy.ts` (integrated `JsonPipeline`)
- `src/scraper/strategies/WebScraperStrategy.test.ts` (integration test for JSON)
- `src/tools/FetchUrlTool.ts` (integrated `JsonPipeline`)
- `src/tools/FetchUrlTool.test.ts` (integration test for JSON)
How It Works
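JSON files from local or HTTP sources are detected by MIME type and processed through `JsonPipeline`.
Testing & Verification
The test suite can be run with Vitest (`npx vitest run`).
Closes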
Please review the implementation and integration. Feedback welcome!