Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

text-splitters: bug fix for CharacterTextSplitter replacing original separator with regex pattern when merging #23519

Open
wants to merge 14 commits into
base: master
Choose a base branch
from

Conversation

wulifu2hao
Copy link
Contributor

@wulifu2hao wulifu2hao commented Jun 26, 2024

  • Description:
    bug: currently when using CharacterTextSplitter with regex pattern as separator (e.g. "\s") and keep_separator=False to split a text like "Hello world", it will end up with "Hello\sworld" (assume chunk_size=15 so after the splitting it will need to merge it back)

    changes:

    1. "_split_text" function will use the actual original separators to merge split strings, instead of the regex pattern
    2. "_split_text_with_regex" function will return not just the split strings but also the original separators
    3. "_merge_splits" function of TextSplitter in base.py will pick the right separators to join two split strings instead of always using one separator (which could be the regex pattern)
    4. "_join_docs" function of TextSplitter in base.py, similarly, allows passing in a list of separators the facilitate the joining of strings required in the "_merge_splits" function

Issue: #23394

If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Jun 26, 2024
Copy link

vercel bot commented Jun 26, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
langchain ⬜️ Ignored (Inspect) Visit Preview Sep 21, 2024 0:08am

@dosubot dosubot bot added Ɑ: text splitters Related to text splitters package 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Jun 26, 2024
@efriis efriis added the partner label Jun 26, 2024
@efriis efriis self-assigned this Jun 26, 2024
@efriis efriis assigned baskaryan and unassigned efriis Jul 2, 2024
@wulifu2hao
Copy link
Contributor Author

@baskaryan pls do help review when you have time , thanks :)

efriis
efriis previously approved these changes Sep 19, 2024
@dosubot dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Sep 19, 2024
@efriis efriis dismissed their stale review September 19, 2024 04:55

new tests failing

@efriis
Copy link
Member

efriis commented Sep 19, 2024

thoughts on these new tests?

@efriis efriis assigned efriis and unassigned baskaryan Sep 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature lgtm PR looks good. Use to confirm that a PR is ready for merging. partner size:L This PR changes 100-499 lines, ignoring generated files. Ɑ: text splitters Related to text splitters package
Projects
Status: In review
Development

Successfully merging this pull request may close these issues.

3 participants