Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking Issue for CharIndices::offset function #83871

Closed
1 of 3 tasks
TrolledWoods opened this issue Apr 5, 2021 · 42 comments · Fixed by #129276
Closed
1 of 3 tasks

Tracking Issue for CharIndices::offset function #83871

TrolledWoods opened this issue Apr 5, 2021 · 42 comments · Fixed by #129276
Labels
C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. finished-final-comment-period The final comment period is finished for this PR / Issue. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@TrolledWoods
Copy link
Contributor

TrolledWoods commented Apr 5, 2021

Feature gate: #![feature(char_indices_offset)]

This is a tracking issue for the function CharIndices::offset. It returns the byte position of the next character, or the length of the underlying string if there are no more characters. This is useful for getting ranges over strings you're iterating over.

Public API

let mut chars = "a楽".char_indices();

assert_eq!(chars.offset(), 0);
assert_eq!(chars.next(), Some((0, 'a')));

assert_eq!(chars.offset(), 1);
assert_eq!(chars.next(), Some((1, '楽')));

assert_eq!(chars.offset(), 4);
assert_eq!(chars.next(), None);

Steps / History

Unresolved Questions

@TrolledWoods TrolledWoods added C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Apr 5, 2021
@GilRtr
Copy link

GilRtr commented Jul 4, 2021

Is there anything holding this back?

@pmetzger
Copy link

I just realized I desperately wanted this today. As it stands it's rather unpleasant assembling a range of characters in a string because you can't easily get the byte offset of the next character to be the end of the range, so then constructing a slice when what you have now is the last character is unpleasant.

@TrolledWoods
Copy link
Contributor Author

I realized I sort of forgot about this, sorry about that, is there anything I need to do to move this along in stabilization?

@pmetzger
Copy link

No idea but I'd really love to see it stabilized.

@Jay-Madden
Copy link

Agreed, this would be very useful

@cogsandsquigs
Copy link

Yeah, it'd be really nice to have this finally released in the stable builds. Any progress on this?

@pmetzger
Copy link

pmetzger commented Oct 1, 2022

@TrolledWoods Maybe you can ask around on the Rust dev fora about how to progress this?

@jdahlstrom
Copy link

jdahlstrom commented Oct 4, 2022

As I learned a while back on u.r.l.o, there's also char::len_utf8 that can be used to construct a slice representing (or ending at) the current char.

@m-ou-se
Copy link
Member

m-ou-se commented Nov 26, 2022

@rfcbot merge

@rfcbot
Copy link

rfcbot commented Nov 26, 2022

Team member @m-ou-se has proposed to merge this. The next step is review by the rest of the tagged team members:

Concerns:

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@rfcbot rfcbot added proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. labels Nov 26, 2022
@Amanieu
Copy link
Member

Amanieu commented Dec 6, 2022

In my experience, additional methods on iterator types tend to be hard/unergonomic to use. In this case it would be better to return the offset as part of a tuple in the iterator's item type.

@rfcbot concern bad API

@BurntSushi
Copy link
Member

Can someone show a small but real world program where this method is used?

Also @Amanieu, I think that suggestion would require building a new iterator type just for this. Probably an offset() method, if the use case makes sense, is the less-bad option.

@Amanieu
Copy link
Member

Amanieu commented Dec 29, 2022

Also @Amanieu, I think that suggestion would require building a new iterator type just for this. Probably an offset() method, if the use case makes sense, is the less-bad option.

I disagree, methods on iterators are very difficult to use ergonomically. Consider Iterator::peekable: it provides useful functionality, but is such a pain to use that I actively avoid it wherever I can. In that specific case the only reason we didn't go with the approach of returning a new iterator that yields (T, Option<&T>) is that the lifetime could not be expressed due to the lack of GATs.

If character offsets are really needed then they should be returned through a proper iterator interface that is compatible with all the existing iterator combinators.

@pmetzger
Copy link

pmetzger commented May 3, 2023

I have only a big real world program where I needed it, not a small one. I had a hand-written lexical analyzer that needed it. However, in the intervening years, I replaced the lexical analyzer with one generated by re2c now that it produces rust code, so I don't have an example handy any more.

@solson
Copy link
Member

solson commented Jun 26, 2023

In this case it would be better to return the offset as part of a tuple in the iterator's item type.

@Amanieu You seem to have gotten a bit confused here - your suggestion is precisely the CharIndices iterator that already exists. Its whole deal is returning the offset as part of a tuple in its item type. This proposal is to add a method to CharIndices to access the current position (which it always knows) before advancing, or when at the end.

The implementation PR's description describes a few of the problems I just ran into while trying to use CharIndices in a lexer.


Can someone show a small but real world program where this method is used?

@BurntSushi Here's an excerpt of something I was just working on, followed by how it looks with CharIndices::offset. Realistically, though, the cleanest approach while this method is unstable is to write a near copy of CharIndices with this method added, rather than working around the edge cases with peeking and manual end tracking.

struct Lexer<'a> {
    input: Peekable<CharIndices<'a>>,
    end: usize, // Initialized to the length of the input.
}

impl Lexer<'a> {
    fn lex(&mut self) -> Token {
        let Some((start, c)) = self.input.next() else {
            return Token {
                kind: Eof,
                start: self.end,
                end: self.end,
            }
        };

        let kind = match c {
            ...
        };

        let end = self.input.peek().map(|(pos, _)| pos).unwrap_or(self.end);
        Token { kind, start, end }
    }
}
struct Lexer<'a> {
    input: CharIndices<'a>,
}

impl Lexer<'a> {
    fn lex(&mut self) -> Token {
        let Some((start, c)) = self.input.next() else {
            return Token {
                kind: Eof,
                start: self.input.offset(),
                end: self.input.offset(),
            }
        };

        let kind = match c {
            ...
        };

        let end = self.input.offset();
        Token { kind, start, end }
    }
}

@Amanieu
Copy link
Member

Amanieu commented Jun 26, 2023

Thanks, this gives a better idea of how this is useful in practice. I believe this API is good, but I don't like the naming: offset isn't very clear (what offset?). Perhaps next_offset or next_char_offset would be clearer?

@jdahlstrom
Copy link

offset isn't very clear (what offset?). Perhaps next_offset or next_char_offset would be clearer?

Indeed, this baffled me a bit at first as well. It's supposed to be the index that you get next (with the difference that it retuns one-past-the-end when next would return None).

Furthermore, "offset" is a word alien to the current public API; the type and method names use "indices" and the docs additionally talk about "position". If "offset" is just a synonym to "index" or "position" then one of those words should be used instead, preferably "index" because that's what the API already uses. So next_index or next_char_index?

@solson
Copy link
Member

solson commented Jun 26, 2023

Personally, I always think of it as "the current position". I've written near copies of CharIndices in the past and this field would always be named something like position or pos. But it would likely make sense to use "index" as mentioned because of the type name and the phrase "byte index" used in docs.

I'm ambivalent on attaching next_ to the name. It's accurate to think of it simply as the current position of the iterator. The reason .next() returns this value (if not at the end) is because the next char always starts at the current position (but extends beyond it, if multibyte).

In either case it's one of those things that is really straightforward if you go-to-definition into the stdlib source and a bit wordier to explain in prose...

@andrewhickman
Copy link
Contributor

Another possible name is peek or peek_index for analogy with the Peekable iterator.

@eternaleye
Copy link
Contributor

Another possible name is peek or peek_index for analogy with the Peekable iterator.

It's still valid at the end, though, which makes that analogy misleading.

@m-ou-se
Copy link
Member

m-ou-se commented Jun 26, 2023

Thanks, this gives a better idea of how this is useful in practice. I believe this API is good, but I don't like the naming: offset isn't very clear (what offset?). Perhaps next_offset or next_char_offset would be clearer?

The offset of the iterator into the string it iterates over. 'next offset' or 'next char offset' wouldn't make sense for the offset at the end (when .as_str() is empty), when there is no next item.

the type and method names use "indices" and the docs additionally talk about "position". If "offset" is just a synonym to "index" or "position" then one of those words should be used instead, preferably "index" because that's what the API already uses.

Calling this index() or position() could work, although then it's less clear we're talking about a byte offset rather than counting chars (like .chars().position()). (But perhaps that ship has sailed by calling this type CharIndices instead of CharOffsets.)

@Amanieu
Copy link
Member

Amanieu commented Jun 26, 2023

Thinking about this some more, it seems that what you really want here is a CharRanges iterator that returns a Range<usize> for each character. Would that work better for your use case? As I've said before, I'm not really a fan of methods on iterators since they don't work well with typical use cases for them.

@m-ou-se
Copy link
Member

m-ou-se commented Jun 26, 2023

I recently ran into nearly the same use case as mentioned above. (I currently use Chars::as_str() and use .as_ptr() to do pointer math to get the offset.)

In my use case, I don't just want CharRanges, because I also use the offset for error reporting to know where we are in the input, without having to store the location separately when the iterator already knows anyway.

And without an offset/position/index method, I'd still have to special case the final location where the iterator returns None but I still needs it offset in the input (at the end).

@m-ou-se
Copy link
Member

m-ou-se commented Jun 26, 2023

As I've said before, I'm not really a fan of methods on iterators since they don't work well with typical use cases for them.

I use Chars::as_str() quite often to get the unconsumed part of the string back for later use. I don't think I've ever had any issues with its API.

@solson
Copy link
Member

solson commented Jun 26, 2023

Some similar code only stores the start index for each token, since you can reconstruct ranges later, so ranges aren't necessary. Ranges also are not sufficient because they can't give you the iterator's position when it's at the end, for an Eof token. (It would unhelpfully give you that value slightly too early, with the last char, before you know you're about to hit Eof.)

On a more general note, I find that additional methods on iterators often make the difference between getting frustrated and doing things by hand instead, or feeling like I'm using a beautiful API that's considered the gaps and edge cases. I'm suspicious of iterators that only support next - they're often missing something important.

From another angle if I just "follow the data" this method exposes something CharIndices must know, and it comes down to whether it needlessly obscures it from me or not.

@BurntSushi
Copy link
Member

I agree. While I think "consider a new iterator" is a good idea in general, if there's a niche use case that can be served by adding a natural method to an existing iterator type, then that seems like a win to me.

@jdahlstrom
Copy link

I'm ambivalent on attaching next_ to the name. It's accurate to think of it simply as the current position of the iterator. The reason .next() returns this value (if not at the end) is because the next char always starts at the current position (but extends beyond it, if multibyte).

Yes, it makes sense from the implementer's point of view, but IMO not the user's. And API naming should of course be user-centric. The fact that next() uses the current state is just an implementation detail. From the user's POV the state change happens when next is called, not before.

@m-ou-se
Copy link
Member

m-ou-se commented Jun 26, 2023

It'd be an issue if we have a .next() method that modifies the state to advance the iterator and also have a .next_offset() method that does not modify the state and just returns the current position.

@WhyNotHugo
Copy link

I agree that this method is hard to use "in the usual places where you'd use an iterator". This isn't really a problem: this method it's not meant to be used in such cases; it's meant to be used in logic that's dealing specifically with the CharIndeces type (typically code that's parsing a string).

This type already has an as_str method, which exposes also exposes part of the the underlying data. as_str is also "hard to use" in the usual contexts where you'd use a generic iterator. And it also isn't intended to be used in such places.

From what I understand, the current major holdback is the naming. Perhaps offset_of_next is unambiguous enough?

@pmetzger
Copy link

Delaying a useful API for long periods because of difficulty finding the right name seems unfortunate. Perhaps someone could just post a poll with a few suggestions and be done with it?

@TrolledWoods
Copy link
Contributor Author

Thinking back on this I agree that offset is not a great name. Since CharIndices already uses "Indices" as the term, maybe index is a decent name? I would think that using next as a prefix/suffix might imply mutation, so if we're going to add something extra maybe peek or current might be reasonable? Though as the as_str method already returns the remaining string and not the whole string, I think it's fairly clear what it does even without any extra attachments.

@pmetzger
Copy link

One can ask what color to paint the bikeshed endlessly. (This doesn't mean having a good name isn't important, just that the conversation may not terminate, having an okayish name and having the thing available is better than waiting forever for perfection, and someone is going to have to make a decision.)

@sffc
Copy link

sffc commented Feb 3, 2024

A function with the following signature would be useful to me in the code I'm currently writing:

impl<'a> CharIndices<'a> {
    pub fn current(&self) -> (usize, Option<char>) { ... }
}

However, if the implementation of this function would be no better than char_indices.as_str().get(char_indices.offset()), then it seems fine to return the offset by itself, in which case offset() seems like a fine function name.

Currently I am using Peekable<CharIndices> which is just extremely wasteful and unergonomic.

@apt1002
Copy link

apt1002 commented May 23, 2024

A work-around for finding the current byte index is to subtract chars.as_str().len() from the length of the original string. This idea has the following advantages:

  • It avoids calling offset().
  • It avoids parsing the UTF8 bytes twice.
  • It avoids a special case for the end of the string.
  • It's clearer what you mean when iterating through a substring.
  • It even works with a Chars (as opposed to a CharIndices).

The main disadvantage is that you have to separately record the length of the original string.

PS I agree with calling it index().

@joshtriplett
Copy link
Member

I do think this needs a better name, something that makes it clear it's the index of the next character.

@rfcbot concern better-name

But I do think we should have it:

@rfcbot reviewed

@dtolnay
Copy link
Member

dtolnay commented Aug 6, 2024

IMO offset() is fine as is, and I don't think a better name has been proposed so far in this issue. I agree with the following 3 comments on why putting next into the name would not be better:

Namely:

  • The offset it returns can be conceived as the current offset of the iterator. The iterator has already been mutated as part of advancing past the character that it most recently returned. The offset of the character already returned is the offset of the previous character from the point of view of the iterator. The caller already has the offset of the previous character because they received it in the iterator's most recent Item. That item is no longer what the iterator refers to, so I don't see ambiguity about whether the iterator's offset is still that previous location.

  • There is not necessarily any next character in the iterator. Offset doesn't return the offset of a next character, it returns the current position of the iterator.

  • Having one next that advances the iterator and returns tuple of offset and char, and a different …next… that returns offset without char, but does not advance the iterator, is misleading.

The difference in mental model between "previous"/"current" vs "previous"/"next" is similar to something we debated previously in rust-lang/rfcs#2570 (comment). For linked list cursor, it did make sense to me that the cursor points in between linked list nodes, so "before"/"after" was a better way to design the API than "before"/"current".

But notice that this is meaningfully different than the char offsets use case! We redesigned linked list cursors from pointing at a node, to pointing at the boundary between two nodes. But offsets already refer to the boundary between two chars! The char range 0..1 refers to a range from the boundary before the initial byte, to the boundary between the initial byte and the next one. When a char indices iterator is at a particular char boundary, there is no "previous"/"next" offset. There is a previous/next char, and there is a current offset.

@Amanieu
Copy link
Member

Amanieu commented Aug 6, 2024

David's reasoning above resolves my concern.

@rfcbot resolve bad API

@joshtriplett
Copy link
Member

Alright, I'm convinced by the description that it's the offset of the current character after having called next. I think that's going to need some clear documentation showing a loop and calling attention to how after next the iterator is already pointing after the returned character, but as a name it works.

@rfcbot resolved better-name

@rfcbot rfcbot added the final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. label Aug 9, 2024
@rfcbot
Copy link

rfcbot commented Aug 9, 2024

🔔 This is now entering its final comment period, as per the review above. 🔔

@rfcbot rfcbot removed the proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. label Aug 9, 2024
@rfcbot rfcbot added finished-final-comment-period The final comment period is finished for this PR / Issue. to-announce Announce this issue on triage meeting and removed final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. labels Aug 19, 2024
@rfcbot
Copy link

rfcbot commented Aug 19, 2024

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

This will be merged soon.

@apiraino apiraino removed the to-announce Announce this issue on triage meeting label Aug 22, 2024
tgross35 added a commit to tgross35/rust that referenced this issue Aug 23, 2024
…ffset, r=Amanieu

Stabilize feature `char_indices_offset`

Stabilized API:

```rust
impl CharIndices<'_> {
    pub fn offset(&self) -> usize;
}
```

Tracking issue: rust-lang#83871

Closes rust-lang#83871

I also attempted to improved the documentation to make it more clear that it returns the offset of the character that will be returned by the next call to `next()`.
tgross35 added a commit to tgross35/rust that referenced this issue Aug 23, 2024
…ffset, r=Amanieu

Stabilize feature `char_indices_offset`

Stabilized API:

```rust
impl CharIndices<'_> {
    pub fn offset(&self) -> usize;
}
```

Tracking issue: rust-lang#83871

Closes rust-lang#83871

I also attempted to improved the documentation to make it more clear that it returns the offset of the character that will be returned by the next call to `next()`.
GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this issue Aug 23, 2024
…ffset, r=Amanieu

Stabilize feature `char_indices_offset`

Stabilized API:

```rust
impl CharIndices<'_> {
    pub fn offset(&self) -> usize;
}
```

Tracking issue: rust-lang#83871

Closes rust-lang#83871

I also attempted to improved the documentation to make it more clear that it returns the offset of the character that will be returned by the next call to `next()`.
@bors bors closed this as completed in 26672c9 Aug 23, 2024
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Aug 23, 2024
Rollup merge of rust-lang#129276 - eduardosm:stabilize-char_indices_offset, r=Amanieu

Stabilize feature `char_indices_offset`

Stabilized API:

```rust
impl CharIndices<'_> {
    pub fn offset(&self) -> usize;
}
```

Tracking issue: rust-lang#83871

Closes rust-lang#83871

I also attempted to improved the documentation to make it more clear that it returns the offset of the character that will be returned by the next call to `next()`.
@pmetzger
Copy link

Hooray!

@Jay-Madden
Copy link

Excellent

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. finished-final-comment-period The final comment period is finished for this PR / Issue. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.