Skip to content

[Feature #21943] Add StringScanner#integer_at#193

Open
jinroq wants to merge 17 commits intoruby:masterfrom
jinroq:add_integer_at
Open

[Feature #21943] Add StringScanner#integer_at#193
jinroq wants to merge 17 commits intoruby:masterfrom
jinroq:add_integer_at

Conversation

@jinroq
Copy link

@jinroq jinroq commented Mar 6, 2026

The specification for MatchData#integer_at has been defined here. StringScanner#integer_at follows this specification.

see: https://bugs.ruby-lang.org/issues/21932#note-6, https://bugs.ruby-lang.org/issues/21932#note-7
see: https://bugs.ruby-lang.org/issues/21943

return new_ary;
}

#ifdef HAVE_RB_INT_PARSE_CSTR
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on:
- push
- pull_request
- workflow_dispatch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- workflow_dispatch

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

9e0d504 fixed it.

have_func("onig_region_memsize(NULL)")
have_func("rb_reg_onig_match", "ruby/re.h")
have_func("rb_deprecate_constant")
have_func("rb_int_parse_cstr")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strscan requires Ruby 2.4 or later.
What is the minimum Ruby version to use rb_int_parse_cstr()?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rb_int_parse_cstr has been available since Ruby 2.5.0. In Ruby 2.4, it is detected using have_func, and if it is not available, it falls back to rb_str_to_inum.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Can we use rb_cstr_parse_inum() with Ruby 2.4?

#ifdef HAVE_RB_INT_PARSE_CSTR
VALUE rb_int_parse_cstr(const char *str, ssize_t len, char **endp,
size_t *ndigits, int base, int flags);
#define RB_INT_PARSE_SIGN 0x01
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If ruby/ruby#16322 is merged, this will report a duplicated definition warning.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1630df8 fixed it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we omit rb_int_parse_cstr() prototype and RB_INT_PARSE_SIGN definition entirely when Ruby provides them?

rb_define_method(StringScanner, "size", strscan_size, 0);
rb_define_method(StringScanner, "captures", strscan_captures, 0);
rb_define_method(StringScanner, "values_at", strscan_values_at, -1);
rb_define_method(StringScanner, "integer_at", strscan_integer_at, 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
rb_define_method(StringScanner, "integer_at", strscan_integer_at, 1);
rb_define_method(StringScanner, "integer_at", strscan_integer_at, 1);

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1630df8 fixed it.

@jinroq jinroq requested a review from kou March 6, 2026 16:43
@eregon
Copy link
Member

eregon commented Mar 6, 2026

If https://bugs.ruby-lang.org/issues/21932 gets merged it seems cleaner to reuse that than to reimplement it.

@kou Do you know why StringScanner has an interface very similar to MatchData but yet doesn't expose the MatchData object?
In fact on TruffleRuby StringScanner uses MatchData objects internally.

I think it would be better to expose the MatchData object than keep defining methods similar to MatchData but with slightly different names. I think it makes it harder to learn the StringScanner API (i.e., it would be smaller and easier to approach if it didn't duplicate many MatchData methods).

Among all StringScanner instance methods:

  <<, [], beginning_of_line?, captures, charpos, check, check_until,
  concat, eos?, exist?, fixed_anchor?, get_byte, getch, initialize_copy,
  inspect, match?, matched, matched?, matched_size, named_captures,
  peek, peek_byte, pointer, pointer=, pos, pos=, post_match, pre_match,
  reset, rest, rest_size, scan, scan_byte, scan_integer, scan_until,
  size, skip, skip_until, string, string=, terminate, unscan, values_at

These are just doing the same on the MatchData:

  [], captures
  matched, matched?,
  matched_size (same as `byteend(0) - bytebegin(0)`, named_captures
  post_match, pre_match
  size, string, values_at

And these are MatchData methods which StringScanner doesn't have:

  begin, bytebegin, byteend, byteoffset, deconstruct,
  deconstruct_keys, end, length, match,
  match_length, names, offset,
  regexp, to_a

@eregon
Copy link
Member

eregon commented Mar 6, 2026

If https://bugs.ruby-lang.org/issues/21932 gets merged it seems cleaner to reuse that than to reimplement it.

Mmh, but that likely wouldn't achieve as good a speedup as the current approach in the context of https://bugs.ruby-lang.org/issues/21943 as it would mean an extra MatchData allocation.
The strscan extension seems to save the matched captures but not a MatchData object:

/* the regexp register; legal only when MATCHED_P(s) */
struct re_registers regs;

BTW the presense of StringScanner.must_C_version makes me wonder, was StringScanner once written in Ruby?

def test_integer_at_large_number
huge = '9' * 100
s = create_string_scanner(huge)
s.scan(/(#{huge})/)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
s.scan(/(#{huge})/)
s.scan(/(\d+)/)

end

def test_integer_at_leading_zeros
s = create_string_scanner("007")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

007 is not a good data for this because 007 is valid both for base=10 and base=8. Do we need this test?

Comment on lines +1070 to +1078
# "09" would be invalid in octal, but integer_at always uses base 10
s = create_string_scanner("09")
s.scan(/(\d+)/)
assert_equal(9, s.integer_at(1))

# "010" is 8 in octal (Integer("010")), but 10 in base 10
s = create_string_scanner("010")
s.scan(/(\d+)/)
assert_equal(10, s.integer_at(1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need both of them? Can they to catch any different problem?

have_func("onig_region_memsize(NULL)")
have_func("rb_reg_onig_match", "ruby/re.h")
have_func("rb_deprecate_constant")
have_func("rb_int_parse_cstr")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Can we use rb_cstr_parse_inum() with Ruby 2.4?

Comment on lines +1944 to +1958
long j = 0;
if (ptr[0] == '-' || ptr[0] == '+') j = 1;
if (j >= len) {
rb_raise(rb_eArgError,
"non-digit character in capture: %.*s",
(int)len, ptr);
}
for (; j < len; j++) {
if (ptr[j] < '0' || ptr[j] > '9') {
rb_raise(rb_eArgError,
"non-digit character in capture: %.*s",
(int)len, ptr);
}
}
return rb_str_to_inum(rb_str_new(ptr, len), 10, 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this accept 1_234?
See also: #192 (comment)

Comment on lines +1894 to +1913
GET_SCANNER(self, p);
if (! MATCHED_P(p)) return Qnil;

switch (TYPE(idx)) {
case T_SYMBOL:
idx = rb_sym2str(idx);
/* fall through */
case T_STRING:
RSTRING_GETMEM(idx, name, i);
i = name_to_backref_number(&(p->regs), p->regex, name, name + i, rb_enc_get(idx));
break;
default:
i = NUM2LONG(idx);
}

if (i < 0)
i += p->regs.num_regs;
if (i < 0) return Qnil;
if (i >= p->regs.num_regs) return Qnil;
if (p->regs.beg[i] == -1) return Qnil;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You copied this from strscan_aref(), right? Can we share common code with strscan_aref() and strsacn_integer_at()?

end = adjust_register_position(p, p->regs.end[i]);
len = end - beg;

if (len <= 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use == 0 here?
len may be negative?

len = end - beg;

if (len <= 0) {
rb_raise(rb_eArgError, "empty capture for integer conversion");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
rb_raise(rb_eArgError, "empty capture for integer conversion");
rb_raise(rb_eArgError, "specified capture is empty: %"PRIsVALUE, idx);


if (endp != ptr + len) {
rb_raise(rb_eArgError,
"non-digit character in capture: %.*s",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any other reason on failure?


if (endp != ptr + len) {
rb_raise(rb_eArgError,
"non-digit character in capture: %.*s",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the target string has a trailing space, it's difficult to find a problem. How about surround the target string something like the following?

Suggested change
"non-digit character in capture: %.*s",
"non-digit character in capture: <%.*s>",

@kou
Copy link
Member

kou commented Mar 7, 2026

Do you know why StringScanner has an interface very similar to MatchData but yet doesn't expose the MatchData object?

No. But if we create a MatchData, it causes performance overhead, right? (StringScanner doesn't use MatchData internally.) It'll reduce a merit of this optimization.

BTW the presense of StringScanner.must_C_version makes me wonder, was StringScanner once written in Ruby?

Yes. But it's before StringScanner was imported to Ruby itself.

FYI: https://i.loveruby.net/ja/projects/strscan/doc/ChangeLog.html (Japanese)

@eregon
Copy link
Member

eregon commented Mar 7, 2026

No. But if we create a MatchData, it causes performance overhead, right?

Yeah, and I guess that's the main reason StringScanner directly exposes MatchData-like methods.
StringScanner could still have a new method to return a MatchData, so MatchData methods which are not mirrored in StringScanner could be used.

FYI: https://i.loveruby.net/ja/projects/strscan/doc/ChangeLog.html (Japanese)

Interesting, thank you for the link.

@kou
Copy link
Member

kou commented Mar 7, 2026

StringScanner could still have a new method to return a MatchData, so MatchData methods which are not mirrored in StringScanner could be used.

Yes. But it's out-of-scope of this.

jinroq added 2 commits March 18, 2026 22:04
Add a method that returns a captured substring as an Integer,
following String#to_i(base) semantics. Accepts an optional base
argument (default 10), Symbol/String for named capture groups,
and returns 0 for non-numeric or empty captures.

Extract resolve_capture_index helper to share index resolution
logic between StringScanner#[] and StringScanner#integer_at.
When base is 10 and the capture contains only digits (with optional
sign) that fit in long, parse directly and return via LONG2NUM.
This covers the Date._strptime use case without temporary String
creation. All other cases fall through to rb_str_to_inum.
Provide a pure Ruby implementation using self[index].to_i(base)
for JRuby and other non-CRuby platforms. The C extension version
takes precedence when available.
@jinroq
Copy link
Author

jinroq commented Mar 22, 2026

@kou
This comment finalizes the specification for MatchData#integer_at(n). StringScanner#integer_at(n) has also been updated to comply with this specification. Please review it.

@jinroq jinroq requested a review from kou March 22, 2026 15:42
* This covers the Date._strptime use case. */
if (base == 10) {
long j = 0;
int negative = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use bool instead of int for boolean?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

485f27c fixed it.

Comment on lines +15 to +24
unless method_defined?(:integer_at)
# Fallback implementation for platforms without C extension (e.g. JRuby).
# Equivalent to self[index].to_i(base).
def integer_at(index, base = 10)
str = self[index]
return nil if str.nil?
str.to_i(base)
end
end

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't split #scan_integer documentation and implementation.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ef6281d fixed it.

}
}
if (all_digits) {
if (digit_count <= (sizeof(long) >= 8 ? 18 : 9)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • It seems that 9223372036854775807 (max int64_t value) isn't optimized. Is it intentional?
  • It seems that 00000000000000000001 isn't optimized. Is it intentional?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was not my intention. 6052143 fixed it.


def test_integer_at_index_zero
s = create_string_scanner("42 abc")
s.scan(/(\d+)/)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need (...) here, right?

Suggested change
s.scan(/(\d+)/)
s.scan(/\d+/)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f64fdd8 fixed it.

assert_equal({"number" => "1"}, scan.named_captures)
end

def test_integer_at
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use test_integer_at_XXX like other methods?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6a55781 fixed it.

Comment on lines +1121 to +1125
def test_integer_at_named_capture_undefined
s = create_string_scanner("2024-06-15")
s.scan(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/)
assert_raise(IndexError) { s.integer_at(:unknown) }
assert_raise(IndexError) { s.integer_at("unknown") }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use unknown for both of test name and test value?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4d3583f fixed it.

Comment on lines +1144 to +1149
def test_integer_at_underscore
# follows String#to_i: underscores are accepted
s = create_string_scanner("1_0_0")
s.scan(/(\d+(?:_\d+)*)/)
assert_equal(100, s.integer_at(1))
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we optimize this case too?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e686711 fixed it.

assert_equal(999999999999999999, s.integer_at(1))

# 19 digits: exceeds long on 64-bit, becomes bignum
s = create_string_scanner("9999999999999999999")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, we should use border values for testing. If "9" * 18 is the largest optimized value, we should use "9" * 18" and "1" * 19" (the next value of "9" * 18").

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0f2ad2a fixed it.


/*
* call-seq:
* integer_at(index, base = 10) -> integer or nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use specifier not index like we did for []?

strscan/ext/strscan/strscan.c

Lines 1625 to 1695 in 4243751

/*
*
* :markup: markdown
* :include: strscan/link_refs.txt
*
* call-seq:
* [](specifier) -> substring or nil
*
* Returns a captured substring or `nil`;
* see [Captured Match Values][13].
*
* When there are captures:
*
* ```rb
* scanner = StringScanner.new('Fri Dec 12 1975 14:39')
* scanner.scan(/(?<wday>\w+) (?<month>\w+) (?<day>\d+) /)
* ```
*
* - `specifier` zero: returns the entire matched substring:
*
* ```rb
* scanner[0] # => "Fri Dec 12 "
* scanner.pre_match # => ""
* scanner.post_match # => "1975 14:39"
* ```
*
* - `specifier` positive integer. returns the `n`th capture, or `nil` if out of range:
*
* ```rb
* scanner[1] # => "Fri"
* scanner[2] # => "Dec"
* scanner[3] # => "12"
* scanner[4] # => nil
* ```
*
* - `specifier` negative integer. counts backward from the last subgroup:
*
* ```rb
* scanner[-1] # => "12"
* scanner[-4] # => "Fri Dec 12 "
* scanner[-5] # => nil
* ```
*
* - `specifier` symbol or string. returns the named subgroup, or `nil` if no such:
*
* ```rb
* scanner[:wday] # => "Fri"
* scanner['wday'] # => "Fri"
* scanner[:month] # => "Dec"
* scanner[:day] # => "12"
* scanner[:nope] # => nil
* ```
*
* When there are no captures, only `[0]` returns non-`nil`:
*
* ```rb
* scanner = StringScanner.new('foobarbaz')
* scanner.exist?(/bar/)
* scanner[0] # => "bar"
* scanner[1] # => nil
* ```
*
* For a failed match, even `[0]` returns `nil`:
*
* ```rb
* scanner.scan(/nope/) # => nil
* scanner[0] # => nil
* scanner[1] # => nil
* ```
*
*/

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

433fd87 fixed it.

VALUE idx, vbase;
int base = 10;

rb_scan_args(argc, argv, "11", &idx, &vbase);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use specifier not idx?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

433fd87 fixed it.

jinroq added 12 commits March 23, 2026 20:48
Skip leading zeros to compute effective digit count, allowing values like "00000000000000000001" to use the fast path. Add overflow-checked parsing for 19-digit values so LONG_MAX fits in the fast path while LONG_MAX+1 correctly falls through to rb_str_to_inum.
Remove nested capture group and check group 3 directly for nil.
Non-digit behavior is already covered by test_integer_at_non_digit and index 0 is covered by test_integer_at_index_zero.
Extend base-10 fast path to parse underscore-separated digits(e.g. "1_000_000") without temporary String allocation, following String#to_i underscore rules.
jinroq added 2 commits March 24, 2026 20:20
Replace "9" * 19 with "1" * 19 as the correct next value after "9" * 18, and add LONG_MIN - 1 test to pair with LONG_MIN.
Rename the parameter in RDoc, C implementation, and Ruby fallback to match the naming convention used in StringScanner#[].
@jinroq jinroq requested a review from kou March 24, 2026 12:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants