[Feature #21943] Add StringScanner#integer_at #193
jinroq wants to merge 17 commits into ruby:master
ext/strscan/strscan.c
Outdated
| return new_ary; | ||
| } | ||
|
|
||
| #ifdef HAVE_RB_INT_PARSE_CSTR |
.github/workflows/ci.yml
Outdated
| on: | ||
| - push | ||
| - pull_request | ||
| - workflow_dispatch |
ext/strscan/extconf.rb
Outdated
| have_func("onig_region_memsize(NULL)") | ||
| have_func("rb_reg_onig_match", "ruby/re.h") | ||
| have_func("rb_deprecate_constant") | ||
| have_func("rb_int_parse_cstr") |
strscan requires Ruby 2.4 or later.
What is the minimum Ruby version to use rb_int_parse_cstr()?
rb_int_parse_cstr has been available since Ruby 2.5.0. Its availability is detected with have_func; when it is not available (as on Ruby 2.4), the code falls back to rb_str_to_inum.
OK. Can we use rb_cstr_parse_inum() with Ruby 2.4?
ext/strscan/strscan.c
Outdated
| #ifdef HAVE_RB_INT_PARSE_CSTR | ||
| VALUE rb_int_parse_cstr(const char *str, ssize_t len, char **endp, | ||
| size_t *ndigits, int base, int flags); | ||
| #define RB_INT_PARSE_SIGN 0x01 |
If ruby/ruby#16322 is merged, this will report a duplicated definition warning.
Can we omit the rb_int_parse_cstr() prototype and the RB_INT_PARSE_SIGN definition entirely when Ruby provides them?
ext/strscan/strscan.c
Outdated
| rb_define_method(StringScanner, "size", strscan_size, 0); | ||
| rb_define_method(StringScanner, "captures", strscan_captures, 0); | ||
| rb_define_method(StringScanner, "values_at", strscan_values_at, -1); | ||
| rb_define_method(StringScanner, "integer_at", strscan_integer_at, 1); |
| rb_define_method(StringScanner, "integer_at", strscan_integer_at, 1); | |
| rb_define_method(StringScanner, "integer_at", strscan_integer_at, 1); |
|
If https://bugs.ruby-lang.org/issues/21932 gets merged, it seems cleaner to reuse that than to reimplement it.
@kou Do you know why
I think it would be better to expose the MatchData object than to keep defining methods similar to MatchData's but with slightly different names. I think that makes the StringScanner API harder to learn (i.e., it would be smaller and easier to approach if it didn't duplicate many MatchData methods).
Among all StringScanner instance methods:
These just do the same thing on the MatchData:
And these are MatchData methods which StringScanner doesn't have:
Mmh, but that likely wouldn't achieve as good a speedup as the current approach in the context of https://bugs.ruby-lang.org/issues/21943, as it would mean an extra MatchData allocation.
Lines 57 to 58 in 3592c39
BTW the presence of
test/strscan/test_stringscanner.rb
Outdated
| def test_integer_at_large_number | ||
| huge = '9' * 100 | ||
| s = create_string_scanner(huge) | ||
| s.scan(/(#{huge})/) |
| s.scan(/(#{huge})/) | |
| s.scan(/(\d+)/) |
test/strscan/test_stringscanner.rb
Outdated
| end | ||
|
|
||
| def test_integer_at_leading_zeros | ||
| s = create_string_scanner("007") |
007 is not good test data for this because it is valid in both base 10 and base 8, with the same value. Do we need this test?
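A quick way to see the point about 007 is to parse it in both bases. The sketch below uses String#to_i as a stand-in, since it does not assume that integer_at is available; "017" is an illustrative contrast value that is not taken from the PR.

```ruby
# "007" is a single-digit value with no digit above 7, so decimal and
# octal parsing agree on it and the test cannot tell the bases apart.
seven_decimal = "007".to_i(10)
seven_octal   = "007".to_i(8)

# "017" (illustrative value, not from the PR) does distinguish the bases.
contrast_decimal = "017".to_i(10)
contrast_octal   = "017".to_i(8)

puts seven_decimal == seven_octal        # same result in both bases
puts contrast_decimal == contrast_octal  # different results
```

A test input only guards against an accidental base change when the two bases disagree on it.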
test/strscan/test_stringscanner.rb
Outdated
| # "09" would be invalid in octal, but integer_at always uses base 10 | ||
| s = create_string_scanner("09") | ||
| s.scan(/(\d+)/) | ||
| assert_equal(9, s.integer_at(1)) | ||
|
|
||
| # "010" is 8 in octal (Integer("010")), but 10 in base 10 | ||
| s = create_string_scanner("010") | ||
| s.scan(/(\d+)/) | ||
| assert_equal(10, s.integer_at(1)) |
Do we need both of them? Can they catch different problems?
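One way to compare the two inputs is to run them through a prefix-aware parser. The sketch below uses Kernel#Integer as that parser and String#to_i(10) as the fixed-base one; `prefix_aware` is a hypothetical helper introduced only for this illustration, and neither call is claimed to be what integer_at uses internally.

```ruby
# Kernel#Integer honors the leading-zero octal prefix; String#to_i(10) does not.
def prefix_aware(str)
  Integer(str)
rescue ArgumentError
  :error
end

nine_prefix_aware = prefix_aware("09")   # 9 is not an octal digit
ten_prefix_aware  = prefix_aware("010")  # parsed as octal
nine_base10       = "09".to_i(10)
ten_base10        = "010".to_i(10)

puts nine_prefix_aware.inspect
puts ten_prefix_aware.inspect
```

Under this assumption, "09" would surface a prefix-aware parser as an error, while "010" would surface it as a silently wrong value.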
ext/strscan/strscan.c
Outdated
| long j = 0; | ||
| if (ptr[0] == '-' || ptr[0] == '+') j = 1; | ||
| if (j >= len) { | ||
| rb_raise(rb_eArgError, | ||
| "non-digit character in capture: %.*s", | ||
| (int)len, ptr); | ||
| } | ||
| for (; j < len; j++) { | ||
| if (ptr[j] < '0' || ptr[j] > '9') { | ||
| rb_raise(rb_eArgError, | ||
| "non-digit character in capture: %.*s", | ||
| (int)len, ptr); | ||
| } | ||
| } | ||
| return rb_str_to_inum(rb_str_new(ptr, len), 10, 0); |
ext/strscan/strscan.c
Outdated
| GET_SCANNER(self, p); | ||
| if (! MATCHED_P(p)) return Qnil; | ||
|
|
||
| switch (TYPE(idx)) { | ||
| case T_SYMBOL: | ||
| idx = rb_sym2str(idx); | ||
| /* fall through */ | ||
| case T_STRING: | ||
| RSTRING_GETMEM(idx, name, i); | ||
| i = name_to_backref_number(&(p->regs), p->regex, name, name + i, rb_enc_get(idx)); | ||
| break; | ||
| default: | ||
| i = NUM2LONG(idx); | ||
| } | ||
|
|
||
| if (i < 0) | ||
| i += p->regs.num_regs; | ||
| if (i < 0) return Qnil; | ||
| if (i >= p->regs.num_regs) return Qnil; | ||
| if (p->regs.beg[i] == -1) return Qnil; |
You copied this from strscan_aref(), right? Can we share the common code between strscan_aref() and strscan_integer_at()?
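The index-resolution steps in the diff above can be sketched in Ruby as a single shared helper. The name `resolve_capture_index` appears in a later commit message of this PR, but the signature, the Hash-based name lookup, and the error message here are assumptions for illustration. The check for an unmatched group (`beg[i] == -1`) is left out.

```ruby
# Hypothetical helper; signature and name lookup are illustrative assumptions.
# group_count includes group 0 (the whole match), like num_regs in the C code.
def resolve_capture_index(specifier, group_count, name_to_index)
  index =
    case specifier
    when Symbol, String
      # Undefined names raise IndexError.
      name_to_index.fetch(specifier.to_s) do
        raise IndexError, "undefined group name reference: #{specifier}"
      end
    else
      Integer(specifier)
    end

  index += group_count if index < 0  # negative indexes count from the end
  return nil if index < 0 || index >= group_count
  index
end

names = { "year" => 1, "month" => 2, "day" => 3 }
by_name      = resolve_capture_index(:month, 4, names)
from_end     = resolve_capture_index(-1, 4, names)
out_of_range = resolve_capture_index(4, 4, names)
```

Both callers would then only differ in what they do with the resolved index: return the substring, or convert it to an Integer.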
ext/strscan/strscan.c
Outdated
| end = adjust_register_position(p, p->regs.end[i]); | ||
| len = end - beg; | ||
|
|
||
| if (len <= 0) { |
Can we use == 0 here?
Can len be negative?
ext/strscan/strscan.c
Outdated
| len = end - beg; | ||
|
|
||
| if (len <= 0) { | ||
| rb_raise(rb_eArgError, "empty capture for integer conversion"); |
| rb_raise(rb_eArgError, "empty capture for integer conversion"); | |
| rb_raise(rb_eArgError, "specified capture is empty: %"PRIsVALUE, idx); |
ext/strscan/strscan.c
Outdated
|
|
||
| if (endp != ptr + len) { | ||
| rb_raise(rb_eArgError, | ||
| "non-digit character in capture: %.*s", |
Is there any other reason for failure?
ext/strscan/strscan.c
Outdated
|
|
||
| if (endp != ptr + len) { | ||
| rb_raise(rb_eArgError, | ||
| "non-digit character in capture: %.*s", |
If the target string has a trailing space, it's difficult to spot the problem. How about surrounding the target string with delimiters, like the following?
| "non-digit character in capture: %.*s", | |
| "non-digit character in capture: <%.*s>", |
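The effect of the delimiters can be seen with a capture that ends in a space. This is a minimal Ruby sketch of the two message shapes only; the variable names are illustrative and the C code formats the message with rb_raise instead.

```ruby
capture = "42 "  # note the trailing space

plain     = "non-digit character in capture: #{capture}"
delimited = "non-digit character in capture: <#{capture}>"

puts plain      # the trailing space is invisible at the end of the line
puts delimited  # the space shows up before the closing ">"
```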
No. But if we create a
Yes. But it's before
FYI: https://i.loveruby.net/ja/projects/strscan/doc/ChangeLog.html (Japanese)
Yeah, and I guess that's the main reason StringScanner directly exposes MatchData-like methods.
Interesting, thank you for the link.
Yes. But it's out-of-scope of this.
Add a method that returns a captured substring as an Integer, following String#to_i(base) semantics. Accepts an optional base argument (default 10), Symbol/String for named capture groups, and returns 0 for non-numeric or empty captures. Extract resolve_capture_index helper to share index resolution logic between StringScanner#[] and StringScanner#integer_at.
When base is 10 and the capture contains only digits (with optional sign) that fit in long, parse directly and return via LONG2NUM. This covers the Date._strptime use case without temporary String creation. All other cases fall through to rb_str_to_inum.
Provide a pure Ruby implementation using self[index].to_i(base) for JRuby and other non-CRuby platforms. The C extension version takes precedence when available.
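The String#to_i semantics described in these commit messages can be checked with released APIs only. The sketch below calls `to_i` on captures directly rather than calling integer_at, so it runs on a strscan without the new method; the input string and group names are made up for the example.

```ruby
require "strscan"

s = StringScanner.new("year=2024 junk=abc hex=ff")
s.scan(/year=(?<year>\d+) junk=(?<junk>\w+) hex=(?<hex>\h+)(?<tz>Z)?/)

year = s[:year].to_i     # Symbol specifier, default base 10
junk = s["junk"].to_i    # non-numeric capture
hex  = s[3].to_i(16)     # numeric specifier with an explicit base
tz   = s[:tz]&.to_i      # group that did not participate in the match

puts [year, junk, hex, tz].inspect
```

The unmatched group stays nil rather than becoming 0, which matches the `return nil if str.nil?` guard in the Ruby fallback.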
@kou |
ext/strscan/strscan.c
Outdated
| * This covers the Date._strptime use case. */ | ||
| if (base == 10) { | ||
| long j = 0; | ||
| int negative = 0; |
Could you use bool instead of int for boolean?
lib/strscan/strscan.rb
Outdated
| unless method_defined?(:integer_at) | ||
| # Fallback implementation for platforms without C extension (e.g. JRuby). | ||
| # Equivalent to self[index].to_i(base). | ||
| def integer_at(index, base = 10) | ||
| str = self[index] | ||
| return nil if str.nil? | ||
| str.to_i(base) | ||
| end | ||
| end | ||
|
|
Please don't split #scan_integer documentation and implementation.
ext/strscan/strscan.c
Outdated
| } | ||
| } | ||
| if (all_digits) { | ||
| if (digit_count <= (sizeof(long) >= 8 ? 18 : 9)) { |
- It seems that 9223372036854775807 (max int64_t value) isn't optimized. Is it intentional?
- It seems that 00000000000000000001 isn't optimized. Is it intentional?
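Both observations follow from gating the fast path on the raw digit count. A minimal Ruby sketch of that gate, assuming a 64-bit long (so the limit is 18 digits); `fast_path_by_digit_count?` and the constants are hypothetical names, not identifiers from the PR.

```ruby
LONG_MAX = 2**63 - 1    # assumption: 64-bit long
MAX_FAST_DIGITS = 18    # mirrors the "sizeof(long) >= 8 ? 18 : 9" check

# Hypothetical predicate: true when a digit-count gate would take the fast path.
def fast_path_by_digit_count?(digits)
  digits.length <= MAX_FAST_DIGITS
end

max_long      = LONG_MAX.to_s           # 19 digits, yet it fits in long
leading_zeros = "00000000000000000001"  # 20 characters, value 1

puts fast_path_by_digit_count?("9" * 18)       # within the gate
puts fast_path_by_digit_count?(max_long)       # rejected although it fits
puts fast_path_by_digit_count?(leading_zeros)  # rejected although the value is tiny
```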
test/strscan/test_stringscanner.rb
Outdated
|
|
||
| def test_integer_at_index_zero | ||
| s = create_string_scanner("42 abc") | ||
| s.scan(/(\d+)/) |
We don't need (...) here, right?
| s.scan(/(\d+)/) | |
| s.scan(/\d+/) |
test/strscan/test_stringscanner.rb
Outdated
| assert_equal({"number" => "1"}, scan.named_captures) | ||
| end | ||
|
|
||
| def test_integer_at |
Could you use test_integer_at_XXX like other methods?
test/strscan/test_stringscanner.rb
Outdated
| def test_integer_at_named_capture_undefined | ||
| s = create_string_scanner("2024-06-15") | ||
| s.scan(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/) | ||
| assert_raise(IndexError) { s.integer_at(:unknown) } | ||
| assert_raise(IndexError) { s.integer_at("unknown") } |
Can we use unknown for both the test name and the test value?
| def test_integer_at_underscore | ||
| # follows String#to_i: underscores are accepted | ||
| s = create_string_scanner("1_0_0") | ||
| s.scan(/(\d+(?:_\d+)*)/) | ||
| assert_equal(100, s.integer_at(1)) | ||
| end |
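The underscore handling this test relies on comes from String#to_i. A short Ruby check of the cases around it; the inputs other than "1_0_0" are illustrative additions, not test data from the PR.

```ruby
single   = "1_0_0".to_i  # single underscores between digits are skipped
doubled  = "1__0".to_i   # parsing stops at the doubled underscore
leading  = "_10".to_i    # a leading underscore is not part of a number
trailing = "10_".to_i    # a trailing underscore is ignored

puts [single, doubled, leading, trailing].inspect
```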
test/strscan/test_stringscanner.rb
Outdated
| assert_equal(999999999999999999, s.integer_at(1)) | ||
|
|
||
| # 19 digits: exceeds long on 64-bit, becomes bignum | ||
| s = create_string_scanner("9999999999999999999") |
In general, we should use border values for testing. If "9" * 18 is the largest optimized value, we should use "9" * 18 and "1" * 19 (the next value after "9" * 18).
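To see where the candidate border values sit relative to a 64-bit long, they can be tabulated. This is a sketch under the assumption that long is 64 bits; it also lists 10**18, the arithmetic successor of "9" * 18, for comparison with the 19-digit "1" * 19 named in the comment.

```ruby
LONG_MAX = 2**63 - 1  # assumption: 64-bit long

candidates = {
  '"9" * 18'     => ("9" * 18).to_i,
  "10**18"       => 10**18,
  '"1" * 19'     => ("1" * 19).to_i,
  "LONG_MAX"     => LONG_MAX,
  "LONG_MAX + 1" => LONG_MAX + 1,
  '"9" * 19'     => ("9" * 19).to_i,
}

candidates.each do |label, value|
  digits = value.to_s.length
  puts format("%-14s %2d digits  fits in long: %s", label, digits, value <= LONG_MAX)
end
```

Some 19-digit values fit in long and some do not, so a digit-count border and an overflow border are different things to test.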
ext/strscan/strscan.c
Outdated
|
|
||
| /* | ||
| * call-seq: | ||
| * integer_at(index, base = 10) -> integer or nil |
Could you use specifier not index like we did for []?
Lines 1625 to 1695 in 4243751
ext/strscan/strscan.c
Outdated
| VALUE idx, vbase; | ||
| int base = 10; | ||
|
|
||
| rb_scan_args(argc, argv, "11", &idx, &vbase); |
Skip leading zeros to compute effective digit count, allowing values like "00000000000000000001" to use the fast path. Add overflow-checked parsing for 19-digit values so LONG_MAX fits in the fast path while LONG_MAX+1 correctly falls through to rb_str_to_inum.
Remove nested capture group and check group 3 directly for nil.
Non-digit behavior is already covered by test_integer_at_non_digit and index 0 is covered by test_integer_at_index_zero.
Extend the base-10 fast path to parse underscore-separated digits (e.g. "1_000_000") without temporary String allocation, following String#to_i underscore rules.
Replace "9" * 19 with "1" * 19 as the correct next value after "9" * 18, and add LONG_MIN - 1 test to pair with LONG_MIN.
Rename the parameter in RDoc, C implementation, and Ruby fallback to match the naming convention used in StringScanner#[].
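The overflow-checked fast path described in the commit messages above can be sketched in Ruby. This is not the PR's C code: `parse_fast` is a hypothetical name, a 64-bit long is assumed, underscores are not handled, and the sketch checks for overflow on every digit instead of counting effective digits. Returning nil stands for "fall through to rb_str_to_inum".

```ruby
LONG_MAX = 2**63 - 1  # assumption: 64-bit long

# Hypothetical fast path: returns an Integer that fits in long, or nil
# to signal that the caller should use the general (slower) conversion.
def parse_fast(str)
  negative = false
  i = 0
  if str[0] == "-" || str[0] == "+"
    negative = (str[0] == "-")
    i = 1
  end
  return nil if i >= str.length  # empty, or a sign with no digits

  # Largest allowed magnitude: LONG_MIN has one more than LONG_MAX.
  limit = negative ? LONG_MAX + 1 : LONG_MAX
  value = 0
  while i < str.length
    byte = str.getbyte(i)
    return nil unless byte >= 48 && byte <= 57  # ASCII '0'..'9'
    digit = byte - 48
    # value * 10 + digit <= limit, rearranged so it cannot overflow in C.
    return nil if value > (limit - digit) / 10
    value = value * 10 + digit
    i += 1
  end
  negative ? -value : value
end

puts parse_fast("00000000000000000001").inspect  # leading zeros cost nothing
puts parse_fast("9223372036854775807").inspect   # LONG_MAX stays on the fast path
puts parse_fast("9223372036854775808").inspect   # LONG_MAX + 1 falls through
```

Because leading zeros keep the accumulator at 0, this variant needs no separate step to skip them; a digit-counting implementation would.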
The specification for MatchData#integer_at has been defined here. StringScanner#integer_at follows this specification.
see: https://bugs.ruby-lang.org/issues/21932#note-6, https://bugs.ruby-lang.org/issues/21932#note-7
see: https://bugs.ruby-lang.org/issues/21943