Concatenation of character<> into string<> missed if in subexpression #147

Andersama · 2020-11-15T05:35:31Z

A regex like:

abcd

appears to generate string<a,b,c,d>, but

abcd.?

generates
sequence<character<a>,character<b>,character<c>,character<d>,optional<any>>
as opposed to:
sequence<string<a,b,c,d>,optional<any>>

This tripped me up while I was trying to integrate #143 with a simple check for whether string<> was the first term.
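To make the requested transformation concrete, here is a minimal C++17 sketch of merging adjacent character<> nodes into string<> anywhere in a flat sequence. The node types below are hypothetical stand-ins named after the ones in this issue, placed in a `sketch` namespace; `merge` and `merge_t` are invented for illustration and are not the library's implementation. Nested sequences and captures are not handled.

```cpp
#include <cassert>
#include <type_traits>

namespace sketch {

// Hypothetical stand-ins for the node types named in this issue.
template <auto C> struct character {};
template <auto... Cs> struct string {};
template <typename... Ts> struct sequence {};
template <typename T> struct optional {};
struct any {};

// merge<Done, Rest...>: walk Rest left to right, appending to Done.
template <typename Done, typename... Rest> struct merge;

// Nothing left to process: the accumulated sequence is the result.
template <typename... Done>
struct merge<sequence<Done...>> {
    using type = sequence<Done...>;
};

// Default case: copy the next item over unchanged.
template <typename... Done, typename Head, typename... Rest>
struct merge<sequence<Done...>, Head, Rest...> {
    using type = typename merge<sequence<Done..., Head>, Rest...>::type;
};

// Two adjacent characters: start a string.
template <typename... Done, auto A, auto B, typename... Rest>
struct merge<sequence<Done...>, character<A>, character<B>, Rest...> {
    using type = typename merge<sequence<Done...>, string<A, B>, Rest...>::type;
};

// A string followed by a character: extend the string.
template <typename... Done, auto... Cs, auto C, typename... Rest>
struct merge<sequence<Done...>, string<Cs...>, character<C>, Rest...> {
    using type = typename merge<sequence<Done...>, string<Cs..., C>, Rest...>::type;
};

template <typename... Ts>
using merge_t = typename merge<sequence<>, Ts...>::type;

// The `abcd.?` case from this issue.
using abcd_then_optional =
    merge_t<character<'a'>, character<'b'>, character<'c'>, character<'d'>, optional<any>>;

static_assert(std::is_same_v<abcd_then_optional,
                             sequence<string<'a', 'b', 'c', 'd'>, optional<any>>>);

} // namespace sketch
```

A lone character<> with no character neighbour is left as it is, so a first-term check for string<> would still need to handle that case separately.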

…ifferent items in sequence: `ab.cd` is now `sequence<string<'a','b'>,any,string<'c','d'>>` before it was: `sequence<character<'a'>,character<'b'>,any,character<'c'>,character<'d'>>`

Andersama · 2020-11-15T14:23:05Z

Haven't updated yet; just curious, does this handle strings nested further inside other sequence-like things? E.g. (((abc)def)ghi)?
Here's what I'm up to: I'm trying to combine some pattern analysis so that I could transform the above into something "StringLike". The idea is that, given a series of CharacterLike atoms and a random access iterator, we can do one bounds check and then a large unrolled series of character comparisons, none dependent on another. The ) end markers of ()'s can be interleaved with the unrolled character comparisons or left at the end, because StringLike must have matched for any of the other matches to count.

E.g. the above would eventually turn into:

auto remaining_characters = ::std::distance(current, end);
// One bounds check up front, then independent comparisons.
if (remaining_characters < 9) {
    return not_matched;
}
bool matched = *(current+0) == 'a' &&
    *(current+1) == 'b' &&
    *(current+2) == 'c' &&
    (/* capture sequence 3 ends here */ true) &&
    *(current+3) == 'd' &&
    *(current+4) == 'e' &&
    *(current+5) == 'f' &&
    (/* capture sequence 2 ends here */ true) &&
    *(current+6) == 'g' &&
    *(current+7) == 'h' &&
    *(current+8) == 'i' &&
    (/* capture sequence 1 ends here */ true);
if (!matched) {
    return not_matched;
}
// std::advance returns void, so use std::next to pass the advanced iterator.
return evaluate(begin, ::std::next(current, 9), end, captures, ...);

I did this with characters, but the comparisons could just as easily have been against a set or negated set.
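The pseudocode above can be approximated in compilable form with a C++17 fold expression. This is only a sketch under assumptions: `literal`, `equal_unrolled`, `match_literal` and `demo_abcdefghi` are hypothetical names, the iterator is assumed to be random access, and the capture end markers are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <utility>

// Hypothetical tag type carrying the characters to compare.
template <char... Cs> struct literal {};

// Expands to (current[0] == c0) && (current[1] == c1) && ...
// No bounds checks here: the caller has already checked the length.
template <typename It, char... Cs, std::size_t... Is>
constexpr bool equal_unrolled(It current, literal<Cs...>, std::index_sequence<Is...>) {
    return ((*(current + Is) == Cs) && ...);
}

// One bounds check, then the unrolled comparison.
// Advances `current` past the literal only on success.
template <typename It, char... Cs>
constexpr bool match_literal(It& current, It end, literal<Cs...> lit) {
    constexpr auto n = static_cast<std::ptrdiff_t>(sizeof...(Cs));
    if (end - current < n) {
        return false;
    }
    if (!equal_unrolled(current, lit, std::make_index_sequence<sizeof...(Cs)>{})) {
        return false;
    }
    current += n;
    return true;
}

// Usage: does `subject` start with "abcdefghi"?
inline bool demo_abcdefghi(std::string_view subject) {
    const char* current = subject.data();
    const char* end = subject.data() + subject.size();
    return match_literal(current, end,
                         literal<'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'>{});
}
```

Because the comparisons are joined with `&&`, they still short-circuit in source order; whether the compiler turns them into wider loads or a memcmp-like sequence is up to the optimizer.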

hanickadot · 2020-11-15T14:28:38Z

You should try it. As for the strings, I'm not sure: for some iterators (namely utf8_iterator) calculating the distance is costly, and unpacking *(current+N)... would be costly too.
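One way to reconcile the two points is to take the single-bounds-check path only when the iterator is random access, and otherwise check the bound while stepping. The sketch below is an illustration only: `starts_with_literal` and `is_random_access_v` are hypothetical names, and it assumes the costly iterator reports a category weaker than random access, which is not confirmed here for utf8_iterator.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <type_traits>

template <typename It>
constexpr bool is_random_access_v = std::is_base_of_v<
    std::random_access_iterator_tag,
    typename std::iterator_traits<It>::iterator_category>;

// Hypothetical helper: does [current, end) start with lit[0..n)?
template <typename It>
bool starts_with_literal(It current, It end, const char* lit, std::ptrdiff_t n) {
    if constexpr (is_random_access_v<It>) {
        // Distance is O(1): one bounds check, then indexed comparisons.
        if (end - current < n) {
            return false;
        }
        for (std::ptrdiff_t i = 0; i < n; ++i) {
            if (current[i] != lit[i]) {
                return false;
            }
        }
        return true;
    } else {
        // Distance would be O(n) here, so test for the end while stepping.
        for (std::ptrdiff_t i = 0; i < n; ++i, ++current) {
            if (current == end || *current != lit[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With this shape, `std::string` iterators take the first branch and `std::list<char>` iterators take the second, and both give the same answer.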

Andersama · 2020-11-15T14:53:52Z

Does the utf8 iterator work assuming the compressed format? If so, the bounds check could be based on the expected byte size instead... I think? And then the +'s could be increments as before, or I guess byte indices. Assuming it's all a contiguous array anyway.

I mean mid increment for the utf8 iterator it's not going to match a utf16 or utf32 encoded character.

hanickadot · 2020-11-15T16:44:45Z

I don't know what you mean by the compressed format. Currently string<S...> can be compared against all char types or even a UTF-8 codepoint iterator; internally the comparison always happens against UTF-32 code points.

Andersama · 2020-11-15T17:25:38Z

I guess the question is whether, given a utf8 string, it'd be safe to use byte indices and ignore the utf8 iterator temporarily.
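For well-formed UTF-8 the answer should be yes: every code point has exactly one encoding, and lead bytes never look like continuation bytes, so comparing bytes from a code point boundary gives the same result as comparing decoded code points. A minimal sketch, assuming both the subject and the literal are already well-formed UTF-8 and that the subject starts on a code point boundary (`utf8_starts_with` is a hypothetical helper):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string_view>

// Hypothetical helper: both arguments hold UTF-8 encoded bytes.
inline bool utf8_starts_with(std::string_view subject, std::string_view literal) {
    if (literal.empty()) {
        return true;  // also avoids calling memcmp with a possibly null pointer
    }
    if (subject.size() < literal.size()) {
        return false;  // the bounds check is in bytes, not code points
    }
    return std::memcmp(subject.data(), literal.data(), literal.size()) == 0;
}
```

If the subject may contain malformed input such as overlong encodings, a decoding iterator and a byte comparison can disagree, so that case would need its own decision.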

Andersama · 2020-12-28T17:23:02Z

@hanickadot I think I've worked out how to manage this with utf8, assuming tests pass. Probably went the longer way around since you likely have a utf8 encoding utility somewhere. I went about constructing a char8_t buffer from the string<String...> so that I could do the equivalent of the above. Should also save on some iterator++'s as well. Crossing fingers tests pass. Anyway, happy holidays!
#159
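The buffer construction described in this comment could look roughly like the following. It is not the code from #159: `utf8_length` and `encode_utf8` are hypothetical names, the code points are assumed to be valid Unicode scalar values, and `unsigned char` is used in place of char8_t so the sketch stays within C++17.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Number of UTF-8 bytes needed for one code point (valid scalar value assumed).
constexpr std::size_t utf8_length(char32_t c) {
    return c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
}

// Encodes the code points Cs... into a byte array at compile time.
template <char32_t... Cs>
constexpr auto encode_utf8() {
    constexpr std::size_t n = (utf8_length(Cs) + ... + 0);
    // One extra slot keeps the array non-empty when the pack is empty.
    const char32_t code_points[sizeof...(Cs) + 1] = {Cs...};
    std::array<unsigned char, n> out{};
    std::size_t i = 0;
    for (std::size_t k = 0; k < sizeof...(Cs); ++k) {
        const char32_t c = code_points[k];
        if (c < 0x80) {
            out[i++] = static_cast<unsigned char>(c);
        } else if (c < 0x800) {
            out[i++] = static_cast<unsigned char>(0xC0 | (c >> 6));
            out[i++] = static_cast<unsigned char>(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            out[i++] = static_cast<unsigned char>(0xE0 | (c >> 12));
            out[i++] = static_cast<unsigned char>(0x80 | ((c >> 6) & 0x3F));
            out[i++] = static_cast<unsigned char>(0x80 | (c & 0x3F));
        } else {
            out[i++] = static_cast<unsigned char>(0xF0 | (c >> 18));
            out[i++] = static_cast<unsigned char>(0x80 | ((c >> 12) & 0x3F));
            out[i++] = static_cast<unsigned char>(0x80 | ((c >> 6) & 0x3F));
            out[i++] = static_cast<unsigned char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// Example: 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes).
constexpr auto example_bytes = encode_utf8<U'a', U'\u00E9', U'\u20AC'>();
static_assert(example_bytes.size() == 6);
```

The resulting array is a contiguous byte buffer, so it can be compared against the subject with a single byte-length bounds check, as discussed earlier in the thread.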

hanickadot added the enhancement label Nov 15, 2020

hanickadot closed this as completed Nov 15, 2020

Andersama added a commit to Andersama/compile-time-regular-expressions that referenced this issue Dec 28, 2020

optimizes string matching by allowing memcmp like functionality (even…

ce3ebec

… on utf8 sequences) reference: hanickadot#147 comparison: https://compiler-explorer.com/z/Tz3KhG

Andersama mentioned this issue Dec 28, 2020

[WIP] Accelerate string matching #159

Open

Andersama added 13 more commits to Andersama/compile-time-regular-expressions that referenced this issue, all with the same message:

Dec 28, 2020: 09e933a, d6e093b, 4fb4c56, a50f0a6, c43c70a
Dec 29, 2020: fab4c73, 0f3fc52, b783686, a6ede49, dc8dcaf, f9cdaa8, 9c9d99f, 6e6f6e1
