Concatenation of character<> into string<> missed if in subexpression #147
Comments
…ifferent items in sequence: `ab.cd` is now `sequence<string<'a','b'>,any,string<'c','d'>>` before it was: `sequence<character<'a'>,character<'b'>,any,character<'c'>,character<'d'>>`
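A rough sketch of what this kind of merge could look like at the type level, for anyone following along. The node types below (`character`, `string`, `any`, `sequence`) are hypothetical stand-ins, not the library's real definitions, and `fuse` / `fuse_t` are made-up names; the actual change may be implemented quite differently.

```cpp
#include <type_traits>

namespace fuse_sketch {

// Hypothetical stand-ins for the regex AST node types (assumption, not the real ones).
template <auto C> struct character {};
template <auto... Cs> struct string {};
struct any {};
template <typename... Ts> struct sequence {};

// Type-level list holding the terms that are already finished.
template <typename... Ts> struct list {};

// fuse<list<Out...>, In...>: walk In left to right, folding runs of character<> into string<>.
template <typename Out, typename... In> struct fuse;

// No input left: emit the finished sequence.
template <typename... Out>
struct fuse<list<Out...>> { using type = sequence<Out...>; };

// Two adjacent characters start a string.
template <typename... Out, auto A, auto B, typename... In>
struct fuse<list<Out...>, character<A>, character<B>, In...>
    : fuse<list<Out...>, string<A, B>, In...> {};

// A string followed by a character absorbs that character.
template <typename... Out, auto... S, auto C, typename... In>
struct fuse<list<Out...>, string<S...>, character<C>, In...>
    : fuse<list<Out...>, string<S..., C>, In...> {};

// Anything else: the head term is finished, move it to the output.
template <typename... Out, typename Head, typename... In>
struct fuse<list<Out...>, Head, In...>
    : fuse<list<Out..., Head>, In...> {};

template <typename... Terms>
using fuse_t = typename fuse<list<>, Terms...>::type;

} // namespace fuse_sketch
```

With these stand-ins, `fuse_t<character<'a'>, character<'b'>, any, character<'c'>, character<'d'>>` comes out as `sequence<string<'a','b'>, any, string<'c','d'>>`. The sketch only looks at one flat sequence; it does not descend into nested terms.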
Haven't updated yet, just curious: does this handle strings nested further inside other sequence-like things? The above would eventually turn into:

    auto remaining_characters = ::std::distance(current, end);
    if (remaining_characters < 9) {
        return not_matched;
    }
    bool matched = *(current+0) == 'a' &&
        *(current+1) == 'b' &&
        *(current+2) == 'c' &&
        (/* capture sequence 3 here */ true) &&
        *(current+3) == 'd' &&
        *(current+4) == 'e' &&
        *(current+5) == 'f' &&
        (/* capture sequence 2 here */ true) &&
        *(current+6) == 'g' &&
        *(current+7) == 'h' &&
        *(current+8) == 'i' &&
        (/* capture sequence 1 here */ true);
    if (!matched) {
        return not_matched;
    }
    // ::std::advance returns void, so ::std::next is used to get the advanced iterator
    return evaluate(begin, ::std::next(current, 9), end, captures, ...);

Did this with characters, but obviously the comparisons there could just as easily have been a set or negated set.
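To make the idea above concrete, here is a minimal compilable sketch of "one bounds check, then unrolled comparisons" for a fixed literal. It leaves out captures entirely and assumes random-access iterators; `literal`, `matches_at` and `demo` are names invented for this example, not anything from the library.

```cpp
#include <cstddef>
#include <iterator>
#include <string_view>
#include <utility>

namespace bounds_sketch {

template <char... Cs>
struct literal {
    static constexpr std::size_t size = sizeof...(Cs);

    // True if [current, end) begins with Cs...; assumes random-access iterators.
    template <typename It>
    static constexpr bool matches_at(It current, It end) {
        // Single bounds check for the whole literal instead of one per character.
        if (std::distance(current, end) < static_cast<std::ptrdiff_t>(size)) {
            return false;
        }
        return compare(current, std::make_index_sequence<size>{});
    }

private:
    // Expands to *(current+0) == C0 && *(current+1) == C1 && ...
    template <typename It, std::size_t... Is>
    static constexpr bool compare(It current, std::index_sequence<Is...>) {
        return ((*(current + Is) == Cs) && ...);
    }
};

// Small helper so the check can be exercised at compile time.
constexpr bool demo(std::string_view s) {
    return literal<'a', 'b', 'c'>::matches_at(s.begin(), s.end());
}

} // namespace bounds_sketch
```

`bounds_sketch::demo("abcdef")` is true, while `demo("ab")` fails on the bounds check and `demo("abd")` fails on the third comparison. For iterators where `std::distance` or `current + N` is not cheap, this shape loses its advantage.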
You should try it. About the strings, I'm not sure: for some iterators (namely utf8_iterator) calculating the distance is costly, and doing the *(current+N) unpack would be costly too.
Does the utf8 iterator work assuming the compressed format? If so, the bounds check could be based on the expected byte size instead... I think? And then the +'s could be increments as before, or I guess byte indexes, assuming it's all a contiguous array anyway. I mean, mid-increment the utf8 iterator is not going to match a utf16- or utf32-encoded character.
Don't know what you mean by the compressed format. Currently string<S...> can be compared against all char types or even a UTF-8 code point iterator; the comparison always happens against UTF-32 code points internally.
I guess the question is whether, given a utf8 string, it'd be safe to use byte indexes and ignore the utf8 iterator temporarily.
… on utf8 sequences) reference: hanickadot#147 comparison: https://compiler-explorer.com/z/Tz3KhG
@hanickadot I think I've worked out how to manage this with utf8, assuming tests pass. Probably went the longer way around, since you likely have a utf8 encoding utility somewhere. I went about constructing a char8_t buffer from the string<String...> so that I could do the equivalent of the above. Should also save on some iterator++'s as well. Crossing fingers tests pass. Anyway, happy holidays!
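For readers who want to see the shape of that approach, below is a minimal sketch of encoding the code points to UTF-8 bytes at compile time and then comparing raw bytes with a single byte-count bounds check. It is not the code from the linked commits. It uses `unsigned char` rather than `char8_t` so it builds as C++17, assumes the code points are valid Unicode scalar values, and `encode_utf8` / `starts_with_utf8` are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

namespace utf8_sketch {

// Number of UTF-8 bytes for one code point (assumes a valid scalar value).
constexpr std::size_t utf8_length(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Encode the code points into a byte buffer at compile time.
template <char32_t... Cps>
constexpr auto encode_utf8() {
    constexpr std::size_t size = (std::size_t{0} + ... + utf8_length(Cps));
    std::array<unsigned char, size> out{};
    // Trailing 0 keeps the array non-empty when the pack is empty; it is never encoded.
    const char32_t cps[sizeof...(Cps) + 1] = {Cps..., 0};
    std::size_t i = 0;
    for (std::size_t k = 0; k < sizeof...(Cps); ++k) {
        const char32_t cp = cps[k];
        if (cp < 0x80) {
            out[i++] = static_cast<unsigned char>(cp);
        } else if (cp < 0x800) {
            out[i++] = static_cast<unsigned char>(0xC0 | (cp >> 6));
            out[i++] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out[i++] = static_cast<unsigned char>(0xE0 | (cp >> 12));
            out[i++] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
            out[i++] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        } else {
            out[i++] = static_cast<unsigned char>(0xF0 | (cp >> 18));
            out[i++] = static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F));
            out[i++] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
            out[i++] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Byte-wise prefix check: the bounds check is in bytes, so no code point decoding is needed.
template <char32_t... Cps>
constexpr bool starts_with_utf8(std::string_view input) {
    constexpr auto bytes = encode_utf8<Cps...>();
    if (input.size() < bytes.size()) {
        return false;
    }
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (static_cast<unsigned char>(input[i]) != bytes[i]) {
            return false;
        }
    }
    return true;
}

} // namespace utf8_sketch
```

This only works when the subject is known to be a contiguous UTF-8 byte range; comparing against UTF-16 or UTF-32 input would still need the code point path.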
A regex like:
appears to generate string<a,b,c,d>
but*
generates
sequence<character<a>,character<b>,character<c>,character<d>,optional<any>>
as opposed to:
sequence<string<a,b,c,d>,optional<any>>
Bit of a trip-up, as I was trying to integrate #143 with a simple check to see if `string<>` was the first term.
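For context, the kind of "is `string<>` the first term" check described here might look roughly like the trait below. The node types are hypothetical stand-ins again and `first_term_is_string` is an invented name; it is only meant to show why the unmerged `character<>` form defeats such a check.

```cpp
#include <type_traits>

namespace first_term_sketch {

// Hypothetical stand-ins for the regex AST node types (assumption, not the real ones).
template <auto C> struct character {};
template <auto... Cs> struct string {};
struct any {};
template <typename T> struct optional {};
template <typename... Ts> struct sequence {};

// Default: the pattern does not start with a string<>.
template <typename T>
struct first_term_is_string : std::false_type {};

// The whole pattern is a string<>.
template <auto... Cs>
struct first_term_is_string<string<Cs...>> : std::true_type {};

// A sequence whose first term is a string<>.
template <auto... Cs, typename... Rest>
struct first_term_is_string<sequence<string<Cs...>, Rest...>> : std::true_type {};

} // namespace first_term_sketch
```

With this trait, `sequence<string<'a','b','c','d'>, optional<any>>` is detected, but the `sequence<character<'a'>, ..., optional<any>>` form reported above is not, even though both describe the same literal prefix.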