Concatenation of character<> into string<> missed if in subexpression #147

Closed
Andersama opened this issue Nov 15, 2020 · 6 comments

@Andersama
Contributor

Andersama commented Nov 15, 2020

A regex like:

abcd

appears to generate string<a,b,c,d>
but

abcd.?

generates
sequence<character<a>,character<b>,character<c>,character<d>,optional<any>>
as opposed to:
sequence<string<a,b,c,d>,optional<any>>

Bit of a trip up as I was trying to integrate #143 with a simple check to see if string<> was the first term.

hanickadot pushed a commit that referenced this issue Nov 15, 2020
…ifferent items in sequence:

`ab.cd` is now `sequence<string<'a','b'>,any,string<'c','d'>>`

before it was:
`sequence<character<'a'>,character<'b'>,any,character<'c'>,character<'d'>>`
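The merging described in the commit message can be illustrated with a small, self-contained sketch. This is not CTRE's actual implementation: the node types below (`character`, `string`, `any`, `optional`, `sequence`) are hypothetical stand-ins declared locally, and `prepend`, `merge`, and `merge_adjacent_t` are names invented for this example. The sketch walks a sequence from right to left and folds each `character<>` into a neighbouring `character<>` or `string<>`.

```cpp
#include <type_traits>

namespace sketch {

// Hypothetical stand-ins for the regex AST nodes (not CTRE's real definitions).
template <auto C> struct character {};
template <auto... Cs> struct string {};
struct any {};
template <typename T> struct optional {};
template <typename... Ts> struct sequence {};

// Prepend one node to an already-merged sequence.
template <typename Head, typename Seq> struct prepend;

// Default case: keep the node as a separate item.
template <typename Head, typename... Ts>
struct prepend<Head, sequence<Ts...>> {
    using type = sequence<Head, Ts...>;
};

// character followed by character: start a string.
template <auto A, auto B, typename... Ts>
struct prepend<character<A>, sequence<character<B>, Ts...>> {
    using type = sequence<string<A, B>, Ts...>;
};

// character followed by string: grow the string at the front.
template <auto A, auto... Bs, typename... Ts>
struct prepend<character<A>, sequence<string<Bs...>, Ts...>> {
    using type = sequence<string<A, Bs...>, Ts...>;
};

// Right fold over the items of a sequence.
template <typename... Ts> struct merge;
template <> struct merge<> { using type = sequence<>; };
template <typename Head, typename... Tail>
struct merge<Head, Tail...> {
    using type = typename prepend<Head, typename merge<Tail...>::type>::type;
};

template <typename Seq> struct merge_adjacent;
template <typename... Ts>
struct merge_adjacent<sequence<Ts...>> : merge<Ts...> {};

template <typename Seq>
using merge_adjacent_t = typename merge_adjacent<Seq>::type;

// The `ab.cd` case from the commit message.
static_assert(std::is_same_v<
    merge_adjacent_t<sequence<character<'a'>, character<'b'>, any,
                              character<'c'>, character<'d'>>>,
    sequence<string<'a', 'b'>, any, string<'c', 'd'>>>);

} // namespace sketch
```

A lone `character<>` with no character or string neighbour to its right is left untouched, so the sketch only changes runs of two or more characters.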
@Andersama
Contributor Author

Andersama commented Nov 15, 2020

Haven't updated yet, just curious: does this handle strings nested further inside other sequence-like things? E.g.: (((abc)def)ghi)?
Here's what I'm up to: I'm trying to combine some pattern analysis so that I could transform the above into something "StringLike". The idea is that, given a series of CharacterLike atoms and a random access iterator, we can do one bounds check and then a massive unrolled series of character comparisons (none dependent on one another). Things like ()'s can have their ) end markers interleaved with the unrolled character comparisons or left at the end, because StringLike must have matched for any of the other matches to count.

E.g., the above would eventually turn into:

auto remaining_characters = ::std::distance(current, end);
if (remaining_characters < 9) {
    return not_matched;
}
// The "capture N ends here" markers are placeholders that always evaluate to true.
bool matched = *(current+0) == 'a' &&
    *(current+1) == 'b' &&
    *(current+2) == 'c' &&
    (/* capture 3 ends here */ true) &&
    *(current+3) == 'd' &&
    *(current+4) == 'e' &&
    *(current+5) == 'f' &&
    (/* capture 2 ends here */ true) &&
    *(current+6) == 'g' &&
    *(current+7) == 'h' &&
    *(current+8) == 'i' &&
    (/* capture 1 ends here */ true);
if (!matched) {
    return not_matched;
}
// ::std::next returns the advanced iterator; ::std::advance returns void.
return evaluate(begin, ::std::next(current, 9), end, captures, ...);

I did this with characters, but the comparisons there could just as easily have been against a set or negated set.
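The "one bounds check, then unrolled comparisons" idea can be sketched with a fold expression. This is only an illustration under stated assumptions: `literal`, `equal_at`, `match`, and `starts_with_literal` are hypothetical names, not part of CTRE, the iterator is assumed to be random access, and capture bookkeeping is left out entirely.

```cpp
#include <cstddef>
#include <string_view>
#include <utility>

namespace sketch {

// Hypothetical "StringLike" node: a fixed run of characters known at compile time.
template <auto... Cs>
struct literal {
    static constexpr std::size_t size = sizeof...(Cs);

    // Expands to (*(current+0) == C0) && (*(current+1) == C1) && ...
    // No comparison depends on the result of another one.
    template <typename It, std::size_t... Is>
    static constexpr bool equal_at(It current, std::index_sequence<Is...>) {
        return ((*(current + static_cast<std::ptrdiff_t>(Is)) == Cs) && ...);
    }

    // Assumes It is a random access iterator, so end - current is cheap.
    template <typename It>
    static constexpr bool match(It current, It end) {
        // Single bounds check for the whole run.
        if (static_cast<std::size_t>(end - current) < size) {
            return false;
        }
        return equal_at(current, std::make_index_sequence<size>{});
    }
};

// Convenience wrapper for the example: does `text` start with the literal?
template <auto... Cs>
constexpr bool starts_with_literal(std::string_view text) {
    return literal<Cs...>::match(text.data(), text.data() + text.size());
}

static_assert(starts_with_literal<'a', 'b', 'c'>("abcdef"));
static_assert(!starts_with_literal<'a', 'b', 'c'>("ab"));

} // namespace sketch
```

Swapping the `==` for a set or negated-set test per position would follow the same shape, since each position is still checked independently after the one bounds check.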

@hanickadot
Owner

You should try it. About the strings, I'm not sure: for some iterators (namely utf8_iterator) calculating the distance is costly, and unpacking into *(current+N)... would be costly too.

@Andersama
Contributor Author

Andersama commented Nov 15, 2020

Does the utf8 iterator work assuming the compressed format? If so, the bounds check could instead be based on the expected byte size... I think? The +'s could then be increments as before, or I guess byte indexes, assuming it's all a contiguous array anyway.

I mean mid increment for the utf8 iterator it's not going to match a utf16 or utf32 encoded character.

@hanickadot
Owner

Don't know what you mean with the compressed format. Currently string<S...> can be used to compare against all char types or even a UTF-8 code point iterator; the comparison always happens against UTF-32 code points internally.

@Andersama
Contributor Author

I guess the question is whether, given a utf8 string, it'd be safe to use byte indexes and temporarily ignore the utf8 iterator.

Andersama added a commit to Andersama/compile-time-regular-expressions that referenced this issue Dec 28, 2020
@Andersama
Contributor Author

@hanickadot I think I've worked out how to manage this with utf8, assuming tests pass. Probably went the longer way around since you likely have a utf8 encoding utility somewhere. I went about constructing a char8_t buffer from the string<String...> so that I could do the equivalent of the above. Should also save on some iterator++'s as well. Crossing fingers tests pass. Anyway, happy holidays!
#159
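Building a byte buffer from the code points of a string at compile time might look roughly like the sketch below. It is not the code from the linked pull request: `utf8_length`, `append_utf8`, and `encode_utf8` are hypothetical names, `unsigned char` is used in place of `char8_t` so the example compiles as C++17, and every code point is assumed to be a valid Unicode scalar value.

```cpp
#include <array>
#include <cstddef>

namespace sketch {

// Number of UTF-8 bytes for one code point (assumes a valid scalar value).
constexpr std::size_t utf8_length(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Append the UTF-8 encoding of cp to out, starting at index pos.
template <std::size_t N>
constexpr void append_utf8(std::array<unsigned char, N>& out, std::size_t& pos, char32_t cp) {
    if (cp < 0x80) {
        out[pos++] = static_cast<unsigned char>(cp);
    } else if (cp < 0x800) {
        out[pos++] = static_cast<unsigned char>(0xC0 | (cp >> 6));
        out[pos++] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out[pos++] = static_cast<unsigned char>(0xE0 | (cp >> 12));
        out[pos++] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        out[pos++] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
    } else {
        out[pos++] = static_cast<unsigned char>(0xF0 | (cp >> 18));
        out[pos++] = static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F));
        out[pos++] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        out[pos++] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
    }
}

// Encode a pack of code points into a fixed-size byte array at compile time.
template <char32_t... Cps>
constexpr auto encode_utf8() {
    std::array<unsigned char, (utf8_length(Cps) + ... + std::size_t{0})> out{};
    std::size_t pos = 0;
    (append_utf8(out, pos, Cps), ...);
    return out;
}

// 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes).
static_assert(encode_utf8<0x61, 0xE9, 0x20AC>().size() == 6);

} // namespace sketch
```

With the expected bytes and their total count known up front, the bounds check can be expressed in bytes and the comparison done on the underlying contiguous storage, which is the approach the comment above describes.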

Andersama added a commit to Andersama/compile-time-regular-expressions that referenced this issue Dec 29, 2020