Ranter
Join devRant
Do all the things like
++ or -- rants, post your own rants, comment on others' rants and build your customized dev avatar
Sign Up
Pipeless API
From the creators of devRant, Pipeless lets you power real-time personalized recommendations and activity feeds using a simple API
Learn More
Comments
-
https://devrant.com/rants/7682021/...
I'm still a bit confused, but from my understanding a string would be UTF-8 (which implies the existence of code points)...
So substring in string is a regular loop checking char by char I'd guess?
It's the easiest way.
Plus maybe the typical optimizations, for example validating length.
If length(substring) > length(string) - false
if remainingChars < length(substring) - false
Et cetera -
@IntrusionCM It's two nested loops, other than that the only complication is that they're both clonable reversible iterators and not indices because I iterate over graphemes produced by unicode_segmentation and not code points
-
PS: it's not even that because chars() is also an iterator. I'm literally just struggling to write a subsequence testing loop.
-
another issue is that I always try to optimize prematurely so my first attempt was to find the rarest element in the needle and then search the haystack for that element.
-
@lorentz The nice thing about UTF-8 is that you don't have to care about code points for many string operations. Just treat them as bytes.
It's only when you want to e.g. count letters that this won't fly anymore. -
@lorentz hm...
HMMMMMMMMM....
I'm really not sure if I like that idea regarding strings.
It just sounds like a rather complicated attempt to fix a minor inconvenience.
Let's take an example: The turkish letter i problem. It's not only in the Turkish language present, but known mostly under that name.
http://en.wikipedia.org/wiki/...
Leading to e.g. this:
https://lucene.apache.org/core/...
If I get it right, you would just convert it down to latin I despite having a special meaning?
It reminds me a lot of the - nowadays - very hated decision of MySQL to strip UTF8 to a max of 3 bytes. They took out the 4th dataplane.
Which leads to the wonderful world of pain:
utf8 is actually utf8mb3
utd8mb4 is the "real" utf8
I know a lot of people (especially as I come from a PHP background) who are freaked out by UTF-8 and the difference between string length vs byte length.
But imho that's wrong. It doesn't hurt nor is it bad. Just use it appropriately and don't try to be overly clever...
That's imho why most people struggle with UTF-8... They worry about problems that don't really exist.
Unless you have extremely large strings (talking about dozens of megabytes plus), there is neither a reason to worry about processing a string nor about finding out the (byte-) length of a string.
Most compilers / environments have picked up heavily SIMD optimizations... Making performance problems really a thing of the past, especially
https://en.m.wikipedia.org/wiki/...
AVX. -
@lorentz means it's time to take a break, go do something else, or even go to sleep/nap.
I struggled the night before to write an eight line function doing some simple filtering.
Came back in the morning, and it literally took me two minutes to write it, but free first try.
It's almost always your brain saying its had enough. -
@IntrusionCM I think you misunderstand my stance. Unicode encodes text in codepoints. UTF-8 encodes codepoints in bytes. Historically, programs handled strings by the byte. Today, programs handle strings by the codepoint. My plan is to handle strings by the grapheme, which is a sequence of codepoints describing a unit of human writing. A grapheme usually consists of a base codepoint followed by various modifiers or ZWNJ+codepoint pairs. Graphemes usually coincide with font glyphs.
This is different from handling code points in that eg. Mandarin graphemes that consist of mostly two codepoints would not be cut in half when transforming strings. For the most part they would behave exactly like codepoint-sequences other than addressing this specific class of bugs.
1/2 -
2/2
The drawback is that because Unicode places no limitations on the number of modifiers (Zalgo characters can be several kilobytes) the underlying datatype of char has to be an unbounded string as opposed to 4 byte codepoints. -
@IntrusionCM The statement of the rant you linked is basically just that since string processing is already variable width to handle code points, handling graphemes is marginally more complex and eliminates a new class of visual glitches such as mandarin letters getting cut in half by trimming or line breaks.
-
@lorentz yeah I misunderstood the grapheme part.
You're essentially representing a composed unicode character as a single character though it is actually a character consisting of multiple characters? -
@IntrusionCM That is exactly what I'm doing. Pretty much the same features as provided by the unicode_segmentation crate, but it's the default behaviour in the standard string library.
-
@lorentz Why don't you just break down the grapheme that you look for into its UTF-8 representation and proceed on raw byte level from there? Do identical graphemes have multiple possible representations?
-
@Fast-Nop Since both are clonable, reversible iterators of equality-comparable elements, I don't have to do anything extra. My brain just stopped working temporarily.
-
@Fast-Nop Most graphemes do have multiple representations, but I"m not handling that because the closest well-defined concept is "similar character" which is not an equivalence relation, and I don't want to complicate my API quite that much
-
I'm not opposed to the idea of defining some equivalent sets of graphemes for the purpose of text processing, but it's a massive complication for relatively little gain so it's definitely not high priority.
-
Also, messing with the concept of identical text is a massive security risk, eg. it may allow a bad actor to register under a username which is treated as a distinct account by surrounding code but has the same rights as another account if the validation is done within Orchid.
It's not even specific to strange characters either, here's a list of look-alikes to Latin characters. The first entry in several rows is so similar that any sensible notion of identical characters would need to include it:
https://gist.github.com/StevenACoff... -
@lorentz
Don't forget about Unicode normalization.
Some codepoint + modifier(s) combos exist as standalone codepoints.
Without normalization, naive codepoint comparison would sometimes fail. -
@CoreFusionX Yeah but if it fails everywhere else uniformly and only succeeds in Orchid that could easily create security vulnerabilities even.
-
Now I feel like I have to announce that I did finally manage to write the god damn motherfucking find function.
It's probably not the most efficient find function ever, but like, it's in Rust, it will be called from an interpreted language, and I have tens to hundreds of functions to write before the language is even remotely usable, so optimization can wait. -
Hah! I can relate. I often find myself googling "How to iterate over an array javascript"
-
I think I know what you mean, I had a similar issue with my own language when I was writing it, the first thing to keep in mind is to <dead>
-
Update on this, there is apparently a well-defined concept of equivalent graphemes for Unicode which is an equivalence relation and a lot of languages use it, so that's coming too.
I now feel validated in my quest for text processing exclusively via the public API of Unicode and without acknowledging the encoding. -
That being said, I refuse to acknowledge locale as a property of the environment so I'm on the fence as to whether adding a locale property to strings or supporting only locale-independent algorithms is the less bad option
-
@IntrusionCM I do want to support some sort of localization, I just think that locale is a property of the text.
Related Rants
I'm struggling to write a function that finds a subsequence in a sequence. I made a fucking programming language and String::find is where I get stuck. Fucking fuck.
Impostor syndrome hitting hard today
rant
stl
langdev
impostor syndrome
orchid