llm

Ranter

Wisecrack

9520

Comments

3

typosaurus

10592

1y

It's very interesting how weights get calculated and how they are related to each other, especially how to store them in a way that makes them quick to look up. You kinda explained that (I didn't understand everything). But see this beautiful project that can read a Harry Potter book and write new sentences. It learns character by character (not word by word):

https://github.com/Ricardicus/...

There is also a Python version in the source directory.

Do you prefer to use sigmoid or relu?
3

Wisecrack

9520

1y

@retoor first, I checked it out, that looks interesting. Especially because so few people are willing to use LSTMs anymore with the new shiny thing on the block.

I prefer ReLU because it's pretty standard, all things being equal.

Incidentally, there's no non-linearity happening here. There's not even multiplication happening.

It's all about using noise to map a given distribution to another distribution, with the hard part simply being a permutation search for a given key layer that forms an isomorphic projection between the input and output distribution.

The idea is that instead of doing a computation, we do a sampling of noise, and filter by simple comparison for noise that does the mapping for us. (There are also a lot of different types of noise, all with different properties, like pink noise, white noise, black noise, etc.)
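A toy sketch of the sample-and-filter idea, not the actual implementation. It assumes tokens are small integers, that the "key layer" is a random permutation of the vocabulary, and that a candidate is kept purely by comparing the mapped input against the target. The names `score` and `search_key_layer` are made up for illustration.

```python
import random

def score(key, inputs, targets):
    """Count positions where the key maps the input token onto the target token."""
    return sum(1 for x, y in zip(inputs, targets) if key[x] == y)

def search_key_layer(inputs, targets, vocab_size, tries=20000, seed=0):
    """Sample random permutations (the 'noise') and keep the best one by comparison."""
    rng = random.Random(seed)
    best_key, best_score = None, -1
    for _ in range(tries):
        key = list(range(vocab_size))
        rng.shuffle(key)                   # one noise sample
        s = score(key, inputs, targets)
        if s > best_score:                 # filter step: a plain comparison
            best_key, best_score = key, s
        if best_score == len(inputs):      # perfect mapping found, stop early
            break
    return best_key, best_score

# Hypothetical toy data: the target is the input shifted by one token id.
inputs = [0, 1, 2, 3, 2, 1]
targets = [3, 0, 1, 2, 1, 0]
key, hits = search_key_layer(inputs, targets, vocab_size=4)
print(key, hits)
```

With a vocabulary of four tokens there are only 24 permutations, so blind sampling finds the mapping almost immediately. The search space grows factorially with vocabulary size, which is presumably why the permutation search is the hard part.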
3

Wisecrack

9520

1y

I've also left out a TON of information that makes this sound like handwaving and woo.

Things like how you *also* have graph tokens, and a graph token dictionary. Graph tokens are special nodes that seek through scattershot embeddings of a context window, and perform specific functions, like counting instances of words or tokens, checking presence, or lack of tokens, performing matching, performing token substitutions, etc.
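A rough sketch of what a graph token dictionary could look like, assuming each graph token is just a named function over a context window of string tokens. The functions are hard-coded here purely for illustration (in the scheme described they would be matched to embeddings rather than written by hand), and every name below is hypothetical.

```python
from typing import Callable, Dict, List

Context = List[str]  # a context window as a list of tokens

def count_token(token: str) -> Callable[[Context], int]:
    """Graph token that counts instances of a token in the context."""
    return lambda ctx: sum(1 for t in ctx if t == token)

def has_token(token: str) -> Callable[[Context], bool]:
    """Graph token that checks presence (or lack) of a token."""
    return lambda ctx: token in ctx

def substitute(old: str, new: str) -> Callable[[Context], Context]:
    """Graph token that performs a token substitution."""
    return lambda ctx: [new if t == old else t for t in ctx]

# Hypothetical dictionary: name of the graph token -> function it performs.
GRAPH_TOKENS: Dict[str, Callable] = {
    "COUNT(the)": count_token("the"),
    "HAS(dog)": has_token("dog"),
    "SUB(cat->dog)": substitute("cat", "dog"),
}

ctx = "the cat sat on the mat".split()
results = {name: fn(ctx) for name, fn in GRAPH_TOKENS.items()}
for name, value in results.items():
    print(name, "->", value)
```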

The same encoder-decoder that pairs input tokens to output tokens, also does the same for sequences of graphs and subgraph nodes.

How does this interact with inputs and outputs?

Well, during training, for each input, many graph contexts are generated, and this is done in parallel to training on tokens.

However, the critical bit is, during inference the input is processed and an output is generated first, and only then are the graphs run on the token context window for the respective input (queries) and outputs (values).
3

Wisecrack

9520

1y

What do the graphs really do here though?

If the same embedding technique is used to encode and decode the graphs as is used for the context tokens, then we really only need to ask what the graphs do internally.

First, we assume the graphs themselves are a sort of embedding based on our particular embedding technique.

Second, you'll be prone to saying "given the current technique, you can train the system to output anything, any random string of tokens or words based on any random input." That's correct, but it's also true for all LLMs. They're only as good as their training data specifically, and their architecture and underlying math generally. That is to say, the world is its own ground truth, and like how axioms within a particular system cannot prove the set is complete (incompleteness), there is no way to say definitively that anything a machine outputs is congruent with the reality outside of what is *merely* represented by its own training data.
3

Wisecrack

9520

1y

Third, just as each graph token in the graph token dictionary performs a different function, from looking for particular word token entries (or sets of them), to constraining token order, to counting instances, its function is divorced from its representation. A graph, being an embedding, also HAS an embedding.

What it does is add additional state information, and generate *new* state, based on the input.

Graph objectives are therefore trained, tested, and validated in parallel to input/output pairs.

Why is this important?
3

Wisecrack

9520

1y

Because if you compressed a lot of training data using the Ln autoencoder method, along with relevant graphs, then the noise embedding retrieved for the most similar training data, an embedding introduced by a *new* input and its variance, acts to modify the mean of the graph representation, which changes which graph tokens are generated from that representation's embedding.

Think about it as dual encoder-decoder pairs, where during training they run in parallel, one for tokens, and one for graphs. And during inference, the decoder outputs tokens from the encoder's input, and only *then* does the selected graph for the decoder's network mean output modify said output.
4

Wisecrack

9520

1y

...In much the same way that modern research reveals through MRI that we often formulate our responses automatically without being aware of the fact, and then filter and modify them deliberately on the fly only *after* becoming aware our brain has determined a response. There is supporting evidence of this in the motor system, where the premotor cortex is known to generate many possible responses and motions in any given situation, and uses inhibitory networks to filter and select for optimal movements for the given situation.
3

Wisecrack

9520

1y

DevRant butchering my sentences and linebreaks notwithstanding, I hope any of that is readable at all.

The final piece is aligning the function of the generated trained/tested graphs to their embeddings, so in this way, finding the best embedding for mapping an input to a generated output, is equivalent to performing some function on the output that brings the final output closer to the distribution of the input.
3

Wisecrack

9520

1y

The bonus is that all the learning and functions happen at the symbolic level, without hard coding any rules, making such a network highly modifiable and inspectable.

And for any given artificial rule, it becomes trivial to put one together, and then find a matching embedding that works for the training/test/validation data.

If any of this doesn't make any sense, I've likely butchered it further and left out pieces while I was trying to splice around devRant's post and comment character limit.

Probably the first *solid* test of the system would be a character-level assessment (ASCII characters as tokens) to teach it to identify word spacing.

Some graph tokens I considered included ones that would produce continuations past a default length, ones that would modify the target distribution, and ones that would embed 'ephemeral tokens' (they don't show up as printable characters or visible tokens but modify the context and the graph's function in some way, or act as state placeholders, etc).
3

Wisecrack

9520

1y

Divorcing representation from function was probably the major thing.

Ironically it is about as fast as most of the 8b models I ran on my system, producing maybe 20 tokens per minute.

Right now just the token encoder-decoder is built and functional, but it was proof enough that the entire premise has merit.

After character level, I'll train it to do word detection (spaces, character chunking).

That'll be proof that the graph method is also viable.

From there component word extraction (noun, verb, adjective, subject, etc).

A lot of this was initially inspired by a brief article I read on Gaussian splatting, and research into the subgraph problem that LLMs deal with: they use subgraphs in training data to pattern match (for some definition of pattern matching) in order to solve problems in a way that, once revealed, is counter-intuitive, and not in the fashion the generated output itself may at first even claim.
3

Wisecrack

9520

1y

There are a few bits and bobs I left out, like how Levenshtein distance is used to mutate the transition matrix, or how parts of it (if you squint real hard) are PageRank (or PageRank-like) shoe-horned into a completely non-PageRank-shaped problem.
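For anyone who hasn't run into it, here is a minimal sketch of Levenshtein (edit) distance on its own. How it actually feeds into mutating the transition matrix isn't described above, so that part is deliberately left out. This is the textbook dynamic-programming version, nothing specific to this system.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))         # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                         # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + cost,        # substitute (or keep on a match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))
```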
3

Wisecrack

9520

1y

Also, this is probably the least shit-posty post I've made in a while, thanks for being the first to respond!

I got to go to bed before work, so if I don't respond to any comments right away, I will when I'm back up.
3

typosaurus

10592

1y

@wisecrack even with your beautiful abstract way of explaining it, my level regarding that subject is still too low to understand. I know how to train a single-layer neuron but that's all. It's nice knowledge, the core. There's a BEAUTIFUL video about it by tsoding. He trains a neuron to understand boolean input and in the end the neuron was 99.999% sure that 1 is true. But I have no idea what the idea behind the math is, or why the weights are even random at the beginning and not a certain value. I imagine that it would be faster if you gave it a direction. A direction should be possible because it's made to learn a specific thing in the end.
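A minimal sketch of that kind of experiment, assuming the boolean function is OR and the neuron is a single sigmoid unit trained with plain gradient descent (the video's exact setup may differ). On the random-weights question: for a lone neuron, starting from zero would also work. Random starts matter once a layer has several neurons, because identical starting weights would get identical updates and the neurons would never learn different things.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Truth table for OR: ((x1, x2), expected output)
DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

def train_or_neuron(epochs=5000, lr=0.5, seed=0):
    rng = random.Random(seed)
    # Small random starting weights and bias.
    w1, w2, b = (rng.uniform(-1, 1) for _ in range(3))
    for _ in range(epochs):
        for (x1, x2), y in DATA:
            p = sigmoid(w1 * x1 + w2 * x2 + b)
            err = p - y            # gradient of cross-entropy loss w.r.t. the pre-activation
            w1 -= lr * err * x1    # nudge each weight against its share of the error
            w2 -= lr * err * x2
            b -= lr * err
    return w1, w2, b

w1, w2, b = train_or_neuron()

def predict(x1, x2):
    return sigmoid(w1 * x1 + w2 * x2 + b)

for (x1, x2), y in DATA:
    print(x1, x2, "->", round(predict(x1, x2), 4), "expected", y)
```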

The difference between using ReLU and sigmoid can be very cool. I think ReLU is in most cases the standard now. It's faster but less specific. Less gray area.
3

Wisecrack

9520

1y

@retoor thanks for that.

I think I read an article a few years back that explained the same exact principle in the same style of experiment. Identifying boolean values.

Backpropagation is still a beautiful thing.

I'll make illustrations with source code before long (with way less over-explanation) when the graph portion of the code comes along.

But for example the entire process could be inverted, with the context window composed of graph tokens, and the prior graph layer being composed of token dictionary embeddings, or at least this in effect, by finding token dictionary embeddings, used as a noise layer, that map one graph sequence to another.
3

Wisecrack

9520

1y

The output of the sequence might of course look like the stream-of-subconscious babble a guy might produce while flying high on DMT and talking to machine elves, or some other variation of hallucination: the words produced would look random and have no correlation to their real-world meaning. They'd just be a higher-dimensional vector to represent the mapping, using words (and their random embeddings) instead of straight numbers.

But hypothetically, these random looking phrases, would be functions that compress the graph sequences in question wherever we find an autoencoder input where the output is shorter.

Likewise hidden dimensions of a graph sequence could be blown up by finding these 'random word phrases' (representing random vectors) that map shorter graph sequences to longer graph sequences.

God damn, it really does come off as mere babble.
2

Wisecrack

9520

1y

Also, what's your understanding of ReLU vs sigmoid?
1

typosaurus

10592

1y

@Wisecrack ReLU is like an int, sigmoid like a double. Sigmoid is just more precise IIRC. Am I wrong?
3

typosaurus

10592

1y

@Wisecrack I think with brains like yours you could be in jail without being bored
2

CoreFusionX

3626

1y

@retoor

Not math wizard like Mr. @Wisecrack, but can try.

ReLU is basically a 0-clamped ramp function, max(0, x), while sigmoid is a smooth S-shaped curve built on an exponential, 1 / (1 + e^-x).

Deep NNs use relu mainly because it's significantly faster to compute (no exponents), significantly faster to backpropagate (just an if), and produces less gradient saturation than sigmoid (as in, more places in the function where the derivative is far enough from 0), which speeds up learning.
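To make that concrete, a small sketch of both functions and their derivatives in plain Python (the function names are mine). The thing to look at is the gradient column: sigmoid's derivative peaks at 0.25 and shrinks toward 0 for large positive or negative inputs, while ReLU's stays at exactly 1 for any positive input.

```python
import math

def relu(x):
    return x if x > 0 else 0.0

def relu_grad(x):
    # Backprop through ReLU is "just an if": pass the gradient or block it.
    return 1.0 if x > 0 else 0.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-6.0, -1.0, 0.5, 6.0):
    print(f"x={x:5.1f}  relu'={relu_grad(x):.1f}  sigmoid'={sigmoid_grad(x):.4f}")
```

Note that ReLU has its own dead zone: the gradient is exactly 0 for every negative input, which is the trade-off for the cheap computation.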
1

figoore

237

1y

the comment butchering made the best cliff hanger there
> why is that important?

And I can confirm… f*ck it… it worked on me 😂
Pretty interesting writings, love it! @Wisecrack
2

figoore

237

1y

@retoor can’t agree more on the jail conspiracy theory 😂
Thank god @Wisecrack is doing maths and not meths 😂
2

typosaurus

10592

1y

@figoore the one doesn't exclude the other one. Maybe he's doing math with meth. Which one came first?

I could also spend a long while thinking when I was still discovering how to write an interpreter (still discovering a bit, never-ending thingy prolly). Thinking about it could make me skip a night just lying in bed. It was amazing, better than sleeping. Brainfuck also kept me busy for a while. It's easy to understand but actually doing smth with it... Writing a regular expression parser also really triggered my brain. It's impossible to just hack one together, it's too complex. Also, I'm happy that GPT can't write one. I recently made a sudoku solver that only did the easy puzzles. Then I saw the GPT version and it was a thousand times better. If you know THE way of a sudoku solver, it's the shortest way, I'm sure. But sadly I lost. I frequently try to beat GPT at coding by comparing afterwards. It's fun. I do keep my own solutions. I care about my own design.
1

Wisecrack

9520

1y

@CoreFusionX "more places in the function where the derivative is far enough from 0"

isn't it ReLu that produces derivatives close to the mean too, or am I confusing it with another non-linearity?
2

Wisecrack

9520

1y

@figoore I never said I'm not on ritalin.

But for the record I'm not.

I'm glad you enjoyed the post, even though there are no nice and shiny graphics to explain what the fuck I'm even talking about.
2

Wisecrack

9520

1y

@retoor like a generator-discriminator pair!

That's fucking cool.

"beat the machine" as it were.

I like it. I like it a lot.

Same on the sleepless nights. Lying down, staring up at the white painted ceiling in the dark, thoughts going a million miles an hour, while the street lights filter in through my blinds.

Or pacing on my porch, smoking cigarette after cigarette, beneath the moonlight while turning new ideas over in my head, what will work, what won't work, what might work, what offers new avenues to attack the problem, new possibilities.

Nothing quite like it.

And then, the a-ha moment, like Athena springing from the head of Zeus, fully formed, and racing off to code it all.
2

figoore

237

1y

@Wisecrack those nights are the best parts of being a programmer
I really like this wording, such a nice devrant gem
2

typosaurus

10592

1y

@Wisecrack yeaaahhh! I smoke too. 15 a day or so. 0,5333 euro cents each. I once said, if a pack of cigarettes becomes 5,- I quit. That would've been 0.20 per cigarette. Whoops. My biggest issue is that you pay for a box with a picture on it of people with cancer, in a way that will 99% not apply to you. Ever met someone in real life who ended up in a situation like the one in the picture? It's all so manufacturers can say you're warned. Not for health, for lawsuits. Ugh fuckers, fine, we take responsibility. Remove the pictures. Anyway, I'll keep smoking for a while, I just like it. When it comes to vaping, I also have a lot to say about that.
2

Wisecrack

9520

1y

@figoore I'm genuinely flattered. Thanks.
2

Wisecrack

9520

1y

@retoor Nothing like smoking after four hours of going without, the rush, eyes half-lidded, as everything comes into sharp focus.
2

typosaurus

10592

1y

@Wisecrack I can last without a cigarette for two days. But I will almost faint while smoking after that. The thing I hate is when I didn't notice I was smoking the last one and didn't really appreciate it. And often I just smoke the last one to get it over with. Procrastinating the last cigarette sucks :P

Smoking one every hour at work is nice. Many non-smokers joined me and we always had in-depth work-related talks. Those were the best retro sessions. For a while vaping was allowed in the office, but the vapes got bigger and bigger and at a certain moment they considered it enough.
1

Wisecrack

9520

1y

@retoor some dude brought in a bad-dragon sized vape didn't he?

I bet he did.

And they were like "we draw the line at smoking pole in the office!"
1

typosaurus

10592

1y

@Wisecrack 80-watt devices indeed. Huge self-made liquid bottles, enough for the whole 150-person office
1

Wisecrack

9520

1y

@retoor company policy 178: "you will NOT convert the company's open floor plan into a hookah shop!"


random

machine learning