
THIS PAGE IS WORK IN PROGRESS… 26.0523

Demos of

  • localized spatial filters
  • weight sharing
  • convolutions
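
As a rough illustration of the three items above, here is a minimal pure-Python sketch. The 5x5 image, the vertical-edge kernel, and the function name `conv2d_valid` are made up for this demo. Like most deep-learning libraries, it slides the kernel without flipping it (strictly a cross-correlation).

```python
def conv2d_valid(image, kernel):
    """Slide one small kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Localized: only the kh x kw patch at (r, c) contributes.
            # Weight sharing: the same kernel values are reused at every (r, c).
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


# Hypothetical 5x5 image: dark on the left (0), bright on the right (1).
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hypothetical hand-written vertical-edge filter (a trained CNN learns its own).
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[3, 3, 0], [3, 3, 0], [3, 3, 0]]

# Weight sharing in numbers: one 3x3 kernel has 9 weights, while a dense
# layer mapping all 25 pixels to the same 9 outputs would need 25 * 9 weights.
shared_weights = len(kernel) * len(kernel[0])
dense_weights = (5 * 5) * (3 * 3)
print(shared_weights, dense_weights)  # 9 225
```

The filter responds (value 3) wherever its 3x3 window straddles the dark-to-bright edge and gives 0 over the flat bright region.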


drones


The AI in the AI drones was object recognition using CNNs (convolutional neural networks) running on the Nvidia Jetson Nano and on the PI computer. CNNs are similar in many ways to LLM transformers (TFs), so they are a good topic to study before studying TFs. The AlexNet CNN (2012) is an excellent demo for understanding the basics before tackling TF UFA.

See also 2.2.1 CNN UFA (convolutional neural network universal function approximator).

The diagrams below are my originals. For more details, see _ziptieai_book1_CNN.docx on the Gdrive.

desc

desc


26.0523







(1) CNN

1) When we see a pic, we recognize patterns. CNNs don't do that. They calculate numbers.

Both achieve kind of the same result, but not really. The CNN is faking it.

But it's a good fake. It works most of the time. Just don't trust your life to it.

2) convolute = LAYERS[ (1) mixing pixel content (2) pooling ], then (3) NN detecting and (4) select.

After the first layer they are no longer pixels. (Crazy diagrams of what the CNN is doing... this is nonsense, like the 2D plots for TFs.) They are “hidden states”, which I called “vector” language.

3) At each layer the detection gets higher level.

4) Then finally the last layer: go through the detector to compute the most probable content.
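
The four steps above can be sketched in a few lines of pure Python. This is only a toy under stated assumptions: one convolution layer instead of a stack, a square image and kernel, hand-picked (not trained) weights, and two hypothetical labels, "edge" and "no edge".

```python
import math


def conv2d(image, kernel):
    """(1) Mix neighboring pixel content (square image and kernel assumed)."""
    k = len(kernel)
    n = len(image) - k + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(n)] for r in range(n)]


def relu(fmap):
    """Keep positive detections, zero out the rest."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool2(fmap):
    """(2) Pooling: keep the strongest value in each 2x2 block."""
    return [[max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, len(fmap[0]) - 1, 2)]
            for r in range(0, len(fmap) - 1, 2)]


def dense(vec, weights, bias):
    """(3) NN detecting: one fully connected layer."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 6x6 image with a vertical dark-to-bright edge.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
edge_kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

fmap = relu(conv2d(image, edge_kernel))          # 4x4, no longer pixels
pooled = max_pool2(fmap)                         # 2x2
features = [v for row in pooled for v in row]    # flatten to a vector

# Hand-picked weights, purely illustrative (a real CNN learns these).
labels = ["edge", "no edge"]
weights = [[0.5, 0.5, 0.5, 0.5], [-0.5, -0.5, -0.5, -0.5]]
bias = [0.0, 0.0]

probs = softmax(dense(features, weights, bias))
predicted = labels[probs.index(max(probs))]      # (4) select most probable
print(pooled, predicted)
```

After the convolution the values are detection strengths rather than pixel brightness, which is the "hidden state" idea in point 2.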

(2) how TF is similar

The similarities really struck me when I first started to dig into the details of TFs, but I never once read anything about these similarities.

5) 96 layers[ (1) mix context (2) softmax to “pool”, and (3) FFN (detection) ], then (4) Woutput
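
A minimal sketch of that layer recipe, to make the parallel with the CNN steps concrete. Everything here is a toy assumption: 2 layers instead of 96, vectors of size 2 instead of 12288, a single attention head with Q = K = V (no learned projections), no layer norm, and made-up weight values in `W1`, `W2`, and `W_output`.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attention(states):
    """(1) Mix context and (2) softmax-weight it. Assumes Q = K = V = states."""
    d = len(states[0])
    out = []
    for t, q in enumerate(states):
        context = states[:t + 1]  # causal: token t sees tokens 0..t only
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, context))
                    for i in range(d)])
    return out


def ffn(x, W1, W2):
    """(3) FFN: expand, ReLU, project back (the 'detection' step)."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)


def layer(states, W1, W2):
    mixed = attention(states)
    # Residual add: each step refines the running hidden state.
    states = [[s + m for s, m in zip(st, mx)] for st, mx in zip(states, mixed)]
    return [[s + f for s, f in zip(st, ffn(st, W1, W2))] for st in states]


# Three toy token vectors (hypothetical values).
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W1 = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]       # 2 -> 3
W2 = [[0.1, 0.0, 0.2], [0.0, 0.1, 0.2]]           # 3 -> 2
W_output = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 2 -> toy vocab of 3

N_LAYERS = 2  # GPT-3 uses 96
for _ in range(N_LAYERS):
    states = layer(states, W1, W2)

# (4) Woutput: project the last token's hidden state to next-token probabilities.
probs = softmax(matvec(W_output, states[-1]))
print(probs)
```

The shape of the loop is the point: mix, weight, detect, repeat, then one final projection, much like conv, pool, repeat, then the dense classifier.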






4 CNN/TF ALGORITHM DETAILS (MATH)

CNNs (Convolutional Neural Networks) that perform object recognition are much simpler than TFs. CNNs are an excellent stepping stone / analogy for understanding TFs.
## 26.0523
### in CNN models the core NN is the dense layers, right? in LLMs it's the TF FFN primarily?

Roughly speaking, yes. Your intuition is actually pretty good here.

#### CNNs

In CNNs:

- convolution layers primarily do:
  - feature extraction
  - spatial pattern detection
  - hierarchical visual abstraction.

Then:

- dense/fully-connected layers often become the:
  - final decision/classification engine.

So for classic CNNs like AlexNet:

- pixels
- → convolutions
- → feature maps
- → dense layers
- → output probabilities

Your statement:

- core NN is dense layers

is not entirely wrong conceptually, though:

- convolution layers are ALSO neural-network computation,
- and usually contain most of the spatial intelligence.

But yes:

- dense layers often perform the final semantic classification.

#### LLMs / Transformers

Your intuition here is ALSO very aligned with modern thinking. Many researchers now believe:

- FFNs contain much of the model's semantic knowledge/features.

while:

- attention heads primarily:
  - route information
  - move context between tokens
  - coordinate interactions
  - dynamically select relevant context.

So your mental model:

- AHs = routing/context mixing
- FFNs = semantic detectors/features

is actually pretty reasonable.

#### Very rough analogy

CNN

- convolution
  - ≈ spatial feature extraction
- dense layers
  - ≈ final semantic classification

Transformer

- attention
  - ≈ contextual routing/mixing
- FFN
  - ≈ semantic feature/detector computation

That is not exact academically, but conceptually it is actually fairly useful.

### right, convolutions are "spatial" intelligence: gathering info from related neighbors, which are spatial. in TF that to me is AH, because it draws via the KV mechanism the most related neighbors. CNN encodes results in the new "pixel" map (not really pixels, but detections via spatial filters), TF encodes in the vector language (12288 FP #s for GPT-3) "storyline".

Yes, honestly this is becoming a pretty coherent conceptual model. You are basically saying:

#### CNN

```
local spatial neighborhoods
→ convolution filters
→ feature maps
→ hierarchical visual abstractions
```

The “neighbor relationships” are:

- physical/spatial nearby pixels.

#### Transformer

```
token relationships
→ attention/KV retrieval
→ contextual aggregation
→ storyline refinement
```

The “neighbor relationships” are:

- semantic/contextual rather than spatial.

That is actually a pretty good intuition-builder.

#### Your analogy

CNN

- convolution gathers nearby spatial information

TF

- attention gathers related contextual information

is conceptually reasonable.

#### This is especially strong

Your phrase:

- storyline

for:

- evolving hidden state/context representation

is actually quite good pedagogically. Because the TF hidden states ARE gradually refining:

- semantic context
- relationships
- interpretation
- continuation constraints

across layers.

#### Your distinction is important

CNN output map

- spatial feature activations

TF hidden state / VLang

- semantic contextual representation

That is a meaningful distinction.

#### One thing to be careful about

This sentence:

- AH draws the most related neighbors

is intuitively good, but technically:

- attention does not literally search for “most related”
- it computes learned weighted contextual interactions.

But for conceptual explanation:

- related contextual neighbors

is actually pretty reasonable.

#### Overall

Your mental model is increasingly:

```
CNN: spatial contextual aggregation
TF: semantic contextual aggregation
```

That is honestly a useful high-level intuition.
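
To make the "weighted contextual interactions" caveat concrete, here is a small pure-Python contrast. It is a sketch only: the signal, the smoothing kernel, and the query/key vectors are invented, and the learned projections of a real attention head are left out. A convolution mixes a fixed window of neighbors with fixed weights, while attention computes a weight for every token from the content itself, so less related tokens get a smaller weight rather than being excluded.

```python
import math


def conv1d_neighbors(values, kernel):
    """Each output mixes a fixed window of spatial neighbors with fixed weights."""
    k = len(kernel)
    return [sum(values[i + j] * kernel[j] for j in range(k))
            for i in range(len(values) - k + 1)]


def attention_weights(query, keys):
    """Softmax of scaled dot products: one weight per token, summing to 1."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# CNN side: the neighbors are decided by position, the weights by the kernel.
signal = [1, 2, 3, 4, 5]
smoothing_kernel = [0.25, 0.5, 0.25]   # hypothetical fixed filter
smoothed = conv1d_neighbors(signal, smoothing_kernel)
print(smoothed)  # [2.0, 3.0, 4.0]

# TF side: the weights are decided by content (query vs. each key).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical token vectors
query = [1.0, 0.0]
weights = attention_weights(query, keys)
print(weights)
```

Tokens 0 and 2 line up with the query and get the larger, equal weights; token 1 is orthogonal to it and still receives a nonzero share.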