

The core of CNNs:

  • Convolutions (localized spatial filters)
  • NNs

See _ziptieai_book1_CNN.docx on the Gdrive.

drones


Demo 4: 2.2.1 D4 CNN image classifier (26.0528)

Runs on render (CPU mode).

drones


Demo 4b: 2.2.1b D4 CNN algorithm details (26.0528)

drones


Demo 9: 2.2.2 D9 CNN defect detector (26.0606)

drones


Demo 10: 2.2.3 D10 CNN feature map visualization (26.0606)

drones








4 CNN concepts


(1) CNN ALGORITHM core

CNNs (Convolutional Neural Networks) that perform object recognition are much simpler than TFs. CNNs are an excellent stepping stone / analogy for understanding TFs.

1) We see pics; CNNs only calculate numbers (they see nothing)

Both achieve roughly the same result, but not really: the CNN is faking it.

But it's a good fake. It works most of the time. Just don't trust your life to it.

2) Convolve

  • LAYERS
    • (1) mixing pixel content
    • (2) pooling and
  • (3) NN detecting
  • (4) select.

After the first layer they are no longer pixels. (The usual diagrams of what the CNN is doing are nonsense, like the 2D plots for TFs.) They are “hidden states”, which I call the “vector language”.

3) At each layer the detection gets higher level

4) Last layer: Go thru detector to compute most probable content
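
The four steps above can be sketched in a few lines of NumPy. This is only a toy sketch, not AlexNet: the 8x8 "image", the single 3x3 filter, the three output classes and all weights are made-up, untrained values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(image, kernel):
    """Valid 2D cross-correlation: each output value mixes one local patch."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Keep the strongest response in each non-overlapping size x size block."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    trimmed = fmap[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

image = rng.random((8, 8))              # the "picture" is just a grid of numbers
kernel = rng.standard_normal((3, 3))    # hypothetical, untrained filter
W_dense = rng.standard_normal((3, 9))   # hypothetical dense layer: 9 features -> 3 classes

fmap = np.maximum(conv2d(image, kernel), 0.0)  # (1) mix pixel content (+ ReLU) -> 6x6
pooled = max_pool(fmap)                        # (2) pooling -> 3x3
probs = softmax(W_dense @ pooled.ravel())      # (3) NN detecting -> class probabilities
predicted = int(np.argmax(probs))              # (4) select the most probable class
print(fmap.shape, pooled.shape, predicted)
```

Note that after step (1) the array `fmap` no longer holds pixels: each entry is a filter response, which is the point made in the note above about hidden states.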

(2) How TF is similar

The similarities really struck me when I first started to dig into the details of TFs, but I never once read anything about these similarities.

  • 96 layers
    • (1) mix context
    • (2) softmax to “pool”, and
    • (3) FFN (detection)
  • (4) Woutput
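
The same four-step shape can be sketched for a TF. This is a toy sketch under loud assumptions: a single attention head, no layer norm, random untrained weights, and tiny sizes (2 layers and 8-number vectors instead of GPT-3's 96 layers and 12288). The names `attention`, `ffn`, `rand` and `W_output` are illustrative, not taken from any library.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sizes, for illustration only (GPT-3: 96 layers, d_model = 12288).
n_tokens, d_model, d_ff, vocab, n_layers = 4, 8, 32, 10, 2

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, Wq, Wk, Wv):
    """(1) mix context; (2) softmax weights decide how much of each token to blend in."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d_model)
    future = np.triu(np.ones((len(x), len(x)), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # causal mask: no looking ahead
    weights = softmax(scores)
    return weights @ v, weights

def ffn(x, W1, W2):
    """(3) FFN: per-token detection step (ReLU used here for simplicity)."""
    return np.maximum(x @ W1, 0.0) @ W2

def rand(*shape):
    return rng.standard_normal(shape) * 0.1     # hypothetical, untrained weights

x = rand(n_tokens, d_model)                     # one vector per token
for _ in range(n_layers):
    mixed, weights = attention(x, rand(d_model, d_model),
                               rand(d_model, d_model), rand(d_model, d_model))
    x = x + mixed                               # residual add after attention
    x = x + ffn(x, rand(d_model, d_ff), rand(d_ff, d_model))

W_output = rand(d_model, vocab)                 # (4) Woutput: vector -> token scores
probs = softmax(x[-1] @ W_output)
next_token = int(np.argmax(probs))
print(x.shape, next_token)
```

As in the CNN case, the last step is a plain matrix multiply followed by picking the most probable output; everything before it refines the per-token vectors.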








The AI in the AI drones was object recognition using CNNs (convolutional neural networks) running on the Nvidia Jetson Nano and on the PI computer. CNNs are similar in many ways to LLM transformers (TFs). They are a good topic to study before studying TFs, and the AlexNet (2012) CNN is an excellent demo for understanding the basics before tackling TF UFA.

See also the chat 2.2.1 CNN UFA.

The diagrams below are my original. For more details, see _ziptieai_book1_CNN.docx on the Gdrive.

desc

desc






4 CNN/TF ALGORITHM DETAILS (MATH)

CNNs (Convolutional Neural Networks) that perform object recognition are much simpler than TFs. CNNs are an excellent stepping stone / analogy for understanding TFs.

## 26.0523

### in CNN models the core NN is the dense layers, right? in LLMs it's the TF FFN primarily?

Roughly speaking, yes. Your intuition is actually pretty good here.

#### CNNs

In CNNs:

- convolution layers primarily do:
  - feature extraction
  - spatial pattern detection
  - hierarchical visual abstraction.

Then:

- dense/fully-connected layers often become the:
  - final decision/classification engine.

So for classic CNNs like AlexNet:

- pixels
- → convolutions
- → feature maps
- → dense layers
- → output probabilities

Your statement:

- core NN is dense layers

is not entirely wrong conceptually, though:

- convolution layers are ALSO neural-network computation,
- and usually contain most of the spatial intelligence.

But yes:

- dense layers often perform the final semantic classification.

#### LLMs / Transformers

Your intuition here is ALSO very aligned with modern thinking. Many researchers now believe:

- FFNs contain much of the model's semantic knowledge/features.

while:

- attention heads primarily:
  - route information
  - move context between tokens
  - coordinate interactions
  - dynamically select relevant context.

So your mental model:

- AHs = routing/context mixing
- FFNs = semantic detectors/features

is actually pretty reasonable.

#### Very rough analogy

CNN

- convolution
  - ≈ spatial feature extraction
- dense layers
  - ≈ final semantic classification

Transformer

- attention
  - ≈ contextual routing/mixing
- FFN
  - ≈ semantic feature/detector computation

That is not exact academically, but conceptually it is actually fairly useful.

### right, convolutions are "spatial" intelligence.. gathering info from related neighbors, which are spatial. in TF that to me is AH, because it draws via KV mechanism the most related neighbors.

CNN encodes results in the new "pixel" map (not really pixels, but detections via spatial and filters), TF encodes in the vector language (12288 FP #s for gpt-3) "storyline".

Yes, honestly this is becoming a pretty coherent conceptual model. You are basically saying:

#### CNN

```
local spatial neighborhoods
→ convolution filters
→ feature maps
→ hierarchical visual abstractions
```

The “neighbor relationships” are:

- physical/spatial nearby pixels.

#### Transformer

```
token relationships
→ attention/KV retrieval
→ contextual aggregation
→ storyline refinement
```

The “neighbor relationships” are:

- semantic/contextual rather than spatial.

That is actually a pretty good intuition-builder.

#### Your analogy

CNN

- convolution gathers nearby spatial information

TF

- attention gathers related contextual information

is conceptually reasonable.

#### This is especially strong

Your phrase:

- storyline

for:

- evolving hidden state/context representation

is actually quite good pedagogically. Because the TF hidden states ARE gradually refining:

- semantic context
- relationships
- interpretation
- continuation constraints

across layers.

#### Your distinction is important

CNN output map

- spatial feature activations

TF hidden state / VLang

- semantic contextual representation

That is a meaningful distinction.

#### One thing to be careful about

This sentence:

- AH draws the most related neighbors

is intuitively good, but technically:

- attention does not literally search for “most related”
- it computes learned weighted contextual interactions.

But for conceptual explanation:

- related contextual neighbors

is actually pretty reasonable.

#### Overall

Your mental model is increasingly:

```
CNN: spatial contextual aggregation
TF: semantic contextual aggregation
```

That is honestly a useful high-level intuition.
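
To make the "spatial neighbors vs. contextual neighbors" contrast concrete, here is a small sketch with made-up numbers. A 1D convolution blends a fixed window of adjacent positions using fixed weights, while attention blends all positions using softmax weights computed from the data. The function names and the example `query`/`keys` values are hypothetical; real attention applies learned projections first.

```python
import numpy as np

def conv1d_neighbors(x, kernel):
    """Each output mixes a fixed window of adjacent positions with fixed weights."""
    k = len(kernel)
    return np.array([x[i:i + k] @ kernel for i in range(len(x) - k + 1)])

def attention_weights(query, keys):
    """Soft weights over ALL positions: a weighted blend, not a hard 'most related' pick."""
    scores = keys @ query / np.sqrt(len(query))
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Convolution: neighbors are defined by position (spatial).
signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
smoothed = conv1d_neighbors(signal, np.array([0.25, 0.5, 0.25]))

# Attention: "neighbors" are defined by content (query/key similarity).
query = np.array([1.0, 0.0])
keys = np.array([[1.0, 0.0],   # similar to the query
                 [0.0, 1.0],   # dissimilar
                 [1.0, 1.0]])  # partly similar
weights = attention_weights(query, keys)
print(smoothed, weights)
```

Every position gets a nonzero attention weight, including the dissimilar one; that is the sense in which attention computes weighted interactions and does not literally search for the most related neighbors.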


26.0529