2.2.1 UFA CNN dense layers
25.0523 work in progress…….
tell me about the 3 layers of dense 4096 fully connected neurons in alexnet that output 1 of 1000 classes
AlexNet ended with 3 large fully connected (dense) layers after the convolution/pooling stages. Very roughly:
- image
- → convolution layers
- → pooling
- → feature maps
- → flatten
- → FC1 (4096)
- → FC2 (4096)
- → FC3 (1000)
- → softmax
- → class probabilities
What the FC layers did
The earlier convolution layers detected:
- edges
- textures
- shapes
- object parts
The dense layers then combined those detected features into:
- higher-level concepts
- object hypotheses
- final classifications
FC1 (4096 neurons)
Input:
- flattened convolution features
This layer learned large combinations of visual features. Example intuition:
- fur texture
- round ears
- eye pattern
might strongly activate some neurons. This was already much more abstract than raw pixels.
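To make "combinations of features" concrete, here is a minimal sketch of what a single dense layer computes: each neuron takes a weighted sum over all inputs, adds a bias, and passes the result through a nonlinearity. AlexNet used ReLU for this. The sizes, weights and feature values below are toy assumptions for illustration, not AlexNet's real parameters.

```python
import random

def relu(values):
    """Element-wise ReLU: negative values become 0."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """One fully connected layer: every output neuron sees every input value."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

# Toy sizes for illustration only (the real FC1 has 4096 neurons).
n_inputs, n_neurons = 6, 4
rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_neurons)]
biases = [0.0] * n_neurons

# Hypothetical flattened conv features (e.g. "fur", "round ears", ...).
features = [0.9, 0.0, 0.4, 0.0, 0.7, 0.1]
activations = relu(dense(features, weights, biases))
print(activations)
```

A neuron whose large weights line up with the "fur", "round ears" and "eye pattern" inputs would fire strongly when all three are present, which is the intuition described above.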
FC2 (4096 neurons)
This layer recombined FC1 abstractions into even higher-level object concepts. Example intuition:
- cat-like
- dog-like
- vehicle-like
- face-like
More semantic/object-level structure emerged here.
FC3 (1000 neurons)
Final classifier layer. One neuron per ImageNet class:
- cat
- dog
- car
- banana
- airplane …
The output values were logits/scores, not probabilities yet.
Softmax
Softmax converted the 1000 logits into probabilities:
- 0.91 cat
- 0.03 dog
- 0.01 fox
- …
The class with the highest probability became the predicted class.
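The logits-to-probabilities step can be sketched in a few lines. This is a plain-Python illustration with four made-up classes and made-up logits. AlexNet's real output has 1000 entries, but the arithmetic is the same.

```python
import math

def softmax(logits):
    """Convert raw scores into positive values that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-class toy problem.
classes = ["cat", "dog", "fox", "banana"]
logits = [5.0, 1.6, 0.5, -2.0]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
print(classes[best], round(probs[best], 2))
```

Softmax preserves the ordering of the logits, so the predicted class is the same whether you take the largest logit or the largest probability.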
Important connection to transformers
These FC layers are conceptually similar to transformer FFNs:
- feature detection
- → abstraction formation
- → recombination
- → output refinement
The main difference:
CNNs:
- operate on spatial image features
Transformers:
- operate on token hidden-state representations
But both heavily rely on:
- large dense nonlinear feature-processing layers
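A rough sketch of the comparison: the same "dense → nonlinearity → dense" block can be applied once to an image's flattened feature vector (CNN classifier style) or independently to every token's hidden state (transformer FFN style). All weights and inputs here are toy assumptions. Real transformer FFNs usually use a different activation such as GELU and sit inside residual connections, which this sketch leaves out.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def dense(x, W, b):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def two_layer_block(x, W1, b1, W2, b2):
    """dense -> ReLU -> dense: the structure shared by both settings."""
    return dense(relu(dense(x, W1, b1)), W2, b2)

# Toy weights: 3 inputs -> 4 hidden units -> 3 outputs.
W1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5], [-0.6, 0.1, 0.9], [0.2, 0.2, 0.2]]
b1 = [0.0, 0.1, 0.0, -0.1]
W2 = [[0.4, -0.3, 0.2, 0.1], [0.0, 0.5, -0.1, 0.3], [0.7, 0.1, 0.1, -0.2]]
b2 = [0.0, 0.0, 0.0]

# CNN-style: applied ONCE to the single flattened feature vector of an image.
image_features = [1.0, 0.5, -0.3]
image_out = two_layer_block(image_features, W1, b1, W2, b2)

# Transformer-style: the SAME weights applied independently to each token.
tokens = [[1.0, 0.5, -0.3], [0.2, -0.1, 0.4]]
token_outs = [two_layer_block(t, W1, b1, W2, b2) for t in tokens]

print(image_out)
print(token_outs)
```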
what does “flatten” mean?
In AlexNet, the convolution layers produce many 2D feature maps. Example:
- 256 feature maps
- 13 × 13 each
So internally you have something like:
- 256 × 13 × 13
That is still spatial/image-style structure.
Flatten
“Flatten” means:
- convert all values into one long vector
Example:
- 256 × 13 × 13 = 43264 values
becomes:
- [ x1 x2 x3 x4 … x43264 ]
one giant 1D vector. That vector is then fed into the dense FC layer. So:
- feature maps
- → flatten
- → giant vector
- → dense neurons
Flatten destroys the explicit 2D image geometry and turns the detected visual features into:
- general feature inputs for dense processing
This is somewhat analogous to transformer hidden states becoming abstract feature vectors.
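Here is a small sketch of flattening, using toy sizes so the result is easy to inspect. It assumes row-major order (channel first, then row, then column), which is the usual default in array libraries. The helper name `flat_index` is made up for this example.

```python
channels, height, width = 2, 3, 3  # toy sizes; the example above is 256 x 13 x 13

# Each value encodes its own (channel, row, column) position as c*100 + h*10 + w.
feature_maps = [
    [[c * 100 + h * 10 + w for w in range(width)] for h in range(height)]
    for c in range(channels)
]

# Flatten: walk channels, then rows, then columns, into one long list.
flat = [value for fmap in feature_maps for row in fmap for value in row]

def flat_index(c, h, w):
    """Position of feature_maps[c][h][w] inside the flattened vector."""
    return c * height * width + h * width + w

print(len(flat))                    # channels * height * width
print(flat[flat_index(1, 2, 0)])    # same value as feature_maps[1][2][0]
```

No value is lost: every input of the dense layer still corresponds to one fixed (channel, row, column) position. What goes away is the layout itself, because the dense layer treats the vector as an unordered set of inputs with no notion of neighbouring pixels.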
i thought a max pooling was done on the 256 13x13 feature maps. before inputting to dense layers. result: 256 5x5
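Yes, you are right that a max-pooling step sits between the last 13 × 13 feature maps and the dense layers, although the result is 256 maps of 6 × 6 rather than 5 × 5 (a 3 × 3 window with stride 2 gives (13 − 3) / 2 + 1 = 6). In AlexNet there are multiple convolution + max-pooling stages before flattening. So the pipeline is more like: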
- image
- → conv
- → max pool
- → conv
- → max pool
- → conv
- → conv
- → conv
- → max pool
- → flatten
- → FC layers
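The spatial sizes along the pipeline above can be traced with the standard output-size formula, floor((size + 2·padding − kernel) / stride) + 1. The kernel, stride, padding and channel values below are the commonly cited ones for the original AlexNet, with an assumed 227 × 227 input. Other implementations use slightly different numbers, so treat this as a sketch rather than a specification.

```python
def out_size(size, kernel, stride, padding=0):
    """Spatial size after a conv or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

# (name, output channels or None for pooling, kernel, stride, padding)
layers = [
    ("conv1", 96, 11, 4, 0),
    ("pool1", None, 3, 2, 0),
    ("conv2", 256, 5, 1, 2),
    ("pool2", None, 3, 2, 0),
    ("conv3", 384, 3, 1, 1),
    ("conv4", 384, 3, 1, 1),
    ("conv5", 256, 3, 1, 1),
    ("pool3", None, 3, 2, 0),
]

size, channels = 227, 3  # assumed input: 227 x 227 RGB image
shapes = {}
for name, out_channels, kernel, stride, padding in layers:
    size = out_size(size, kernel, stride, padding)
    if out_channels is not None:  # pooling keeps the channel count
        channels = out_channels
    shapes[name] = (channels, size, size)
    print(f"{name}: {channels} x {size} x {size}")

flattened = channels * size * size
print("flattened length:", flattened)
```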
Near the end, AlexNet produces something roughly like:
- 256 feature maps
- 6×6 each
(depending on implementation/details).
Then flatten converts:
- 256 × 6 × 6
into:
- 9216 values
which feed the first dense layer. So yes: max pooling reduces the spatial dimensions BEFORE flattening.
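That is important because otherwise the first dense layer would be several times larger: flattening the unpooled 256 × 13 × 13 maps gives 43264 inputs instead of 9216, so FC1 would need about 4.7 times as many weights.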
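To put numbers on the size of the dense layers, this short calculation counts weights plus biases for each FC layer, and for a hypothetical FC1 fed directly from unpooled 13 × 13 maps. The helper `dense_params` is made up for this example.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output neuron."""
    return n_in * n_out + n_out

pooled_inputs = 256 * 6 * 6       # 9216, after the final max pool
unpooled_inputs = 256 * 13 * 13   # 43264, if the final max pool were skipped

fc1 = dense_params(pooled_inputs, 4096)
fc2 = dense_params(4096, 4096)
fc3 = dense_params(4096, 1000)
fc1_unpooled = dense_params(unpooled_inputs, 4096)

print(f"FC1: {fc1:,}")
print(f"FC2: {fc2:,}")
print(f"FC3: {fc3:,}")
print(f"FC total: {fc1 + fc2 + fc3:,}")
print(f"FC1 without final pooling: {fc1_unpooled:,}")
```

Under these assumptions the three dense layers hold roughly 58.6 million parameters, and FC1 alone would grow to about 177 million without the last pooling step.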