

25.0523 work in progress…….


tell me about the 3 layers of dense 4096 fully connected neurons in alexnet that output 1 of 1000 classes

AlexNet ended with 3 large fully connected (dense) layers after the convolution/pooling stages. Very roughly:

  • image
  • → convolution layers
  • → pooling
  • → feature maps
  • → flatten
  • → FC1 (4096)
  • → FC2 (4096)
  • → FC3 (1000)
  • → softmax
  • → class probabilities

What the FC layers did

The earlier convolution layers detected:

  • edges
  • textures
  • shapes
  • object parts

The dense layers then combined those detected features into:

  • higher-level concepts
  • object hypotheses
  • final classifications

FC1 (4096 neurons)

Input:

  • flattened convolution features

This layer learned large combinations of visual features. Example intuition:

  • fur texture
  • round ears
  • eye pattern

might strongly activate some neurons. This was already much more abstract than raw pixels.

FC2 (4096 neurons)

This layer recombined FC1 abstractions into even higher-level object concepts. Example intuition:

  • cat-like
  • dog-like
  • vehicle-like
  • face-like

More semantic/object-level structure emerged here.

FC3 (1000 neurons)

Final classifier layer. One neuron per ImageNet class:

  • cat
  • dog
  • car
  • banana
  • airplane …

The output values were logits/scores, not probabilities yet.
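The FC1 → FC2 → FC3 chain above can be sketched as three matrix-vector products. This is a minimal illustration, not AlexNet's actual implementation: the layer sizes are shrunk to toy values, the weights are random rather than trained, and the helper names (`dense`, `relu`, `init_layer`, `fc_head`) are made up for this example. The ReLU nonlinearity between layers is an added detail (AlexNet did use ReLU, but the text above does not discuss it).

```python
import random

def dense(x, weights, biases):
    """One fully connected layer: every output neuron sees ALL inputs."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def relu(v):
    """Elementwise nonlinearity: negative activations become zero."""
    return [max(0.0, a) for a in v]

def init_layer(n_in, n_out, rng):
    """Random (untrained) weights, one row of n_in weights per output neuron."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def fc_head(x, layers):
    """Apply the dense layers in order, with ReLU after all but the last."""
    for i, (w, b) in enumerate(layers):
        x = dense(x, w, b)
        if i < len(layers) - 1:
            x = relu(x)
    return x  # raw logits: one score per class, not yet probabilities

rng = random.Random(0)
# Toy sizes standing in for: flattened features -> 4096 -> 4096 -> 1000
sizes = [36, 16, 16, 5]
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
features = [rng.random() for _ in range(sizes[0])]
logits = fc_head(features, layers)
print(len(logits))  # one logit per (toy) class
```

The last layer is deliberately left without a nonlinearity, which is why its outputs are unbounded scores rather than probabilities.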

Softmax

Softmax converted the 1000 logits into probabilities:

  • 0.91 cat
  • 0.03 dog
  • 0.01 fox
  • …

The class with the highest probability became the predicted class.
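As a small sketch of the softmax step, the function below exponentiates each logit and divides by the sum, so the outputs are positive and add up to 1. The three labels and the logit values are invented for illustration (a stand-in for the 1000 ImageNet classes), and subtracting the maximum logit is a standard numerical-stability trick that does not change the result.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                      # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["cat", "dog", "fox"]           # hypothetical 3-class stand-in
logits = [4.0, 0.6, -0.5]                # made-up scores from the last FC layer
probs = softmax(logits)

# The predicted class is the one with the highest probability.
predicted = labels[probs.index(max(probs))]
print(predicted)
```

Because softmax preserves the ordering of the logits, the predicted class is the same as the class with the largest logit; softmax only matters when the probabilities themselves are needed.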

Important connection to transformers

These FC layers are conceptually similar to transformer FFNs:

  • feature detection
  • → abstraction formation
  • → recombination
  • → output refinement

The main difference:

CNNs:

  • operate on spatial image features

Transformers:

  • operate on token hidden-state representations

But both heavily rely on:

  • large dense nonlinear feature-processing layers

what does “flatten” mean?

In AlexNet, the convolution layers produce many 2D feature maps. Example:

  • 256 feature maps
  • 13 × 13 each

So internally you have something like:

  • 256 × 13 × 13

That is still spatial/image-style structure.

Flatten

“Flatten” means:

  • convert all values into one long vector

Example:

  • 256 × 13 × 13
  • =
  • 43264 values

becomes:

  • [ x1 x2 x3 x4 … x43264 ]

one giant 1D vector. That vector is then fed into the dense FC layer. So:

  • feature maps
  • → flatten
  • → giant vector
  • → dense neurons

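Flatten discards the explicit 2D grid layout (each value still lands at a fixed position in the vector, but the dense layer no longer treats neighboring pixels specially) and turns the detected visual features into: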

  • general feature inputs for dense processing

This is somewhat analogous to transformer hidden states becoming abstract feature vectors.
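To make the flatten step concrete, here is a minimal sketch using nested Python lists for the 256 × 13 × 13 example. The channel-major ordering (all of channel 0 first, row by row) is an assumption chosen for illustration; the text above does not say which order is used, and `flatten` and `flat_index` are names invented for this example.

```python
def flatten(feature_maps):
    """Flatten nested [channel][row][col] lists into one 1D list."""
    return [v for channel in feature_maps for row in channel for v in row]

def flat_index(c, y, x, height, width):
    """Position in the flat vector of the value at channel c, row y, column x."""
    return (c * height + y) * width + x

channels, height, width = 256, 13, 13

# Each cell stores its own (channel, row, col) so the mapping can be checked.
maps = [[[(c, y, x) for x in range(width)]
         for y in range(height)]
        for c in range(channels)]

vector = flatten(maps)
print(len(vector))                               # 43264
print(vector[flat_index(3, 4, 5, height, width)])  # (3, 4, 5)
```

The second print shows that nothing is lost: every spatial location still has a fixed slot in the vector, it is only the 2D arrangement that is gone.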

i thought a max pooling was done on the 256 13x13 feature maps. before inputting to dense layers. result: 256 5x5

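Yes, you are correct that max pooling comes before the dense layers, although the pooled maps come out as 6×6 rather than 5×5 (a 3×3 window with stride 2 on a 13×13 map gives 6×6). In AlexNet there are multiple convolution + max-pooling stages before flattening. So the pipeline is more like: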

  • image
  • → conv
  • → max pool
  • → conv
  • → max pool
  • → conv
  • → conv
  • → conv
  • → max pool
  • → flatten
  • → FC layers
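The pipeline above can be traced numerically with the usual output-size formula, floor((size + 2·padding − kernel) / stride) + 1. The kernel sizes, strides, paddings, and channel counts below are the commonly cited values for the original AlexNet and are not stated in this text, so treat them as assumptions; the 227 × 227 input is likewise an assumption (it is the size that makes the arithmetic work out), and other implementations differ in the details.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer on a square input."""
    return (size + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding, output channels; None = pooling keeps channels)
# Assumed hyperparameters, following the commonly cited AlexNet configuration.
stages = [
    ("conv1", 11, 4, 0, 96),
    ("pool1", 3, 2, 0, None),
    ("conv2", 5, 1, 2, 256),
    ("pool2", 3, 2, 0, None),
    ("conv3", 3, 1, 1, 384),
    ("conv4", 3, 1, 1, 384),
    ("conv5", 3, 1, 1, 256),
    ("pool3", 3, 2, 0, None),
]

size, channels = 227, 3   # assumed input: 3-channel 227 x 227 image
trace = []
for name, kernel, stride, padding, out_channels in stages:
    size = out_size(size, kernel, stride, padding)
    if out_channels is not None:
        channels = out_channels
    trace.append((name, channels, size))
    print(f"{name}: {channels} x {size} x {size}")

flattened = channels * size * size
print("flattened length:", flattened)
```

Under these assumptions the last convolution yields 256 maps of 13 × 13 and the final pooling step reduces them to 6 × 6 before flattening.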

Near the end, AlexNet produces something roughly like:

  • 256 feature maps
  • 6×6 each

(depending on implementation/details).

Then flatten converts:

  • 256 × 6 × 6

into:

  • 9216 values

which feed the first dense layer. So yes: max pooling reduces the spatial dimensions BEFORE flattening.

That is important because otherwise the first dense layer would be much larger: each of its 4096 neurons would need 43,264 input weights instead of 9,216, roughly 4.7 times as many.
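The effect on layer size can be checked with a few lines of arithmetic. This sketch counts weights plus biases for a dense layer; `dense_params` is a helper name made up for this example, and the counts cover only the three FC layers as described here, ignoring the convolution layers.

```python
def dense_params(n_in, n_out):
    """Parameters in a dense layer: one weight per input-output pair, plus biases."""
    return n_in * n_out + n_out

# First dense layer, with and without the final max pooling.
with_pool = dense_params(256 * 6 * 6, 4096)       # 9216 inputs
without_pool = dense_params(256 * 13 * 13, 4096)  # 43264 inputs

# All three FC layers, using the pooled 9216-value input.
fc_total = (with_pool
            + dense_params(4096, 4096)
            + dense_params(4096, 1000))

print(f"FC1 with pooling:    {with_pool:,}")      # 37,752,832
print(f"FC1 without pooling: {without_pool:,}")   # 177,213,440
print(f"FC1 + FC2 + FC3:     {fc_total:,}")       # 58,631,144
```

Even with pooling, the first dense layer alone accounts for well over half of the parameters in the three FC layers, which is why shrinking its input matters so much.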