Paul Louppe

Machine learning engineer at Ethiqais

French & English · Nancy · GitHub

Chihuahua or Muffin?

A custom CNN to find out if a picture is a chihuahua or a muffin.

My project, Chihuahua or Muffin, was a fun yet challenging attempt at training a neural network to tell the difference between pictures of small dogs (chihuahuas) and muffins, which can look surprisingly similar. I started by downloading about 5,000 images from Kaggle via their API, unzipping them, and organizing them into structured folders. I resized each image to 224×224 pixels and converted it into a tensor. I then split the data into a training set (80%) and a validation set (20%), keeping a separate held-out set for final testing.
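The preprocessing and split described above can be sketched roughly as follows. This is not the project's actual code: the random tensors stand in for the Kaggle images (with real files on disk you would use `torchvision.datasets.ImageFolder` with a `Resize` transform), and the dataset size here is shrunk for illustration.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import TensorDataset, random_split

# Stand-in for the Kaggle images: 100 random RGB images with binary labels
# (0 = chihuahua, 1 = muffin). The real project used ~5,000 real photos.
raw = torch.randn(100, 3, 300, 300)
labels = torch.randint(0, 2, (100,))

# Resize every image to 224x224, as in the write-up. With image files you
# would do this with torchvision.transforms.Resize instead of interpolate.
images = F.interpolate(raw, size=(224, 224), mode="bilinear", align_corners=False)
dataset = TensorDataset(images, labels)

# 80/20 train/validation split; the final test set lives in its own folder.
n_train = int(0.8 * len(dataset))
train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])
print(images.shape, len(train_set), len(val_set))
```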

First, I tried building a basic two-layer convolutional neural network (CNN). This model used Xavier initialization, ReLU activations, and max-pooling layers. Even though it had only around 215,000 parameters, it reached between 75% and 85% accuracy on the validation set after just 10 epochs, which showed me that even a simple model could separate the two classes reasonably well. Motivated by this, I designed a deeper four-layer CNN, thinking it would capture more detailed features. Surprisingly, this deeper model performed only about as well as random guessing (around 50% accuracy). I learned from this that making a model more complex without careful regularization doesn't always help—it can actually make things worse.
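A minimal PyTorch sketch of such a two-layer CNN is shown below. The channel widths (16 and 32) are my assumptions, not the project's actual values, so the parameter count (~206k here) only approximates the ~215k mentioned above; the Xavier initialization, ReLU activations, and max-pooling match the description.

```python
import torch
import torch.nn as nn

class TwoLayerCNN(nn.Module):
    """Two conv blocks with ReLU and max pooling, then a linear classifier.
    Layer widths are assumed, so the total parameter count is approximate."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 224 -> 112
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 112 -> 56
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)
        # Xavier (Glorot) initialization for every conv and linear weight.
        for m in self.modules():
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                nn.init.xavier_uniform_(m.weight)
                nn.init.zeros_(m.bias)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = TwoLayerCNN()
n_params = sum(p.numel() for p in model.parameters())
out = model(torch.randn(4, 3, 224, 224))
print(n_params, out.shape)  # 205794 torch.Size([4, 2])
```

Note that almost all of the parameters sit in the final linear layer, since the flattened 56×56 feature map is large; this is typical for small CNNs without global pooling.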

So, I went back to basics and developed three different CNN architectures:

  1. Two-layer CNN (around 214,000 parameters) – quick to train and simple, achieving about 85% accuracy on the test set.

  2. Three-layer CNN (around 133,000 parameters) – added one more convolutional block, trained for 20 epochs, and reached my highest accuracy of 87.25%. The training and validation graphs clearly showed steady improvement and good generalization.

  3. CNN with Batch Normalization and Dropout – even though I carefully adjusted these techniques, this model stayed stuck around 55% accuracy, indicating that too much regularization can hinder learning.
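The best-performing three-layer architecture might look something like the sketch below. Again, the channel widths (16/32/64) are assumptions, so the parameter count with these widths (~124k) only approximates the ~133k reported.

```python
import torch
import torch.nn as nn

class ThreeLayerCNN(nn.Module):
    """Three conv blocks with ReLU and max pooling, then a linear classifier.
    Channel widths are assumed; with them the model has ~124k parameters."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 224 -> 112
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 112 -> 56
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 56 -> 28
        )
        self.classifier = nn.Linear(64 * 28 * 28, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = ThreeLayerCNN()
n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # 123938
```

The extra pooling stage shrinks the flattened feature map from 56×56 to 28×28, which is why the deeper model can end up with fewer parameters than the two-layer one, consistent with the counts listed above.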

In the final evaluation, my best-performing three-layer CNN achieved 87.25% accuracy on the test set. I created a confusion matrix to understand where the model was making mistakes. Most errors came from unusual images, like drawings or heavily distorted photos, or from issues caused by aggressive resizing. Without these problematic images, I think my model's accuracy would probably be over 90%.
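A confusion matrix like the one used in this analysis can be built in a few lines. This is a generic sketch, not the project's evaluation code; the toy labels below are made up for illustration (0 = chihuahua, 1 = muffin).

```python
import torch

def confusion_matrix(preds, targets, num_classes=2):
    """Rows are the true class, columns the predicted class."""
    cm = torch.zeros(num_classes, num_classes, dtype=torch.long)
    for t, p in zip(targets, preds):
        cm[t, p] += 1
    return cm

# Toy predictions for six images (0 = chihuahua, 1 = muffin).
targets = torch.tensor([0, 0, 1, 1, 1, 0])
preds   = torch.tensor([0, 1, 1, 1, 0, 0])

cm = confusion_matrix(preds, targets)
accuracy = cm.diag().sum().item() / cm.sum().item()
print(cm.tolist())  # [[2, 1], [1, 2]]
print(accuracy)
```

The off-diagonal cells are the interesting ones: inspecting the images that land there is exactly how the unusual drawings and heavily distorted photos were identified as the main error source.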

Conclusion:
During this project, I learned that the deepest or most complicated models aren't always the best choice. Starting simple, carefully analyzing performance, and slowly adding complexity turned out to be the most effective strategy. My final three-layer CNN successfully balanced complexity and generalization, especially important when dealing with visually similar objects. In the future, I could explore using transfer learning with models like Faster R-CNN, expanding my dataset with more challenging examples, or improving preprocessing to handle different image backgrounds and contexts better. Overall, I gained valuable hands-on experience in data processing, initializing model weights, building CNNs with PyTorch, and thoroughly evaluating my results.

View Code