Revisiting Skip Connections in Transformers

Abdulkader Helwan
Jan 4, 2024

This article is part of a series about Transformer skip connections and attention layers. If you haven’t read the others, refer to the introductory article here. The next article in this series is here.

I recently had a chat with one of my best friends, who happens to be a great machine learning scientist working at a very big company (Spire, Luxembourg). Our friendship goes way back: we did our Master’s at the same university, and I learned a lot from him. Long story short, we were talking about Transformers, and I realized that my friend believes that what makes Transformers good is not the attention, it is the skip connections. In his view, the authors wanted to make it ‘fancy’, so of course they couldn’t just say it is the skip connections, since those were invented a long time ago. Hence, it was the attention block that got all the interest and the spotlight. My friend believes attention is good, but not really needed in Transformers, since it is doing the same job as the skip connections.

Bottom line: my friend is trying to prove this by running experiments on Transformers with and without attention and skip connections. If the results are validating, we will publish them, most probably in a journal.

For now, I wanted to share the idea here and run my own simple experiments on the topic. In this first post, I will study the effect of skip connections on Transformer performance. The setup is simple: we will create two Transformers, one with skip connections…
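To make the comparison concrete, here is a minimal sketch of what such a setup could look like in PyTorch. The `EncoderBlock` class, the `use_skip` flag, and all dimensions are my own illustration for this post, not the code from my friend’s experiments:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder block; skip connections can be toggled off."""
    def __init__(self, d_model=128, n_heads=4, d_ff=512, use_skip=True):
        super().__init__()
        self.use_skip = use_skip
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        # With skip connections, each sublayer's output is added back to its input
        # before normalization; without them, the sublayer output passes through alone.
        x = self.norm1(x + attn_out) if self.use_skip else self.norm1(attn_out)
        ff_out = self.ff(x)
        x = self.norm2(x + ff_out) if self.use_skip else self.norm2(ff_out)
        return x

# The two variants to compare: identical except for the residual paths.
with_skip = EncoderBlock(use_skip=True)
without_skip = EncoderBlock(use_skip=False)
out = with_skip(torch.randn(2, 16, 128))  # (batch, sequence length, d_model)
```

Training both variants on the same task and data, with everything else held fixed, is one simple way to isolate how much of the performance actually comes from the skip connections.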
