Depthwise Conv2d throughput degrades significantly when spatial dimensions are not divisible by 16. This surfaces in multi-stage vision encoders where repeated 2x downsampling produces non-mod-16 intermediate feature maps.
Repro
repro_depthwise_mod16.py
Results
| Channels |
Mod-16 res |
Other res |
Mod-16 ms |
Other ms |
Gap |
Other mod 16 |
| 96 |
256x256 |
240x240 |
0.52 |
0.50 |
-5% |
0 |
| 192 |
128x128 |
120x120 |
0.47 |
0.45 |
-5% |
8 |
| 384 |
64x64 |
60x60 |
0.43 |
0.83 |
+94% |
12 |
| 768 |
32x32 |
30x30 |
0.37 |
0.55 |
+47% |
14 |
| 1536 |
16x16 |
15x15 |
0.37 |
0.47 |
+27% |
15 |
When the non-mod-16 dimension is 240 (still divisible by 16), there is no penalty. The gap appears at 120 and below, where the remainder is nonzero, and compounds across stages to produce ~53% end-to-end throughput loss through a 5-stage encoder.
Expected behavior
Depthwise Conv2d throughput should scale proportionally with pixel count regardless of whether spatial dimensions are divisible by 16.
Depthwise Conv2d throughput degrades significantly when spatial dimensions are not divisible by 16. This surfaces in multi-stage vision encoders where repeated 2x downsampling produces non-mod-16 intermediate feature maps.
Repro
repro_depthwise_mod16.py
Results
When the non-mod-16 dimension is 240 (still divisible by 16), there is no penalty. The gap appears at 120 and below, where the remainder is nonzero, and compounds across stages to produce ~53% end-to-end throughput loss through a 5-stage encoder.
Expected behavior
Depthwise Conv2d throughput should scale proportionally with pixel count regardless of whether spatial dimensions are divisible by 16.