DAT-Inspired Lightweight Attention Network for Real-World Image Super-Resolution
This is a super-resolution architecture inspired by several existing designs, primarily DAT, PLKSR, and SPAN. It is intended to be a faster version of the DAT family of models while staying visually and perceptually close to it.
Created with the help of LLMs such as Claude and Gemini.
Contact sharekhan. on the Enhance Everything Discord.
FDAT (FastDAT) is a neural network architecture designed for image super-resolution tasks. It combines the strengths of both attention mechanisms and convolutional operations to effectively upscale low-resolution images while preserving fine details and structural information.
FDAT employs two complementary attention mechanisms:
- Spatial Window Attention: Processes spatial relationships within local windows, reducing computational complexity compared to full spatial attention while maintaining effectiveness
- Channel Attention: Focuses on inter-channel relationships to enhance feature representation and selection
The "Fast" in FDAT refers to several optimizations:
- Window-based Spatial Attention: Instead of computing attention across the entire spatial dimension, it uses local windows (default 8×8), significantly reducing computational cost (see the cost comparison after this list)
- Simplified Components: Streamlined implementations of attention and feed-forward modules
- Efficient Channel Attention: Uses normalized queries and keys for stable and fast channel-wise attention computation
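To get a rough sense of the saving from windowing, the arithmetic below compares the size of the attention score matrices for a 64×64 feature map under full spatial attention versus the default 8×8 windows. These are illustrative back-of-the-envelope numbers, not a benchmark of the actual implementation:

```python
# Illustrative cost comparison: global spatial attention vs. 8x8 window attention.
# Self-attention over N tokens builds an N x N score matrix, so cost grows as N^2.
H = W = 64                                   # spatial size of a feature map
window = 8                                   # default FDAT window size

global_tokens = H * W                        # 4096 tokens attend to each other
global_scores = global_tokens ** 2           # 16,777,216 attention scores

num_windows = (H // window) * (W // window)  # 64 independent 8x8 windows
tokens_per_window = window * window          # 64 tokens per window
window_scores = num_windows * tokens_per_window ** 2   # 262,144 scores in total

print(f"global attention scores:   {global_scores:,}")
print(f"windowed attention scores: {window_scores:,}")
print(f"reduction factor:          {global_scores // window_scores}x")  # 64x fewer
```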
A key innovation is the Simplified AIM (Attention Interaction Module) that intelligently fuses:
- Attention-based global feature understanding
- Convolution-based local feature extraction
This dual pathway allows the model to capture both local textures and global structural patterns effectively.
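The exact fusion used by FDAT lives in its source code; as a minimal sketch of the general idea, the module below gates between an attention-branch feature map and a convolution-branch feature map with a learned sigmoid gate. The class name `GatedFusion` and the specific gating scheme are assumptions for illustration, not the actual AIM implementation:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Toy stand-in for an AIM-style fusion: a learned gate decides, per pixel
    and channel, how much of the attention branch vs. the conv branch to keep."""

    def __init__(self, dim: int):
        super().__init__()
        # The gate looks at both branches and predicts a mixing weight in [0, 1].
        self.gate = nn.Sequential(
            nn.Conv2d(2 * dim, dim, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, attn_feat: torch.Tensor, conv_feat: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([attn_feat, conv_feat], dim=1))
        return g * attn_feat + (1.0 - g) * conv_feat


# Fuse two (B, C, H, W) feature maps of the same shape.
fusion = GatedFusion(dim=96)
fused = fusion(torch.randn(1, 96, 32, 32), torch.randn(1, 96, 32, 32))  # (1, 96, 32, 32)
```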
Architecture overview:

Input Image
↓
[Initial Convolution] - Embed input to feature space
↓
[Residual Groups] - Multiple groups of dual attention blocks
├── Spatial Window Attention + Conv → AIM Fusion
├── Channel Attention + Conv → AIM Fusion
└── FFN with Spatial Mixing
↓
[Final Convolution] - Feature refinement
↓
[Flexible Upsampler] - Multiple upsampling strategies
↓
Output HR Image
The Dual Attention Block is the core building block (a schematic sketch follows this list). It:
- Alternates between spatial and channel attention based on block pattern
- Runs parallel convolution pathway for local feature extraction
- Fuses attention and convolution features via AIM
- Applies feed-forward network with spatial mixing
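Putting those pieces together, a schematic version of the block might look like the sketch below. The normalization placement, the depthwise convolution in the local pathway, and the injected `attn`/`fusion`/`ffn` submodules are assumptions made for illustration; FDAT's real block differs in the details:

```python
import torch
import torch.nn as nn

class DualAttentionBlockSketch(nn.Module):
    """Schematic FDAT-style block: attention pathway + parallel conv pathway,
    fused together, followed by an FFN, each wrapped in a residual connection."""

    def __init__(self, dim: int, attn: nn.Module, fusion, ffn: nn.Module):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)   # channel-wise norm on (B, C, H, W) maps
        self.norm2 = nn.GroupNorm(1, dim)
        self.attn = attn                    # spatial-window OR channel attention
        self.conv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # local conv pathway
        self.fusion = fusion                # AIM-style fusion of the two pathways
        self.ffn = ffn                      # feed-forward with spatial mixing

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.norm1(x)
        x = x + self.fusion(self.attn(y), self.conv(y))   # dual pathway + residual
        return x + self.ffn(self.norm2(x))                 # FFN + residual


# Trivial stand-ins, just to show the data flow end to end.
dim = 96
block = DualAttentionBlockSketch(
    dim,
    attn=nn.Conv2d(dim, dim, 1),            # stand-in for window/channel attention
    fusion=lambda a, b: 0.5 * (a + b),      # stand-in for the AIM fusion
    ffn=nn.Conv2d(dim, dim, 1),             # stand-in for the spatial-mixing FFN
)
out = block(torch.randn(1, dim, 32, 32))    # (1, 96, 32, 32)
```

The attention, channel-attention, and FFN sketches further down use the same (B, C, H, W) interface, so they could be dropped into these slots.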
The Spatial Window Attention module (sketched after this list):
- Divides the input into non-overlapping windows
- Computes self-attention within each window
- Uses learnable relative position bias
- Handles padding for inputs not divisible by window size
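A minimal, Swin-style version of such a module is sketched below; the projection layout, the bias parameterization, and omitted details such as attention dropout are assumptions, not FDAT's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowAttentionSketch(nn.Module):
    """Self-attention inside non-overlapping square windows, with padding for
    inputs whose sides are not divisible by the window size and a learnable
    relative position bias."""

    def __init__(self, dim: int, window_size: int = 8, num_heads: int = 4):
        super().__init__()
        self.ws, self.heads = window_size, num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One learnable bias per head per relative (dy, dx) offset.
        self.rel_bias = nn.Parameter(torch.zeros(num_heads, (2 * window_size - 1) ** 2))
        rows = torch.arange(window_size).repeat_interleave(window_size)
        cols = torch.arange(window_size).repeat(window_size)
        coords = torch.stack([rows, cols])                    # (2, ws*ws)
        rel = coords[:, :, None] - coords[:, None, :] + window_size - 1
        self.register_buffer("bias_index", rel[0] * (2 * window_size - 1) + rel[1])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        ws, heads = self.ws, self.heads
        # Pad bottom/right so H and W become divisible by the window size.
        x = F.pad(x, (0, (-w) % ws, 0, (-h) % ws))
        hp, wp = x.shape[-2:]
        # Partition into (B * num_windows, ws*ws, C) token groups.
        x = x.view(b, c, hp // ws, ws, wp // ws, ws)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, c)
        # Multi-head self-attention within each window.
        qkv = self.qkv(x).reshape(x.shape[0], ws * ws, 3, heads, c // heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                  # each (B*, heads, N, d)
        attn = (q @ k.transpose(-2, -1)) * (c // heads) ** -0.5
        attn = attn + self.rel_bias[:, self.bias_index]       # relative position bias
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(-1, ws * ws, c)
        out = self.proj(out)
        # Undo the window partition and strip the padding.
        out = out.view(b, hp // ws, wp // ws, ws, ws, c)
        out = out.permute(0, 5, 1, 3, 2, 4).reshape(b, c, hp, wp)
        return out[:, :, :h, :w]


attn = WindowAttentionSketch(dim=96, window_size=8, num_heads=4)
y = attn(torch.randn(1, 96, 30, 44))   # odd sizes are padded internally -> (1, 96, 30, 44)
```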
The Channel Attention module (sketched after this list):
- Focuses on channel-wise feature relationships
- Normalizes queries and keys for stable training
- Uses temperature-scaled attention for better control
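A minimal version of this kind of transposed (channel-as-token) attention, with L2-normalized queries/keys and a learnable temperature, is sketched below; the head layout and 1×1 projections are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttentionSketch(nn.Module):
    """Channel-wise self-attention: the C x C attention matrix relates channels
    to each other; normalized q/k and a learnable temperature keep the logits
    well-scaled."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.heads = num_heads
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)
        # Tokens are channels: reshape to (B, heads, C/heads, H*W).
        q = q.view(b, self.heads, c // self.heads, h * w)
        k = k.view(b, self.heads, c // self.heads, h * w)
        v = v.view(b, self.heads, c // self.heads, h * w)
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature   # (B, heads, C/h, C/h)
        out = attn.softmax(dim=-1) @ v                        # (B, heads, C/h, H*W)
        return self.proj(out.view(b, c, h, w))


ca = ChannelAttentionSketch(dim=96, num_heads=4)
y = ca(torch.randn(1, 96, 32, 32))   # (1, 96, 32, 32)
```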
The feed-forward network (FFN) with spatial mixing (sketched after this list):
- Expands feature dimensions for enhanced representation
- Includes spatial mixing via depthwise convolution
- Applies dropout for regularization
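A sketch of such an FFN follows; the 1×1 convolutions, GELU activation, and expansion ratio are assumptions chosen for brevity:

```python
import torch
import torch.nn as nn

class SpatialMixFFNSketch(nn.Module):
    """Feed-forward block that expands channels, mixes neighbouring pixels with
    a depthwise 3x3 convolution, and projects back, with optional dropout."""

    def __init__(self, dim: int, expansion: float = 2.0, drop: float = 0.0):
        super().__init__()
        hidden = int(dim * expansion)
        self.fc1 = nn.Conv2d(dim, hidden, kernel_size=1)                  # expand
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)  # spatial mixing
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(hidden, dim, kernel_size=1)                  # project back
        self.drop = nn.Dropout(drop)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.drop(self.fc2(self.act(self.dw(self.fc1(x)))))


ffn = SpatialMixFFNSketch(dim=96, expansion=2.0, drop=0.1)
y = ffn(torch.randn(1, 96, 32, 32))   # (1, 96, 32, 32)
```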
The flexible upsampler supports multiple strategies (the two simplest are sketched after this list):
- PixelShuffle: Efficient sub-pixel convolution
- Transposed Convolution: Learnable upsampling
- Nearest + Conv: Simple interpolation with refinement
- DySample: Advanced dynamic sampling (requires spandrel)
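For reference, the two simplest tails look roughly like the sketch below; the helper name `make_upsampler` is illustrative, and the transposed-convolution and DySample variants are omitted (the latter because it comes from spandrel):

```python
import torch
import torch.nn as nn

def make_upsampler(kind: str, dim: int, scale: int, out_ch: int = 3) -> nn.Module:
    """Two common upsampling tails for a super-resolution network."""
    if kind == "pixelshuffle":
        # Sub-pixel convolution: predict out_ch * scale^2 channels, then rearrange.
        return nn.Sequential(
            nn.Conv2d(dim, out_ch * scale ** 2, kernel_size=3, padding=1),
            nn.PixelShuffle(scale),
        )
    if kind == "nearest+conv":
        # Nearest-neighbour interpolation followed by a refining convolution.
        return nn.Sequential(
            nn.Upsample(scale_factor=scale, mode="nearest"),
            nn.Conv2d(dim, out_ch, kernel_size=3, padding=1),
        )
    raise ValueError(f"unknown upsampler: {kind}")


up = make_upsampler("pixelshuffle", dim=96, scale=4)
hr = up(torch.randn(1, 96, 64, 64))   # (1, 3, 256, 256)
```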
The pathways are complementary:
- Convolution captures fine textures and local patterns
- Window attention captures spatial relationships and structures
- Channel attention enhances feature discriminability
- AIM module intelligently combines these complementary representations
Efficiency comes from several choices:
- Window-based attention scales linearly with image size
- Simplified implementations reduce overhead
- Parallel conv-attention pathways enable efficient processing
Flexibility options include:
- Configurable attention patterns (spatial/channel alternation)
- Multiple model sizes (tiny, light, medium, large, xl)
- Various upsampling strategies for different use cases
- Adjustable window sizes and attention heads
Training-stability features include:
- Stochastic depth (DropPath) for regularization (see the sketch after this list)
- Normalized attention for stable gradients
- Residual connections preserve information flow
- Careful weight initialization
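Stochastic depth is a standard regularizer; the sketch below shows its usual formulation (not FDAT's exact code): during training, each sample skips a residual branch with some probability, and at inference the layer is a no-op:

```python
import torch
import torch.nn as nn

class DropPath(nn.Module):
    """Stochastic depth: randomly zero a residual branch per sample while
    training, rescaling survivors so the expected output is unchanged."""

    def __init__(self, drop_prob: float = 0.1):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training or self.drop_prob == 0.0:
            return x
        keep = 1.0 - self.drop_prob
        # One Bernoulli draw per sample, broadcast over all remaining dims.
        mask = x.new_empty((x.shape[0],) + (1,) * (x.dim() - 1)).bernoulli_(keep)
        return x * mask / keep
```

Inside a block it wraps the residual branch, e.g. `x = x + drop_path(branch(x))`.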
Model | Embed Dim | Groups | Depth/Group | Heads | Parameters*
---|---|---|---|---|---
fdat_tiny | 96 | 2 | 2 | 3 | ~500K
fdat_light | 108 | 3 | 2 | 4 | ~800K
fdat_medium | 120 | 4 | 3 | 4 | ~1.5M
fdat_large | 180 | 4 | 4 | 6 | ~3.5M
fdat_xl | 180 | 6 | 6 | 6 | ~7M
*Approximate parameter counts; exact values vary with configuration.
```python
import torch
from fdat_arch import fdat_tiny, fdat_medium, FDAT

# Use predefined model
model = fdat_tiny(scale=4)

# Or create custom configuration
model = FDAT(
    embed_dim=96,
    num_groups=3,
    depth_per_group=2,
    num_heads=4,
    window_size=8,
    upsampler_type="pixelshuffle",
    scale=4,
)

# Super-resolve an image
lr_image = torch.randn(1, 3, 64, 64)  # Low resolution input
with torch.no_grad():
    hr_image = model(lr_image)  # High resolution output (1, 3, 256, 256)
```
FDAT represents a balanced approach to super-resolution that:
- Leverages both attention and convolution: Rather than replacing convolutions entirely, it creates synergy between global attention and local convolution
- Prioritizes efficiency: Uses window attention and optimized implementations to maintain practical computational requirements
- Maintains flexibility: Supports various configurations and upsampling strategies for different use cases
- Focuses on stability: Incorporates modern training techniques like stochastic depth and careful normalization
- PyTorch >= 1.9.0
- numpy
- spandrel (for DySample upsampler)
```bash
pip install torch numpy spandrel
```
FDAT is particularly well-suited for:
- Image Super-Resolution: Primary use case for enhancing image resolution
- Image Restoration: Can be adapted for denoising and artifact removal
- Real-time Applications: Efficient architecture suitable for deployment
- Research: Flexible foundation for super-resolution research
The architecture's combination of efficiency, effectiveness, and flexibility makes it a strong choice for both practical applications and research in image enhancement tasks.