Managing India's Fastest Supercomputer

For two years, I had the privilege of managing SahasraT, the supercomputer at the Indian Institute of Science (IISc). Here's what it's like to operate a machine that serves hundreds of researchers.
What is SahasraT?
SahasraT (meaning "thousand-headed" in Sanskrit) was India's fastest supercomputer during my tenure at SERC (Supercomputer Education and Research Centre). The numbers are impressive:
- 33,000+ CPU cores across 1,500+ nodes
- Petabytes of storage using Lustre parallel filesystem
- InfiniBand interconnect with 100 Gbps bandwidth
- Serving 500+ researchers from institutions across India
Daily Operations
Morning Health Checks
Every day started with reviewing overnight alerts (a sketch of the kind of check script this boiled down to follows this list):
- Node failures (typically 1-3 nodes per day in a system this size)
- Storage utilization (researchers can fill a petabyte surprisingly fast)
- Job queue status (SLURM scheduler logs)
- Network performance metrics
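Here's a minimal sketch of such a morning check, assuming standard SLURM and coreutils commands on the PATH; the scratch mount point and the 85% threshold are placeholders, not our production values:

```python
#!/usr/bin/env python3
"""Morning health-check sketch: flag unhealthy nodes, a filling scratch
filesystem, and the queue backlog. Paths and thresholds are illustrative."""
import subprocess

def run(cmd):
    """Run a command and return its stdout as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def unhealthy_nodes():
    # One "nodelist state" pair per line, no header
    out = run(["sinfo", "--noheader", "--format=%N %T"])
    return [line for line in out.splitlines()
            if any(bad in line for bad in ("down", "drain", "fail"))]

def scratch_usage_percent(mount="/scratch"):    # illustrative mount point
    out = run(["df", "--output=pcent", mount])
    return int(out.splitlines()[-1].strip().rstrip("%"))

def pending_job_count():
    out = run(["squeue", "--noheader", "--states=PENDING", "--format=%i"])
    return len(out.splitlines())

if __name__ == "__main__":
    bad = unhealthy_nodes()
    if bad:
        print(f"{len(bad)} node groups unhealthy:", *bad, sep="\n  ")
    usage = scratch_usage_percent()
    if usage > 85:                               # illustrative threshold
        print(f"Scratch at {usage}% - time to chase the top consumers")
    print(f"{pending_job_count()} jobs pending")
```

A script like this is easy to wire into cron or a morning email so the review starts from a summary rather than raw logs.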
User Support
Researchers aren't systems administrators. Common support requests:
- "My job has been queued for 3 days"
- "I need more storage quota"
- "Why is my MPI program only using one node?"
- "Can you install [obscure library]?"
Incident Response
When 33,000 cores depend on you, incidents are inevitable:
Memorable Incident #1: A researcher's job ran away with memory, causing the Lustre filesystem to thrash. We had to identify and kill the job while 200 other jobs were affected.
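Finding the culprit quickly was the hard part. This isn't the script we used that day, but a sketch of the general idea: rank running jobs by resident memory using SLURM's accounting tools (assumes job accounting and `sstat` are enabled; step handling is simplified):

```python
"""Sketch: rank running jobs by resident memory via SLURM accounting."""
import subprocess

def running_job_ids():
    out = subprocess.run(
        ["squeue", "--noheader", "--states=RUNNING", "--format=%i"],
        capture_output=True, text=True, check=True).stdout
    return out.split()

def max_rss_kib(job_id):
    # sstat reports MaxRSS per job step; take the largest across steps
    out = subprocess.run(
        ["sstat", "-j", job_id, "--noheader", "--parsable2", "--format=MaxRSS"],
        capture_output=True, text=True).stdout
    factors = {"K": 1, "M": 1024, "G": 1024 * 1024}
    values = [float(v[:-1]) * factors[v[-1]]
              for v in out.split() if v and v[-1] in factors]
    return max(values, default=0.0)

if __name__ == "__main__":
    usage = {job: max_rss_kib(job) for job in running_job_ids()}
    for job, kib in sorted(usage.items(), key=lambda kv: kv[1], reverse=True)[:10]:
        print(job, f"{kib / 1024 / 1024:.1f} GiB")
```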
Memorable Incident #2: A power fluctuation took down an entire rack. Coordinating with facilities to restore power while managing user expectations was challenging.
The Dashboard I Built
Manual monitoring wasn't scalable, so I built a Django dashboard (a small sketch of the idea follows this list) that:
- Aggregated metrics from SLURM, Nagios, and custom agents
- Visualized node health with an interactive cluster map
- Tracked job analytics: which queues were busy, average wait times
- Automated reporting: weekly utilization reports for management
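The actual codebase was larger, but the core idea fits in a few lines. Here's a sketch with illustrative names (NodeSample and cluster_map are made up for this post, not the real schema):

```python
# dashboard/models.py -- illustrative names, not the production schema
from django.db import models

class NodeSample(models.Model):
    """One health reading for one compute node."""
    hostname = models.CharField(max_length=64, db_index=True)
    state = models.CharField(max_length=16)            # e.g. "idle", "alloc", "down"
    load = models.FloatField()
    sampled_at = models.DateTimeField(auto_now_add=True, db_index=True)

# dashboard/views.py
from django.http import JsonResponse
from .models import NodeSample

def cluster_map(request):
    """Latest state per node, consumed by the interactive cluster map."""
    latest = {}
    for sample in NodeSample.objects.order_by("hostname", "-sampled_at"):
        latest.setdefault(sample.hostname, sample)     # first hit per host is newest
    return JsonResponse(
        {host: {"state": s.state, "load": s.load} for host, s in latest.items()}
    )
```

A JSON endpoint like this is all the backend an interactive cluster map really needs; the rest is frontend work.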
Technical Challenges
Data volume: With 33,000 cores reporting metrics every minute, we generated gigabytes of time-series data daily. PostgreSQL with TimescaleDB handled this efficiently.
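TimescaleDB's trick is that the metrics table stays a normal PostgreSQL table, just partitioned into time chunks behind the scenes. A sketch of the setup, with an illustrative table name and connection string, assuming the extension is already installed:

```python
"""Sketch: store per-node metrics in a TimescaleDB hypertable."""
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS node_metrics (
    time        TIMESTAMPTZ NOT NULL,
    hostname    TEXT        NOT NULL,
    cpu_load    DOUBLE PRECISION,
    mem_used_gb DOUBLE PRECISION
);
-- Convert the plain table into a hypertable partitioned on time
SELECT create_hypertable('node_metrics', 'time', if_not_exists => TRUE);
"""

with psycopg2.connect("dbname=dashboard user=dashboard") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
        # Metric writers then insert normally; TimescaleDB handles the chunking
        cur.execute(
            "INSERT INTO node_metrics VALUES (now(), %s, %s, %s)",
            ("node0001", 12.4, 48.2),
        )
```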
Real-time updates: Users wanted live job status. Server-Sent Events provided efficient push updates without WebSocket complexity.
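In Django, SSE is just a streaming response with the `text/event-stream` content type. A minimal sketch (the payload below is a placeholder, not the real job data):

```python
# Minimal Server-Sent Events endpoint: the browser's EventSource API
# reconnects on its own, so there is no WebSocket machinery to manage.
import json
import time
from django.http import StreamingHttpResponse

def job_status_stream(request):
    def event_stream():
        while True:
            payload = {"running": 1234, "pending": 567}    # placeholder numbers
            # SSE wire format: "data: <json>\n\n" per event
            yield f"data: {json.dumps(payload)}\n\n"
            time.sleep(5)
    response = StreamingHttpResponse(event_stream(),
                                     content_type="text/event-stream")
    response["Cache-Control"] = "no-cache"
    return response
```

On the client, `new EventSource("/jobs/stream/")` is all it takes, and the browser handles reconnects for free.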
Authentication: Integrating with the institute's LDAP while maintaining security was tricky.
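With `django-auth-ldap`, most of the integration lives in settings. A sketch with a placeholder server URI and DNs (not the institute's actual directory layout):

```python
# settings.py sketch using django-auth-ldap; URI and DNs are placeholders
import os
import ldap
from django_auth_ldap.config import LDAPSearch

AUTH_LDAP_SERVER_URI = "ldaps://ldap.example.ac.in"
AUTH_LDAP_BIND_DN = "cn=dashboard,ou=services,dc=example,dc=ac,dc=in"
AUTH_LDAP_BIND_PASSWORD = os.environ["LDAP_BIND_PASSWORD"]   # never in source
AUTH_LDAP_USER_SEARCH = LDAPSearch(
    "ou=people,dc=example,dc=ac,dc=in",
    ldap.SCOPE_SUBTREE,
    "(uid=%(user)s)",
)

AUTHENTICATION_BACKENDS = [
    "django_auth_ldap.backend.LDAPBackend",        # try the directory first
    "django.contrib.auth.backends.ModelBackend",   # fall back to local accounts
]
```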
Lessons for Any Large System
1. Automation is Essential
At scale, manual intervention doesn't work. Automate everything routine (one example follows this list):
- Health checks
- Log rotation
- Backup verification
- User provisioning
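As one example, user provisioning on a SLURM cluster can shrink to a scripted `sacctmgr` call. A sketch with made-up names; real provisioning typically also involves home directories, quotas, and directory entries:

```python
"""Sketch: provision a cluster user under a SLURM account via sacctmgr.
Account and user names are made up; assumes coordinator/admin rights."""
import subprocess

def provision_user(username: str, account: str, fairshare: int = 1) -> None:
    # Create the user/account association in the SLURM accounting database
    subprocess.run(
        ["sacctmgr", "--immediate", "add", "user", username,
         f"Account={account}", f"Fairshare={fairshare}"],
        check=True,
    )

provision_user("newphd01", "materials-lab")
```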
2. Documentation Saves Lives
When I started, tribal knowledge was scattered. I documented:
- Runbooks for common incidents
- Architecture diagrams
- Vendor contact information
- Escalation procedures
3. Capacity Planning
Researchers always want more. Track usage trends and plan expansion before you hit limits.
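A sketch of what that tracking can look like: take monthly utilization figures (figures like these can come from SLURM's `sreport`; the numbers below are made up) and project when you'll cross a planning threshold:

```python
"""Sketch: project when utilization crosses a planning threshold."""
import numpy as np

# Average utilization (% of available core-hours) over the last six months
months = np.arange(6)
utilization = np.array([62.0, 66.5, 71.0, 74.5, 79.0, 83.5])   # illustrative

slope, intercept = np.polyfit(months, utilization, 1)           # linear trend
months_left = (90.0 - utilization[-1]) / slope if slope > 0 else float("inf")

print(f"Utilization growing ~{slope:.1f} points/month; "
      f"~{months_left:.0f} months until the 90% planning threshold")
```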
Conclusion
Operating a supercomputer taught me systems thinking at scale. The principles - automation, monitoring, documentation, capacity planning - apply to any distributed system, from Kubernetes clusters to cloud infrastructure.
If you get the chance to work on large-scale systems, take it. The experience is invaluable.