Managing India's Fastest Supercomputer

Rajat Kumar R
12 min read

For two years, I had the privilege of managing SahasraT, the supercomputer at the Indian Institute of Science (IISc). Here's what it's like to operate a machine that serves hundreds of researchers.

What is SahasraT?

SahasraT (meaning "thousand-headed" in Sanskrit) was India's fastest supercomputer during my tenure at SERC (Supercomputer Education and Research Centre). The numbers are impressive:

  • 33,000+ CPU cores across 1,500+ nodes
  • Petabytes of storage using Lustre parallel filesystem
  • InfiniBand interconnect with 100 Gbps bandwidth
  • Serving 500+ researchers from institutions across India

Daily Operations

Morning Health Checks

Every day started with reviewing overnight alerts:

  • Node failures (typically 1-3 nodes per day in a system this size)
  • Storage utilization (researchers can fill a petabyte surprisingly fast)
  • Job queue status (SLURM scheduler logs)
  • Network performance metrics
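
Checks like these are easy to script against the scheduler. Here's a minimal sketch (illustrative, not the tooling we actually ran) that flags problem nodes from `sinfo` output:

```python
import subprocess

def parse_down_nodes(sinfo_output):
    """Return node names whose SLURM state indicates a problem.

    Expects output in the format produced by:
        sinfo --Node --noheader --format="%N %t"
    """
    bad_states = {"down", "drain", "drng", "fail", "maint"}
    problem_nodes = []
    for line in sinfo_output.strip().splitlines():
        node, state = line.split()
        # SLURM appends '*' to a state when the node is unreachable.
        if state.rstrip("*").lower() in bad_states:
            problem_nodes.append(node)
    return problem_nodes

def check_cluster():
    """Query SLURM directly; requires sinfo on PATH."""
    out = subprocess.run(
        ["sinfo", "--Node", "--noheader", "--format=%N %t"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_down_nodes(out)
```

Run from cron each morning, a script like this turns "review overnight alerts" into "read one short email".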

User Support

Researchers aren't systems administrators. Common support requests:

  • "My job has been queued for 3 days"
  • "I need more storage quota"
  • "Why is my MPI program only using one node?"
  • "Can you install [obscure library]?"

Incident Response

When 33,000 cores depend on you, incidents are inevitable:

Memorable Incident #1: A researcher's job consumed runaway amounts of memory, causing the Lustre filesystem to thrash. We had to identify and kill the offending job while 200 other jobs were affected.

Memorable Incident #2: A power fluctuation took down an entire rack. Coordinating with facilities to restore power while managing user expectations was challenging.

The Dashboard I Built

Manual monitoring wasn't scalable. I built a Django dashboard that:

  1. Aggregated metrics from SLURM, Nagios, and custom agents
  2. Visualized node health with an interactive cluster map
  3. Tracked job analytics: which queues were busy, average wait times
  4. Automated reporting: weekly utilization reports for management
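
The job-analytics piece reduces to simple aggregation over scheduler records. A hedged sketch (field names assumed, modeled on what `sacct` can emit, not the dashboard's actual code) of computing average queue wait per partition:

```python
from collections import defaultdict
from datetime import datetime

def avg_wait_by_partition(jobs):
    """jobs: iterable of dicts with 'partition', 'submit', and 'start'
    keys, timestamps as ISO 8601 strings.
    Returns {partition: average wait in seconds}."""
    totals = defaultdict(lambda: [0.0, 0])  # partition -> [sum, count]
    for job in jobs:
        submit = datetime.fromisoformat(job["submit"])
        start = datetime.fromisoformat(job["start"])
        wait = (start - submit).total_seconds()
        totals[job["partition"]][0] += wait
        totals[job["partition"]][1] += 1
    return {p: s / n for p, (s, n) in totals.items()}
```

The same fold works for any per-queue metric (CPU-hours, job counts); the dashboard just charted the results.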

Technical Challenges

Data volume: With 33,000 cores reporting metrics every minute, we generated gigabytes of time-series data daily. PostgreSQL with TimescaleDB handled this efficiently.
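
As a back-of-the-envelope check on that volume (metrics-per-node and bytes-per-sample here are illustrative assumptions, not measured figures):

```python
nodes = 1500               # compute nodes reporting
metrics_per_node = 50      # assumed gauges per node
sample_bytes = 64          # assumed stored bytes per sample, row overhead included
samples_per_day = 24 * 60  # one sample per minute

daily_bytes = nodes * metrics_per_node * sample_bytes * samples_per_day
print(f"{daily_bytes / 1e9:.1f} GB/day")  # ~6.9 GB/day
```

Even with conservative assumptions you land in the gigabytes-per-day range, which is why a purpose-built time-series layer on top of PostgreSQL mattered.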

Real-time updates: Users wanted live job status. Server-Sent Events provided efficient push updates without WebSocket complexity.
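
Part of SSE's appeal is that it's just a framed text stream over HTTP: each message is one or more `data:` lines terminated by a blank line. A framework-agnostic formatter (a sketch, not the dashboard's actual code):

```python
import json

def sse_event(data, event=None, event_id=None):
    """Frame one Server-Sent Event per the EventSource wire format."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data)}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"
```

On the Django side, strings like these would be yielded from a generator wrapped in a `StreamingHttpResponse` with `content_type="text/event-stream"`; the browser's `EventSource` handles reconnection for free.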

Authentication: Integrating with the institute's LDAP while maintaining security was tricky.

Lessons for Any Large System

1. Automation is Essential

At scale, manual intervention doesn't work. Automate:

  • Health checks
  • Log rotation
  • Backup verification
  • User provisioning

2. Documentation Saves Lives

When I started, tribal knowledge was scattered. I documented:

  • Runbooks for common incidents
  • Architecture diagrams
  • Vendor contact information
  • Escalation procedures

3. Capacity Planning

Researchers always want more. Track usage trends and plan expansion before you hit limits.
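
Even a crude linear projection over usage samples gives you a "days until full" number to put in front of management. A minimal sketch (least-squares fit; the numbers in the test below are illustrative):

```python
def days_until_full(samples, capacity):
    """samples: list of (day_index, used) points.
    Fit a straight line by least squares and project when usage
    reaches capacity. Returns None if usage is flat or shrinking."""
    n = len(samples)
    sx = sum(d for d, _ in samples)
    sy = sum(u for _, u in samples)
    sxx = sum(d * d for d, _ in samples)
    sxy = sum(d * u for d, u in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    if slope <= 0:
        return None  # no growth trend to project
    return (capacity - intercept) / slope
```

Real capacity planning is messier (growth is bursty, quotas change), but a trend line you check weekly beats discovering the limit the day you hit it.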

Conclusion

Operating a supercomputer taught me systems thinking at scale. The principles - automation, monitoring, documentation, capacity planning - apply to any distributed system, from Kubernetes clusters to cloud infrastructure.

If you get the chance to work on large-scale systems, take it. The experience is invaluable.
