Managing India's Fastest Supercomputer

Rajat Kumar R
12 min read

For two years, I had the privilege of managing SahasraT, the supercomputer at the Indian Institute of Science (IISc). Here's what it's like to operate a machine that serves hundreds of researchers.

What is SahasraT?

SahasraT (meaning "thousand-headed" in Sanskrit) was India's fastest supercomputer during my tenure at SERC (Supercomputer Education and Research Centre). The numbers are impressive:

  • 33,000+ CPU cores across 1,500+ nodes
  • Petabytes of storage using Lustre parallel filesystem
  • InfiniBand interconnect with 100 Gbps bandwidth
  • Serving 500+ researchers from institutions across India

Daily Operations

Morning Health Checks

Every day started with reviewing overnight alerts:

  • Node failures (typically 1-3 nodes per day in a system this size)
  • Storage utilization (researchers can fill a petabyte surprisingly fast)
  • Job queue status (SLURM scheduler logs)
  • Network performance metrics
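
Checks like these are easy to script against the scheduler. Here's a minimal sketch (illustrative, not the tooling we actually ran) that flags problem nodes from `sinfo` output:

```python
import subprocess

def parse_down_nodes(sinfo_output):
    """Return node names whose SLURM state indicates a problem.

    Expects output in the format produced by:
        sinfo --Node --noheader --format="%N %t"
    """
    bad_states = {"down", "drain", "drng", "fail", "maint"}
    problem_nodes = []
    for line in sinfo_output.strip().splitlines():
        node, state = line.split()
        # SLURM appends '*' to a state when the node is unreachable.
        if state.rstrip("*").lower() in bad_states:
            problem_nodes.append(node)
    return problem_nodes

def check_cluster():
    """Query SLURM directly; requires sinfo on PATH."""
    out = subprocess.run(
        ["sinfo", "--Node", "--noheader", "--format=%N %t"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_down_nodes(out)
```

Run from cron each morning, a script like this turns "review overnight alerts" into "read one short email".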

User Support

Researchers aren't systems administrators. Common support requests:

  • "My job has been queued for 3 days"
  • "I need more storage quota"
  • "Why is my MPI program only using one node?"
  • "Can you install [obscure library]?"

Incident Response

When 33,000 cores depend on you, incidents are inevitable:

Memorable Incident #1: A researcher's job consumed runaway amounts of memory, causing the Lustre filesystem to thrash. We had to identify and kill the offending job while 200 other jobs were affected.

Memorable Incident #2: A power fluctuation took down an entire rack. Coordinating with facilities to restore power while managing user expectations was challenging.

The Dashboard I Built

Manual monitoring wasn't scalable. I built a Django dashboard that:

  1. Aggregated metrics from SLURM, Nagios, and custom agents
  2. Visualized node health with an interactive cluster map
  3. Tracked job analytics: which queues were busy, average wait times
  4. Automated reporting: weekly utilization reports for management
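
The job-analytics piece reduces to simple aggregation over scheduler records. A hedged sketch (field names assumed, modeled on what `sacct` can emit, not the dashboard's actual code) of computing average queue wait per partition:

```python
from collections import defaultdict
from datetime import datetime

def avg_wait_by_partition(jobs):
    """jobs: iterable of dicts with 'partition', 'submit', and 'start'
    keys, timestamps as ISO 8601 strings.
    Returns {partition: average wait in seconds}."""
    totals = defaultdict(lambda: [0.0, 0])  # partition -> [sum, count]
    for job in jobs:
        submit = datetime.fromisoformat(job["submit"])
        start = datetime.fromisoformat(job["start"])
        wait = (start - submit).total_seconds()
        totals[job["partition"]][0] += wait
        totals[job["partition"]][1] += 1
    return {p: s / n for p, (s, n) in totals.items()}
```

The same fold works for any per-queue metric (CPU-hours, job counts); the dashboard just charted the results.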

Technical Challenges

Data volume: With 33,000 cores reporting metrics every minute, we generated gigabytes of time-series data daily. PostgreSQL with TimescaleDB handled this efficiently.
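
As a back-of-the-envelope check on that volume (metrics-per-node and bytes-per-sample here are illustrative assumptions, not measured figures):

```python
nodes = 1500               # compute nodes reporting
metrics_per_node = 50      # assumed gauges per node
sample_bytes = 64          # assumed stored bytes per sample, row overhead included
samples_per_day = 24 * 60  # one sample per minute

daily_bytes = nodes * metrics_per_node * sample_bytes * samples_per_day
print(f"{daily_bytes / 1e9:.1f} GB/day")  # ~6.9 GB/day
```

Even with conservative assumptions you land in the gigabytes-per-day range, which is why a purpose-built time-series layer on top of PostgreSQL mattered.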

Real-time updates: Users wanted live job status. Server-Sent Events provided efficient push updates without WebSocket complexity.
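
Part of SSE's appeal is that it's just a framed text stream over HTTP: each message is one or more `data:` lines terminated by a blank line. A framework-agnostic formatter (a sketch, not the dashboard's actual code):

```python
import json

def sse_event(data, event=None, event_id=None):
    """Frame one Server-Sent Event per the EventSource wire format."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data)}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"
```

On the Django side, strings like these would be yielded from a generator wrapped in a `StreamingHttpResponse` with `content_type="text/event-stream"`; the browser's `EventSource` handles reconnection for free.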

Authentication: Integrating with the institute's LDAP while maintaining security was tricky.

Lessons for Any Large System

1. Automation is Essential

At scale, manual intervention doesn't work. Automate:

  • Health checks
  • Log rotation
  • Backup verification
  • User provisioning

2. Documentation Saves Lives

When I started, tribal knowledge was scattered. I documented:

  • Runbooks for common incidents
  • Architecture diagrams
  • Vendor contact information
  • Escalation procedures

3. Capacity Planning

Researchers always want more. Track usage trends and plan expansion before you hit limits.
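
Even a crude linear projection over usage samples gives you a "days until full" number to put in front of management. A minimal sketch (least-squares fit; the numbers in the test below are illustrative):

```python
def days_until_full(samples, capacity):
    """samples: list of (day_index, used) points.
    Fit a straight line by least squares and project when usage
    reaches capacity. Returns None if usage is flat or shrinking."""
    n = len(samples)
    sx = sum(d for d, _ in samples)
    sy = sum(u for _, u in samples)
    sxx = sum(d * d for d, _ in samples)
    sxy = sum(d * u for d, u in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    if slope <= 0:
        return None  # no growth trend to project
    return (capacity - intercept) / slope
```

Real capacity planning is messier (growth is bursty, quotas change), but a trend line you check weekly beats discovering the limit the day you hit it.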

Conclusion

Operating a supercomputer taught me systems thinking at scale. The principles - automation, monitoring, documentation, capacity planning - apply to any distributed system, from Kubernetes clusters to cloud infrastructure.

If you get the chance to work on large-scale systems, take it. The experience is invaluable.
