
GenAI Operations

Qantra
Full-time
Remote
Worldwide

About the job

LLMOps / GenAI Ops

Location: India (Remote)

Pay: INR 45-50 LPA

Experience: 10-14+ years overall, operating at Lead/Principal level

Employment Type: Full-time

We are seeking a Lead Azure GenAIOps / LLMOps Engineer to design, build, and operate a secure, observable, governed Azure GenAI platform that can be reused by multiple product and business teams.

This role is not focused on model training or fine-tuning. Instead, it owns LLM operationalization, governance, observability, safety, cost control, and platform reliability across enterprise environments.

You will work at the intersection of AI Platform Engineering, LLMOps, Cloud Architecture, and DevSecOps, partnering closely with application teams, security teams, and cloud platform teams.

Key Responsibilities

1. Azure GenAI Platform Ownership

• Architect and operate a shared, multi-tenant Azure GenAI platform using:

  1. Azure OpenAI
  2. Azure AI Foundry (must-have)

• Define reference architectures for RAG, agents, and LLM-powered apps.

• Decide and document usage patterns across AKS, App Service, and Azure ML (strong experience with at least one is expected; the platform design should support multiple runtimes).

2. LLM Runtime, Agent & Tool Governance

• Implement AI Gateway / Azure API Management for:

  1. Model routing and abstraction
  2. Throttling and quota enforcement
  3. Authentication and authorization

• Govern agent runtimes, including:

  1. Tool access control
  2. Permissions and identity boundaries
  3. Authentication, audit logging, and traceability

• Define MCP server / tool governance standards:

  1. Function calling approvals
  2. Tool versioning
  3. Change control and auditability

3. CI/CD, Environment Promotion & Configuration Management

• Build reusable pipeline templates for GenAI workloads.

• Define environment promotion models across:

DEV → NON-PROD → PROD

• Enforce:

  1. Git-based prompt, agent, and config versioning
  2. Approval workflows
  3. Rollback and hotfix strategies

• Manage golden datasets and regression test suites for:

  1. Prompts
  2. Agents
  3. RAG pipelines

4. Observability, Quality & Reliability

• Implement LLM observability using tools such as:

  1. Langfuse
  2. OpenTelemetry
  3. Azure Monitor / Application Insights

• Enable:

  1. Prompt & response tracing
  2. Retrieval tracing
  3. Tool-call tracing
  4. Token usage tracking
  5. Cost and latency dashboards

• Define and enforce SLIs/SLOs for GenAI workloads.

• Own incident response, on-call readiness, rollback, and DR testing.

5. RAG Quality & Evaluation

• Implement continuous monitoring for:

  1. Retrieval quality
  2. Chunk quality
  3. Citation quality
  4. Grounding score
  5. Hallucination regression

• Automate evaluation gates in CI/CD pipelines.

• Maintain baseline and golden datasets to detect quality drift.

6. GenAI Safety & Responsible AI Controls

• Implement enterprise safety controls:

  1. Prompt shields
  2. Jailbreak detection
  3. Groundedness checks
  4. Content moderation
  5. PII / PHI masking

• Design human-in-the-loop review and escalation workflows for risky outputs.

• Collaborate with security teams on policy definitions (ownership is shared, not siloed).

7. Security, Networking & Identity (Design Ownership)

• Design secure Azure architectures using:

  1. Private networking
  2. Private Endpoints
  3. Managed Identities
  4. Azure Key Vault
  5. VNet isolation

• Clarify responsibility boundaries:

  1. Own GenAI platform security design
  2. Collaborate with core security / platform teams for enterprise controls

• Heavy DevSecOps controls (SBOM, image signing, admission checks) are good-to-have unless mandated by the environment.

8. Cost, Routing & Performance Optimization

• Implement:

  1. Model routing and fallback strategies
  2. Throttling and quota management

• Optimize cost by:

  1. Model
  2. Application
  3. User
  4. Environment
  5. Tenant

• Build token and cost dashboards for leadership visibility.

9. Compliance & Audit Automation

• Automate compliance evidence generation:

  1. Policy enforcement proofs
  2. Audit trails
  3. Access logs
  4. Promotion records

• Reduce reliance on manual audit documentation.

Core Deliverables (Expected Outcomes)

• Enterprise-grade Azure GenAI reference architectures

• Reusable CI/CD pipeline templates

• Secure AI Gateway patterns

• Governed agent and tool frameworks

• Observability dashboards and alerts

• Regression test suites and golden datasets

• Platform onboarding guides and standards

Required Skills

Azure & AI Platform

• Azure OpenAI, Azure AI Foundry (mandatory)

• AKS, App Service, or Azure ML (deep expertise in at least one)

• Azure API Management / AI Gateway patterns

• Private networking, Managed Identity, Key Vault

LLMOps & Governance

• RAG architectures and evaluation

• Prompt, agent & config lifecycle management

• Model routing, fallback, and throttling strategies

• Multi-tenant GenAI platform experience (strongly preferred)

Automation & Engineering

• Python, Bash, YAML

• REST APIs and SDK-based automation

• CI/CD using Azure DevOps or GitHub Actions

• Terraform or Bicep

Observability & Reliability

• Langfuse, OpenTelemetry, Azure Monitor, App Insights

• SLIs/SLOs, incident management, production support

Good to Have

• Semantic Kernel

• Microsoft Agent Framework

• LangChain, Agno

• FastAPI

• Advanced DevSecOps controls (SBOM, image signing, admission checks)

• Azure security and architecture certifications


Requirements added by the job poster

• 5+ years of work experience with Azure OpenAI

• 5+ years of work experience with Azure AI Foundry

• 9+ years of work experience with Python (Programming Language)