AVP, Principal Engineer - AI Agent Engineering (L11) (6)

all cities, CO 6 | On-site | Posted 3 hours ago

Business Services & Consulting

About the Role

AVP, Principal Engineer - AI Agent Engineering (L11)

Synchrony (NYSE: SYF) is a premier consumer financial services company delivering one of the industry's most complete digitally enabled product suites. Our experience, expertise and scale encompass a broad spectrum of industries including digital, health and wellness, retail, telecommunications, home, auto, outdoors, pet and more.

Synchrony's Engineering Team is a dynamic and innovative team dedicated to driving technological excellence. As a member of this team, you'll play a pivotal role in designing and developing a cutting-edge tech stack and solutions that redefine industry standards.

We are looking for a hands-on Principal Engineer / Tech Lead who can both build and guide the build of production-grade agentic AI applications on AWS — primarily on the AWS Strands SDK and Bedrock AgentCore — and who treats agents as real software systems, not prompt experiments. This person owns the engineering quality bar for 1–2 delivery teams: design patterns, Python craft, testing, evals, CI/CD, observability, and on-call readiness. They write code roughly 60% of the time and lead/review/mentor the rest.

They consume the primitives defined by the AI Platform / Cloud Solution Architect (model gateway, MCP/tool registry, memory store, eval harness, guardrails, identity, kill switch) rather than rebuilding them, and hold their teams to that same discipline.

Key Responsibilities

Design and implement end-to-end agents on AWS Strands SDK and Bedrock AgentCore, including tool orchestration, multi-step reasoning, memory, HITL gates, and structured streaming. Use CrewAI or LangChain/LangGraph only where Strands does not fit, with explicit justification.
Write clean, idiomatic, testable Python: typed (mypy/pyright), modular, async-aware, with strong package boundaries. Follow SOLID and apply the right design pattern for the job (Strategy, Adapter, Repository, Factory, Mediator, Circuit Breaker, Saga) — and apply the right agentic pattern (ReAct, Plan-and-Execute, Reflection, Router, Hierarchical Multi-Agent, Tool-Use, RAG-Fusion) deliberately rather than by default.
Build agent backends as proper services (FastAPI), with clear domain models, dependency injection, hexagonal/ports-and-adapters separation between LLM/tool calls and business logic, and contract tests against platform APIs and MCP tools.
Implement RAG correctly: chunking strategy, embedding choice, hybrid retrieval, grounding/citations, and eval-driven iteration. Use only platform-provided vector stores and follow data-classification and residency rules.
Integrate with the model gateway, MCP/tool registry, identity (OIDC/OAuth2/SCIM), and observability SDK rather than calling model APIs or building auth/logging directly.
Set and enforce the engineering bar across 1–2 agile teams: code reviews, design reviews, ADR (architecture decision record) discipline, definition of done, and PR standards. Block merges that skip evals, observability, or guardrails.
Decompose roadmap items into well-shaped backlogs; pair-program with mid/junior engineers; mentor on Python, cloud, and agentic concepts; grow the next tech leads.
Own the team's technical roadmap in partnership with PM and the AI Platform architects; identify reusable primitives and push them upstream into the platform instead of forking.
Drive incident response for agent workflows: triage, RCA, postmortems, and reliability follow-ups. Carry primary on-call rotation alongside the team.
Write designs, runbooks, and ADRs that both engineers and non-technical stakeholders can read.
Own CI/CD for agent services end-to-end (Jenkins or GitHub Actions): unit + integration + contract tests, SAST/secret scanning, image build, IaC plan/apply, and gated promotions across dev/QA/prod AWS accounts.
Treat evals as a first-class CI gate: golden datasets, rule-based and LLM-as-judge scoring, replay harness for deterministic reproduction of production sessions, and regression checks on every model or prompt change.
Instrument every agent through the platform observability SDK: structured logs, OTEL traces with token/cost stamping, per-tool spans, and dashboards in CloudWatch / Splunk / New Relic. Define and meet SLOs (p95 latency, success rate, RAG groundedness).
Run online evals and drift detection on production traffic; wire kill-switch and cost circuit breakers; respond to guardrail and content-safety incidents.
Contribute Terraform/CDK modules for agent services and follow least-privilege IAM, private networking, and secrets-via-vault patterns by default.

Required Skills/Knowledge:

Bachelor's degree in Computer Science / Engineering or equivalent practical experience.
8+ years of software engineering experience, with 3+ years as a tech lead / staff / principal engineer on cloud-based production systems.
Expert-level Python: typing, async, packaging, testing (pytest, Hypothesis), API design (FastAPI), and clean architecture. Able to read and improve a teammate's code on sight.
Demonstrated mastery of software design patterns and the judgment to know when *not* to use them. Comfortable leading design reviews and ADR discussions.
Hands-on experience building agentic / LLM applications in production or advanced pilots, including tool use, multi-step reasoning, memory, and HITL.
Working knowledge of AWS Strands SDK and/or Bedrock AgentCore SDK, plus core AWS services (Bedrock, IAM, VPC/networking, S3, ECR/EKS or ECS, Secrets Manager, CloudWatch).
Solid RAG fundamentals: embeddings, vector stores, hybrid retrieval, grounding, eval-driven iteration.
Strong DevOps / LLMOps: CI/CD pipelines, IaC (Terraform or CDK), containerization, observability (logs/traces/metrics), and incident response. Has carried on-call.
Experience integrating through API / model / MCP gateways with proper authn/z, rate limiting, retries, idempotency, and error semantics.
Track record of mentoring and raising the bar for a team — not just shipping personal code.
Strong written and verbal communication across engineering, product, security, and risk audiences.

Desired Skills/Knowledge:

Production experience with CrewAI and/or LangChain / LangGraph for cases that fall outside Strands.
Multi-model fluency (AWS Nova, Anthropic Claude, OpenAI, Gemini) and pragmatic routing / cost-optimization strategies (cheap-first, cascade, semantic cache).
Familiarity with agentic protocols (MCP, A2A, ACP, AP2) and multi-agent collaboration patterns.
Front-end integration experience (React or similar) for streaming agent UX, chat surfaces, and HITL approval flows.
Custom AI observability: agent-quality metrics, safety/guardrail telemetry, RAG groundedness scoring.
Financial services or other regulated-industry experience (PCI / SOX / GLBA / SR 11-7 / NIST AI RMF awareness).
Open-source contributions, conference talks, or technical writing in the agentic AI space.

Eligibility Criteria: Minimum 8+ years of experience as described under "Required Skills/Knowledge", with a Bachelor's degree or equivalent. In lieu of a degree, a minimum of 10 years of experience is required.

Work Timings: 2 PM – 11 PM IST (Suggested)

This role qualifies for Enhanced Flexibility offered in Synchrony India and will require the incumbent to be available between 06:00 AM and 11:30 AM Eastern Time (timings are anchored to US Eastern hours and will adjust twice a year locally). This window is for meetings with India and US teams. The remaining hours will be flexible for the employee to choose. Exceptions may apply periodically due to business needs. We are proud to offer flexibility at Synchrony.

Our way of working allows you the option to work from home or from workspaces in our Regional Engagement Hubs: Hyderabad, Bengaluru, Pune, Kolkata, or Delhi/NCR. Occasionally you may be required to commute or travel to Hyderabad or one of the Regional Engagement Hubs for in-person engagement activities such as business or team meetings.
