Engineering Owner For Enterprise Workload Automation Platform
Serve as the engineering owner for New York Life's enterprise workload automation platform, ensuring the reliability, scalability, and resilience of mission-critical scheduling and batch processing services across on-premises and cloud environments.
In this role, you will operate and enhance enterprise scheduling platforms, calendars, and workload orchestration services while designing resilient restart and recovery patterns that enable critical business processes to run predictably and consistently. You will help establish standards for job design, logging, monitoring, and audit readiness, supporting a secure, compliant, and automation-first operating model.
Using platform engineering and Site Reliability Engineering (SRE) practices, you will automate operational processes, define and monitor service-level objectives (SLOs), improve observability, and lead incident response efforts to minimize downtime and improve system resilience. Your work will help ensure critical workload automation services consistently meet performance expectations and business service-level agreements (SLAs).
Platform Engineering & Reliability
- Operate and maintain enterprise scheduling controllers and agents across on-premises and cloud environments.
- Manage calendars, SLAs, alerting, and escalation processes.
- Design resilient restart, recovery, and rerun frameworks for critical workloads.
- Define platform standards, governance practices, SLIs, and SLOs.
- Lead platform upgrades, configuration management, and lifecycle maintenance.
Automation & Operational Excellence
- Implement job-as-code and configuration-as-code practices using Git and CI/CD pipelines.
- Develop automation solutions using PowerShell, Python, APIs, SQL, Terraform, and related technologies.
- Improve workload orchestration, dependency management, and operational consistency.
- Build monitoring, dashboards, alerts, and health checks to improve visibility and reliability.
- Lead incident triage, root cause analysis, and post-incident improvements.
Observability & Operations
- Integrate workload automation platforms with monitoring and observability solutions.
- Build dashboards, metrics, and alerts to improve visibility and operational efficiency.
- Lead incident triage, root cause analysis, recovery efforts, and post-incident improvements.
- Optimize workload performance, resource utilization, and platform reliability.
Partnership & Governance
- Partner with application, cloud, database, security, and operations teams to ensure reliable workload execution.
- Provide guidance on workload design, scheduling strategies, dependency management, and error handling.
- Support audit, compliance, and operational governance requirements.
- Document standards, playbooks, and best practices while mentoring team members.
What You'll Bring
- 5–8+ years of experience in workload automation, platform engineering, Site Reliability Engineering (SRE), production operations, or related environments.
- Hands-on experience with Stonebranch preferred, or another enterprise workload automation platform such as BMC Control-M, AutoSys, ESP, CA-7, IBM Workload Scheduler (TWS), Redwood RunMyJobs, ActiveBatch, JAMS, Tidal, Automic (UC4), or OpCon.
- Experience supporting enterprise batch processing, dependency modeling, workload orchestration, SLA management, and recovery processes.
- Strong scripting and automation skills using PowerShell, Python, Bash, SQL, REST APIs, JSON, and YAML.
- Experience with Git, CI/CD pipelines, Infrastructure as Code (Terraform, CloudFormation, AWS CDK, or similar), and automation tooling (Ansible, JFrog Artifactory, etc.).
- Strong AWS experience, including EC2, S3, Lambda, RDS/DynamoDB, VPC networking, observability, and high-availability architectures.
- Proven design and implementation of restart/rerun patterns, dependency modeling, and idempotent batch frameworks.
- Excellent coordination skills across incident and change processes, with clear, concise communication to technical and non-technical stakeholders.
- Strong troubleshooting, communication, and stakeholder management skills.
Nice to Have
- Experience in financial services, insurance, healthcare, or other highly regulated industries.
- Experience standardizing workload automation platforms and operational governance practices.
- Experience integrating workload automation platforms with enterprise monitoring and observability solutions.
- AWS, ITIL, CISSP, or related certifications.
How Success Will Be Measured
- Reduced SLA jeopardy events, breaches, and recovery times (MTTR).
- Increased adoption of standardized job templates, recovery patterns, and automated validation checks.
- Improved audit readiness through consistent logging, documentation, and evidence collection.
- Reduced manual intervention and alert noise while improving workload completion rates and platform reliability.
Pay Transparency
Salary Range: $90,000-$128,500
Overtime eligible: Exempt
Discretionary bonus eligible: Yes
Sales bonus eligible: No
Actual base salary will be determined based on several factors, including but not limited to the individual's experience, skills, qualifications, and job location. Additionally, employees are eligible for an annual discretionary bonus. In addition to base salary, employees may also be eligible to participate in an incentive program.