Overview

NTT DATA Services strives to hire exceptional, innovative and passionate individuals who want to grow with us. If you want to be part of an inclusive, adaptable, and forward-thinking organization, apply now.

We are currently seeking a Site Reliability Engineer to join our team in Halifax, Nova Scotia (CA-NS), Canada (CA).

The Site Reliability Engineering (SRE) team drives the reliability, recoverability and operational efficiency of the Technology Asset Inventory (TAI) product portfolio. Reporting to the SRE Lead, key features of this role include implementing advanced observability, troubleshooting complex systems, task automation, and technical debt management.

Members of the SRE team are expected to work closely with the TAI user community on day-to-day usage of the products, as well as with our internal development and engineering squads and the offshore support team that provides first-line support.

Candidates will have the technical skills required to support these products on a Linux platform. Prior task automation experience in at least one programming language is expected. Hands-on experience with at least one pillar of observability is required, ideally including experience in defining system monitoring rather than only reacting to alerts.

Responsibilities include:

  • Building and maintaining front-to-back knowledge of the Technology Asset Inventory product portfolio, and then specializing in one or two of its systems
  • Maximizing the availability and performance of supported systems through optimized and automated plant management, ongoing problem management, and architecture reviews with product delivery engineers
  • Reducing the cost of support (hours of effort) by eliminating operational issues, optimizing and automating tasks, developing operational tools, and driving client self-service to minimize constraints
  • Identifying and prioritizing technical debt that risks instability or creates wasteful operational toil
  • Consulting with clients (the Firm’s internal development community) to maximize their productivity, including troubleshooting toolchain issues
  • Being operationally responsive, including sharing an on-call rotation with the rest of a large, global team (with a time-off-in-lieu system)

Required Qualifications / Skills

  • 5 years of strong Linux troubleshooting experience
  • 5 years of task automation experience in any programming language
  • Practical experience of at least one pillar of observability (metrics, logs or traces)
  • Working knowledge of at least ONE of the following areas:
  • Databases (Sybase, DB2, MSSQL, etc.)
  • SQL
  • REST services (API)
  • Load balancing and networking
  • Performance troubleshooting and resolution
  • Confident collaboration skills

Desired Skills

  • Python development for task automation
  • Experience with site reliability engineering practices, such as service level objectives (SLOs), error budgets, blameless postmortems, and toil reduction
  • Prior experience creating operational dashboards (Splunk, Grafana, etc.)

About NTT Data Canada

NTT Data Canada drives outcomes that keep their clients a step ahead in the digitally dynamic world. Their team of more than 50,000 professionals worldwide works with clients to address the challenges of today and tomorrow – whether it’s helping jump-start a cloud migration, reinvent the customer experience, streamline business processes or upgrade ageing infrastructure. As a division of NTT DATA Corporation, a top 10 global IT services provider with 120,000+ employees in more than 50 countries, they excel in blending IT and business expertise with decades of industry know-how. NTT Data Canada offers one of the industry’s most comprehensive services portfolios, designed to modernize business and technology to deliver the outcomes that matter most to their clients.