As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world. Colossus Cloud is at the heart of NVIDIA's GPU bring-up infrastructure strategy and is used for all of NVIDIA's software development and QA. The cloud service offers many resource types to support its various use cases, such as bare metal for development and a managed Kubernetes service for CI/CD. As we grow and expand into new datacenters for both new product bring-up and scale, we are looking to hire a Cloud Efficiency Architect. This position involves crafting, implementing, and maintaining strong models for total cost of ownership (TCO), return on investment (ROI), and usage. The resulting efficiency insights help infrastructure teams, collaborators, and finance make data-driven decisions to optimize Colossus investments. The candidate must demonstrate strong business and technical competence with cloud concepts.
What you'll be doing:
Colossus Utilization & Cost Model Development: Design, build, and maintain comprehensive cost models for private cloud services, including compute, storage, network, and platform services.
Develop predictive models for Colossus resource consumption and demand, applying historical data and future projections to guide TCO predictions.
Build/Test Job Costing: Create granular cost models specifically for build and test jobs within Colossus, attributing costs to individual pipelines, projects, or teams.
Organizational (OrgN) Level Cost Allocation: Develop and refine cost allocation strategies to provide clear, actionable cost breakdowns by organizational unit, department, or business function (OrgN level).
Data Analysis & Reporting: Analyze large datasets from various Colossus services to identify cost anomalies, optimization opportunities, and trends. Develop and automate reports and dashboards to visualize key cost and utilization metrics for different collaborators.
Tooling & Automation: Evaluate, implement, and leverage FinOps and cloud cost management tools to improve reporting, forecasting, and optimization capabilities. Automate data collection and reporting processes where feasible.
Collaborator Communication: Present utilization models and insights in a clear, concise, and actionable manner to technical and non-technical audiences, including senior leadership.