HPC Systems Engineer III (National Center for Atmospheric Research)
Job posting number: #153132 (Ref:REQ-2024-186)
Job Description
Job Description Summary:
UCAR is excited to announce an opening for an HPC Systems Engineer III. This position is responsible for providing system engineering service and support for the Computational & Information Systems Laboratory’s (CISL) high-performance supercomputers, high-performance networks, service infrastructures (e.g., JupyterHub, Globus, containers, Open Science Data Federation (OSDF)), and storage services. The environment is composed of multi-vendor resources with numerous specialized hardware components and requires coordination and communication with the other groups and divisions within CISL. Production systems supported are located at the NSF NCAR Wyoming Supercomputing Center (NWSC) in Cheyenne, Wyoming. The engineer may be required to work at the NWSC during periods of system installation, system upgrade, or system troubleshooting.
NSF NCAR’s Computational and Information Systems Laboratory (CISL) is a leader in supercomputing and data services necessary for the advancement of atmospheric and geospace science. CISL’s mission is to remain a leader at the forefront of ensuring that research universities, NCAR, and the larger atmospheric, oceanographic, and related research communities have access to the computational resources they need for their research. To fulfill the need for a stronger workforce at the intersection of High Performance Computing (HPC) and geoscience problems, CISL engages in education and outreach activities to inspire and attract a diverse future workforce.
Position Details:
Visa Sponsored Job: No
Relocation Assistance Eligible: Yes
Job Location: Boulder, Colorado
Position Type & Term: Full time, Regular
Compensation Range: Salary Range $140,000 - $175,000
*Final salary and rates are based on education, experience, and skills relevant to the role.*
Application Notes
Application Deadline: This position will be posted until 11:59 PM MT on
Required application materials (preferably in PDF Format):
Resume
Cover Letter - Please address how your skills and experience meet the needs of this position (for more information, please refer to the Key Responsibilities and Knowledge, Skills, and Abilities sections of this job posting).
Questionnaire - Included in Workday application; please reflect on your own personal and professional experiences to provide examples
- Please share ONE specific example of how you have created an environment where your teammates feel safe to provide feedback on initiatives you are working on.
- Please explain how you have used horizontal scaling in your work.
- Please explain how you see AI tools, like large language models, potentially enhancing your work.
Partial relocation assistance is available for this position to eligible candidates.
UCAR/NCAR will not sponsor a work visa (e.g., J-1, H-1B, etc.) for this position. U.S. Citizenship, Permanent Residency, or other protected status under 8 U.S.C. 1324b(a)(3) is required for this position.
-----------------------------------------
What you will do:
As part of the High Performance Computing Systems Group (HSG), provides system engineering leadership and support for the Computational & Information Systems Laboratory’s (CISL) high-performance supercomputers, block and object storage systems, data archival systems, high-performance networks, and data transfer services. The environment is composed of multi-vendor resources with numerous specialized hardware components and requires coordination and communication with the other groups and divisions within CISL.
Primary job location is in Boulder, Colorado. Production systems supported are located at the NCAR Wyoming Supercomputing Center (NWSC) in Cheyenne, Wyoming. The engineer may be required to work at the NWSC during periods of system installation, system upgrade, or system troubleshooting.
Responsibilities:
Software Engineering and Development
Develops, implements, and documents new features or capabilities in system administration and system monitoring software. Develops and maintains systems software as necessary for the deployment and management of all aspects of high-performance supercomputers, clusters, storage, and network fabrics. Develops and maintains security monitoring and analysis software. Performs installation and necessary hardware and software integration as part of the HPC infrastructure deployments and upgrades. Helps define group standards and guidelines for software development and documentation. Leads software development projects including requirements gathering, design, and project management. Writes code to enhance system management capabilities of the HPC infrastructure and automate repeatedly performed system administration tasks. Manages, designs, and develops benchmarking tool suites for use during procurement and for ongoing performance monitoring of the high-performance computing environment. Develops acceptance testing criteria and applications for system procurement.
Research and Evaluation
Researches new and emerging technology (e.g., cloud), evaluates the potential impact of the new hardware and software technology on workflows and plans, and makes recommendations to the HPCD division and CISL management for future procurement of hardware and software products, configurations, and functional enhancements or upgrades in support of the high-performance computing environment. Performs evaluations and benchmarks, and compiles reports on new hardware and software systems related to the high-performance computing environment (i.e., computing, storage, networking).
Participates in projects relating to the high-performance computing environment and may have direct responsibility for design and procurement decisions. This may include development of systems level code to support the various aspects of the HPC infrastructure software and hardware. Participates in the RFP process by contributing to the technical specification, requirements definition, review, decision making, acceptance, and implementation for future procurement.
Operational Monitoring and Troubleshooting
Operates and monitors the behavior of the group-managed supercomputers, clusters, servers, storage, and network fabrics on a routine, daily basis to ensure proper and efficient operations. Alerts other HPC Systems Group staff, vendor representatives, and/or NWSC staff of abnormal conditions or behaviors, as appropriate, and takes remedial actions as necessary. Diagnoses and may repair failed software and/or hardware components, or may mentor or assist other staff in doing so.
Provides service on a 7x24 on-call basis, troubleshooting and resolving system-related problems presented by users, other sections in CISL, and vendor-employed engineers and analysts. Refers and escalates problems to senior members of the HPC Systems Group or appropriate staff as necessary. Documents troubleshooting and operational techniques and best practices, and mentors other team members when necessary.
Systems Administration
Provides systems support for diverse hardware and software architectures. Leads the installation and upgrades of system hardware and software, including computational systems, clusters, standalone machines, storage systems and a variety of network fabrics including Ethernet, InfiniBand, and Fibre Channel. Helps define standards and guidelines for operation and maintenance, and produces systems operation and procedural documentation. Compiles, installs and maintains commercial and open source application software. Documents system administration tasks and mentors other team members when necessary.
Project Management
Leads team projects utilizing standard project management tools and techniques. Under the direction of the HSG group lead, provides project coordination, technical expertise, and planning for system deployment projects. Develops budgets, project timelines, and task structures for the group. May guide and review the tasks of team members and provide guidance as necessary. May participate in cross-group and cross-division projects as necessary, including taking a lead role.
Organizational Representation and Reporting
Provides regular HSG activities reports to management and may contribute to CISL or NCAR annual report and development plans. Attends group, division, and laboratory meetings and may represent HSG and its activities at such meetings. May represent the group at larger organizational meetings and broader community events as appropriate.
Who We'd Love to Join Our Team
Successful candidates will ensure their application materials speak to the following criteria:
Education & Experience
Bachelor’s degree and eight to twelve years of progressive experience or equivalent combination of education and experience in one or more of the following fields: Computer Science, Mathematics, Computer/Electrical Engineering, Information Sciences, Software Engineering, or equivalent related field.
Knowledge, Skills, and Abilities
Demonstrated skill in the installation, configuration, administration, troubleshooting, and securing of compute clusters
Experience with deploying and maintaining infrastructure for hardware and software stacks for services such as Globus, JupyterHub, Kubernetes
Demonstrated skill in the configuration and troubleshooting of high-performance Ethernet fabrics
Demonstrated skill in operating container infrastructure
Demonstrated skill in common scripting and programming languages (e.g., ANSI/GNU C, Python, etc.) and general software engineering practices
Demonstrated skill in performing tasks requiring organization and attention to detail
Excellent written and verbal communication skills and the ability to write and interpret systems documentation
Communicates effectively with lab and/or program. May communicate with entire organization.
Able to explain concepts with high technical complexity to others of various technical backgrounds. This may include risks, control, and impacts.
Employs active listening to lab or program needs to create solutions to technical problems at a high level of complexity.
Makes formal presentations at lab or program level and advocates for proposed solutions.
Ability to work collaboratively with teams of different skill levels and backgrounds
Ability to mentor team members and collaborators
Ability to function effectively within a matrixed, multidisciplinary team
Maintains professional contact with members of industry and sponsors.
May interact at national level with sponsors/presentations.
Desired, but not required:
Experience with infrastructure as code solutions, such as Ansible
Experience with on-premises as well as commercial clouds
Experience with and/or interest in project and team management
Experience with infrastructure for CI/CD workflows
Experience with high-performance computing and related technologies
OTHER REQUIREMENTS:
Occasional travel to the NCAR Wyoming Supercomputing Center, which is approximately 90 miles north of Boulder
Periodic 7x24 on-call support in rotation with other staff
Providing assessment and feedback on vendor technology roadmap, RFI/RFP to the HSG group head and the HPCD division director
Work location requirements:
This position is expected to support a hybrid format (remote and in-person work) with some days each week in-person at the primary Boulder, CO office. Production systems supported are located at the NCAR Wyoming Supercomputing Center (NWSC) in Cheyenne, Wyoming and the Systems Engineer will be required to work at the NWSC to assist with new supercomputers, storage commissioning, major upgrades, outages, downtimes, etc.
Benefits Overview
UCAR affirms its commitment to employees through competitive benefits. In addition to medical, dental, vision, retirement, and life insurance, UCAR offers a variety of programs focused on work-life balance and professional and personal development. These include:
Tuition Assistance, time off allowance to attend classes, and other professional development opportunities
UCAR contributes 10% of your eligible pay into your retirement account; 100% fully vested on day one
Starting minimum accrual of 20 days of personal time off each year (prorated for less than full-time positions)
10 paid holidays
10 days of sick leave each year
12 weeks of paid parental leave
Short-term medical leave paid at 100% of your regular salary
EcoPass for local Colorado residents to use the Denver and Boulder-area transit system at no cost
Commitment to Diversity, Equity & Inclusion
Our organization is committed to creating a diverse, equitable, and inclusive work environment and fostering a culture where everyone feels welcome and supported. To learn more about these efforts, visit the Office of Diversity, Equity & Inclusion Strategic Plan and our Diversity & Inclusion: A Welcoming Workplace site.
Research shows that women and people of color are less likely to apply for a position if they do not meet almost 100% of the desired skills and experience. Please note this is not necessary! If you meet the minimum requirements and have a passion for the work, you are encouraged to apply. We can provide on-the-job training for the rest!
Commitment to Job Application Fairness
Applicants are not required to provide age or age-related information and may redact information related to age, date of birth, or dates of attendance at or graduation from an educational institution from any submissions during the initial application process.
Some Final Considerations
At UCAR|NCAR|UCP, you will work alongside a dedicated team of professionals conducting critical research and community outreach to solve complex Earth system science problems including climate change, air pollution, extreme weather, floods, drought, wildfires, and space weather, all with the goal of improving human life and reducing economic loss. Each of us, from scientists to the professionals who support their work, serves the public and a collaborative community of scientists in our mission to understand the complex processes that make up the Earth system, from the ocean floor to the Sun’s core.
Flexible Work
At UCAR, we are committed to supporting our mission by giving staff the flexibility to find the schedule and location that works best to maintain their own work-life circumstances and reach their full potential as professionals. Many positions within our organization are eligible for fully on-site, hybrid, fully-remote and/or flexible work schedules.
Equal Opportunity Employer
UCAR is committed to providing equal opportunity for all employees and applicants for employment and does not discriminate on the basis of race, age, creed, color, religion, national origin or ancestry, sex, gender, disability, veteran status, genetic information, sexual orientation, gender identity or expression, or pregnancy. Whatever your intersection of identities, you are welcome at UCAR.
Export Control
All positions are required to comply with U.S. export compliance regulations and work location requirements regarding access to facilities and research systems.
Visa Wait Times
Please consider the length of visa procurement when applying for this posting, understanding that you will not be able to begin employment until you are able to get a visa and enter the U.S.