Senior MLOps Engineer
AI/ML
Remote
Contract
About the job:
Title: Senior MLOps Engineer
Start date: Immediate
Position Type: Contract/FTE
Location: Remote across Canada/USA
About the Role:
We are looking for a highly skilled Senior MLOps Engineer to lead the development, deployment, and operationalization of an in-house, on-premise code generation system. This role requires hands-on experience with GitLab CI/CD and a strong foundation in LLMOps to manage the lifecycle of large language models (LLMs) effectively.
As a critical member of the team, you will design, implement, and optimize robust MLOps pipelines, ensuring the seamless operation of our code generator. Your contributions will be instrumental in integrating cutting-edge AI into our workflows while maintaining the highest standards of performance, security, and scalability.
Key Responsibilities:
GitLab CI/CD and Automation:
- Design and implement GitLab CI/CD pipelines for model and code deployment in an on-prem environment.
- Automate testing, deployment, and rollback processes for machine learning workflows.
LLMOps Expertise:
- Develop pipelines specifically tailored for managing large language models, including fine-tuning, version control, and automated deployments.
- Implement monitoring systems to track model performance, latency, drift, and data quality.
MLOps Pipeline Development:
- Build scalable pipelines for model training, evaluation, and deployment, leveraging tools such as MLflow, Kubeflow, or Airflow.
- Ensure reproducibility and traceability of experiments and models.
Infrastructure and Security:
- Architect and manage a secure, on-premise infrastructure optimized for high-performance compute environments (e.g., NVIDIA GPUs, TPUs).
- Implement robust security practices for handling sensitive data and ensure compliance with industry standards.
Collaboration and Guidance:
- Work closely with AI researchers, software engineers, and DevOps teams to integrate LLM-based code generation tools into existing systems.
- Provide guidance on best practices for LLMOps and MLOps adoption across the organization.
Optimization and Scalability:
- Optimize LLM inference and training workflows for cost, speed, and efficiency.
- Scale the system to support multiple concurrent users and a high daily volume of API calls within an on-premise setup.
Documentation and Training:
- Maintain detailed documentation for pipelines, infrastructure, and workflows.
- Train team members and stakeholders on tools and practices for effective LLM and MLOps workflows.
Key Requirements:
Technical Skills:
GitLab CI/CD:
- Hands-on experience building GitLab CI/CD pipelines for machine learning projects.
- Expertise in automating complex workflows with GitLab.
LLMOps:
- Proven experience managing large language models (LLMs) in production.
- Familiarity with fine-tuning and deploying LLMs using tools like Hugging Face Transformers or OpenAI APIs.
- Experience with continuous learning pipelines for LLMs.
MLOps Fundamentals:
- Strong skills in MLOps tools like MLflow, DVC, or Kubeflow.
- Proficiency in Python and machine learning libraries such as PyTorch or TensorFlow.
Infrastructure:
- Experience with containerization and orchestration (Docker, Kubernetes).
- Knowledge of GPU-optimized workflows and distributed systems.
Soft Skills:
- Exceptional problem-solving and troubleshooting abilities.
- Strong communication and collaboration skills to work with cross-functional teams.
- Leadership qualities to mentor junior engineers and lead complex projects.
Preferred Qualifications:
- Experience with hybrid MLOps and DevOps workflows.
- Understanding of secure on-prem deployments for AI/ML systems.
- Knowledge of prompt engineering and evaluation techniques for LLMs.