Machine Learning Engineer

Gravity IT Resources

Nashville, TN, USA

Published: 6/14/2022

Engineering

Full Time

Job Description

Machine Learning Engineer

Employment Type: Full-Time

Location: Nashville, TN (hybrid)

About the Role

We’re hiring a Machine Learning Engineer to design and deploy AI systems end-to-end, from data preparation and evaluation to model fine-tuning, inference, and agentic workflows. You’ll work closely with product and engineering teams to deliver reliable, cost-effective, and scalable LLM-powered solutions on AWS.

What You’ll Do

End-to-End GenAI Solutions: Scope problems, choose the right approach (prompt engineering, fine-tuning, agents), implement, evaluate, and deploy.
Data & SQL: Write efficient SQL for analytics and data prep; manage schemas and pipelines for model training and inference.
Model Training & Fine-Tuning: Run supervised fine-tuning (PEFT/LoRA/QLoRA), optimize prompts, and manage experiment tracking/evaluation.
Agentic Systems: Build agent workflows with tool use, memory, and safety/guardrails.
Inference & Deployment: Package services with Docker, optimize latency and cost (batching, caching, quantization), and deploy on AWS (ECS, EKS, SageMaker, Lambda), using GPU acceleration where the service supports it.
MLOps & Observability: Set up CI/CD for models/prompts; maintain offline/online evaluation pipelines, monitoring, and rollback strategies.
Security & Compliance: Implement data governance, PHI/PII protections, and guardrails against prompt injection and unsafe outputs.
Cross-Functional Collaboration: Work with product managers and engineers to align GenAI capabilities with product goals; clearly document and communicate trade-offs.
Production Readiness: Lead conversations around scaling, monitoring, and maintaining GenAI systems in production environments.

Minimum Qualifications

5+ years of software/ML engineering experience, including 2+ years building and deploying GenAI/LLM systems.
MS/PhD in Computer Science or Data Science, or equivalent experience.
Strong SQL and Python skills with solid software engineering fundamentals.
Experience with agent frameworks (LangGraph, AutoGen, CrewAI) and tool-driven agents.
Hands-on experience with deep learning (PyTorch or TensorFlow) and LLM fine-tuning (SFT and PEFT methods such as LoRA/QLoRA).
Production experience with Docker and AWS (ECS, EKS, SageMaker, Lambda, or GPU services).
Experience building scalable data and model pipelines for training and deployment.
Familiarity with prompt engineering, evaluation frameworks (LLM-as-judge, metrics), and offline test harnesses.
Understanding of security & compliance for sensitive data (e.g., PHI/PII).
Excellent problem-solving, communication, and documentation skills.

Preferred Qualifications

Experience with inference optimization: quantization (bitsandbytes, GPTQ/AWQ), batching, caching, or vLLM.
Background in healthcare, including HIPAA compliance or medical data handling.
Experience with experiment tracking (MLflow, W&B), CI/CD for ML, and monitoring tools (Prometheus, Grafana).
Familiarity with major LLM APIs and open-source models (OpenAI, Anthropic, Llama, Mistral).

Tech Stack

Languages: Python, SQL
DL/LLM: PyTorch, TensorFlow, Hugging Face, PEFT/TRL, vLLM
Data: Snowflake, Postgres
Cloud: AWS (ECS, EKS, SageMaker, Lambda)
MLOps: Docker, CI/CD, MLflow or W&B