Role Overview
This position is responsible for maintaining the stability and integrity of core software systems, with a focus on cloud-based integrations and data workflows. As the first responder to production incidents, you'll monitor system health, triage alerts, and lead resolution efforts across critical infrastructure components including Azure Logic Apps, FTP/SFTP services, and API-driven processes.
Key Responsibilities
- Monitor system performance and integration pipelines in real time, identifying and responding to anomalies or failures as they occur.
- Diagnose and resolve technical issues related to configuration, connectivity, data formatting, and workflow execution across distributed systems.
- Conduct root-cause analysis by reviewing logs, API responses, and system behavior to determine the source of incidents.
- Ensure reliable data flow between internal platforms and external partners by validating file formats, mappings, and connection settings.
- Support billing and transactional workflows by identifying data gaps, routing errors, or processing failures.
- Act as a primary contact during production incidents, assessing impact, initiating escalations, and coordinating with engineering and QA teams.
- Document incidents thoroughly, track resolution progress, and maintain accurate records in ticketing systems such as Azure DevOps or similar tools.
- Collaborate with integration specialists during client onboarding to help design and validate new data connections.
- Improve monitoring effectiveness by refining alert thresholds, dashboards, and detection rules based on operational experience.
- Develop and maintain runbooks, knowledge base articles, and troubleshooting guides to support consistent issue resolution.
- Translate technical findings into clear summaries for both technical and non-technical stakeholders.
- Identify recurring issues and work with teams to implement preventive measures, automation, and process improvements.
- Support data updates from external sources while ensuring consistency, compliance, and system integrity.
- Generate reports on incident trends, response times, and remediation outcomes to inform operational decisions.
Required Qualifications
- 2–5 years of experience in technical or production support within a software or SaaS environment.
- Proven experience with Microsoft Azure, particularly Logic Apps and associated monitoring and DevOps tools.
- Familiarity with FTP/SFTP integrations, API troubleshooting, and network-level issues such as authentication, timeouts, and DNS.
- Working knowledge of SQL and Excel for data validation, analysis, and root-cause investigation.
- Ability to interpret system logs, error messages, and performance metrics to diagnose problems.
- Strong analytical skills with a focus on precision, data integrity, and systematic problem solving.
- Experience managing multiple concurrent tasks in a fast-paced, high-pressure environment.
- Clear and effective communication skills, especially in summarizing technical issues for broader audiences.
- Track record of collaborating with engineering, QA, and operations teams to resolve system defects.
- Commitment to process, documentation, and continuous improvement—including automation opportunities.
- Organized approach to ticket management, SLAs, and structured workflows.
Preferred Qualifications
- Degree in Computer Science, Information Systems, Engineering, or a related technical field (optional).
- Background supporting billing systems, payment processors, or financial transaction platforms.
- Experience in high-volume environments with complex, multi-system integrations.
- Hands-on use of incident management or DevOps tools such as Jira, ClickUp, or Azure DevOps.
- Interest or experience in designing monitoring strategies for cloud-native applications.
- Initiative in automating repetitive tasks or improving operational workflows.
Technical Environment
Azure Logic Apps, Azure DevOps, FTP/SFTP, REST APIs, SQL, Excel, monitoring platforms, Jira, ClickUp
Work Model
This is a remote role with a flexible schedule. Availability during afternoon and evening hours is strongly preferred to support later-day monitoring and incident response cycles.