Industry: Mobility

From Click Ops to Self-Service

Modernizing a Scheduling and File-Transfer Platform



Please note: The English version of this success story was translated using AI to make it accessible to our international audience.

tl;dr

  • Industry: Mobility
  • UC4/Automic Legacy Setup for Scheduling and File Transfers
| | Before | After |
|---|---|---|
| Team | 7 people, 100% utilization | 3 people, casually working on the side |
| Users | 😡 | 🤩 |
| Integrations | approx. 2,500 convoluted root workflows | approx. 1,000 standardized solutions |
| Delivery time per development | Months to years | Hours |
| Effort per development | Person-weeks | Person-hours |
| Improvement vs. Operations (Build vs. Run) | 🛠️🏃‍♂️🏃‍♀️🏃‍♂️🏃‍♀️🏃‍♂️🏃‍♀️🏃‍♂️🏃‍♀️🏃‍♂️ | 🛠️🛠️🛠️🛠️🛠️🛠️🛠️🛠️🏃‍♀️🏃‍♂️ |
| Backlog | 1,000 tickets, trending upwards | Never more than 100 tickets, stable |
| Error Rate | 💥💥💥 | 🏝️ |
| Request Process | Ticket Ping-Pong | Self-Service + Support |

Situation: Traditional Click Ops

The client company had built up a comprehensive UC4/Automic installation over several years. Its original purpose was the orchestration and automation of routine tasks—typically daily and monthly batch processes. The integrated systems included SAP, Talend DI, Oracle DB, COBOL, and PHP jobs.

Automic was also used to operate SFTP integration paths with external partners (e.g., business customers, intermediary partners, loyalty program providers, etc.).

As part of the transformation to a DevOps/SRE culture, the team was moved from the infrastructure department to software development. The new engineering manager attempted to manage the burgeoning backlog by increasing the team size with Automic experts (some internal, some external). The desired results were not achieved.

Delivery times skyrocketed. The error rate was notorious. The team became a bottleneck. Software development teams (essentially the Automic team’s direct customers) complained regularly.

The team of product experts built solutions manually (essentially using “ClickOps”) in the Automic 11 Java UI. This working method was extremely error-prone. Despite high discipline, no maintainable solutions were produced. Regression errors occurred frequently. No one dared to touch existing solutions. No one dared to update the underlying system.

At the same time, vulnerabilities in the IT security of the Automic service configuration came to light. It was clear that only a painful, radical change of course could bring improvement.

The positive aspect of this complex situation was that the acute pain fostered a strong commitment and a long-term perspective. Short-term inconveniences were accepted. Periods of hardship were endured in order to achieve long-term success.

Challenge: Uncontrolled Growth and Loss of Control

  • The system had barely discernible architectural guidelines. Initially, it was completely unclear how the Automic objects (the executable units in Automic) were interconnected.
  • It was unclear what impact changes to the system would have.
  • The team had accumulated a great deal of frustration because the transition from a purely infrastructure-oriented culture to a development-oriented culture had not been completed. The negative atmosphere among the customer teams due to high error rates and long delivery times acted as a catalyst.
  • Due to the postponement of necessary modernizations and upgrades, the gap between the deployed system versions (e.g., base operating system, Automic, etc.) and the currently available versions was so large that a low-risk upgrade was no longer possible.
  • The processes running through the system were critical. Everything had to remain operational at all times. Finance workflows, in particular, could not be allowed to fail under any circumstances—but neither could other critical workflows or historically established critical workarounds.
  • In previous years, there was a strong focus on technology-centric teams, leading to the “fool with a tool” effect. Vendor tools like Automic or Talend were heavily misused as one-size-fits-all solutions, resulting in a multitude of unmaintainable custom-built solutions.

Solution: Fully Automated Internal Product

To solve the problem, several issues had to be addressed simultaneously:

  • Team Culture
  • Platform Modernization
  • Delivery Processes for New Solutions
  • Handling Existing Solutions

Team Culture

  • Shifting the team’s self-image from project team to product team
  • Creating a visual roadmap (with short-term milestones and long-term goals)
  • Backlog cutting: Determining predicted delivery dates for solution requests. If a request was certain not to be fulfilled for at least a year, communicating this openly and rejecting it, with an invitation to resubmit if it is still relevant at that time. This radically reduces the backlog and the cognitive load on the team.
  • Intensify direct customer contact with the engineering teams and build an exceptional support culture.
  • Establish shared ownership and a shared backlog to address hidden work and team member overcommitment.
  • Implement pair and ensemble programming sessions, co-design, and peer review.
  • Grant read-only permissions to all engineers to increase system transparency and crowdsource troubleshooting for existing solutions.
  • Measure clear KPIs (Feature Lead Time, Change Frequency, MTBF, MTTR, Change Failure Rate).
  • Implement SRE incident practices to proactively identify and resolve incidents, improving MTTR and thus availability.
  • Refine architectures and interfaces with other components in the system landscape and develop clear architectural guidelines.
  • Align SLOs between the team and its customers to ensure that requirements are met on both sides.
  • Gradually reduce team size (by not filling vacant positions).
  • Hands-on support for team members, especially with tedious tasks (e.g., manual migration of individual solutions), setting a good example.
  • Clear coaching of individual team members to make personal growth tangible throughout the process.
  • Building trust so that the team felt confident enough to constructively reject misguided requests and proactively suggest better solutions.
  • Side effect: The team became a sought-after hub for knowledge about cross-team or cross-product business processes.
  • Side effect: The team took on leadership of high-responsibility cross-team projects.
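
The KPIs named in the list above (MTBF, MTTR, Change Failure Rate) can be derived from plain incident and deployment records. The following is a minimal sketch, not the team's actual tooling: the `Incident` and `Deployment` types, their fields, and all sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    started: datetime   # when the failure began
    resolved: datetime  # when service was restored


@dataclass(frozen=True)
class Deployment:
    change_id: str
    failed: bool        # True if the change caused an incident or rollback


def downtime(incidents: list[Incident]) -> timedelta:
    """Total time spent in a failed state."""
    return sum((i.resolved - i.started for i in incidents), timedelta())


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to restore: average incident duration."""
    return downtime(incidents) / len(incidents)


def mtbf(incidents: list[Incident], window: timedelta) -> timedelta:
    """Mean time between failures: uptime in the window per failure."""
    return (window - downtime(incidents)) / len(incidents)


def availability(incidents: list[Incident], window: timedelta) -> float:
    """Fraction of the observation window in which the service was up."""
    return (window - downtime(incidents)) / window


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Share of deployments that led to a failure."""
    return sum(d.failed for d in deployments) / len(deployments)


# Hypothetical sample data: two incidents (1 h and 3 h) in a 30-day window.
WINDOW = timedelta(days=30)
INCIDENTS = [
    Incident(datetime(2024, 1, 5, 8, 0), datetime(2024, 1, 5, 9, 0)),
    Incident(datetime(2024, 1, 20, 22, 0), datetime(2024, 1, 21, 1, 0)),
]
DEPLOYMENTS = [
    Deployment("c1", False),
    Deployment("c2", True),
    Deployment("c3", False),
    Deployment("c4", False),
]

if __name__ == "__main__":
    print("MTTR:", mttr(INCIDENTS))
    print("MTBF:", mtbf(INCIDENTS, WINDOW))
    print("Availability:", round(availability(INCIDENTS, WINDOW), 4))
    print("Change failure rate:", change_failure_rate(DEPLOYMENTS))
```

Because availability here is uptime divided by the observation window, shortening incidents (a lower MTTR) raises availability directly, which is the relationship the SRE incident practices above rely on.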

Handling Existing Solutions

  • Building an analysis solution in Clojure
    • Connecting the Automic database and the Automic Java API as data sources
    • Extracting object configurations and object relationships
    • Interactive visualization/mapping of object dependencies with d3.js
    • Creating a “wallpaper” in the office as a permanent basis for discussion
    • Classifying reusable library objects
    • Derivation of approximately 2,500 “root workflows” and their triggers (events, schedules, human operators, external references in other workflows, JCL scripts, API calls)
  • Cataloging/inventorying all 2,500 solutions
    • Clarifying ownership for all solutions
    • Obtaining owner buy-in through commitment interviews
    • Identifying the business context (and thus the departmental affiliation)
    • Assigning to the team level—with fallback to the engineering manager of the respective department
  • Defining a migration roadmap for the solutions based on a collection of Migration Templates:
    • Complete decommissioning (e.g., for low-profit use cases)
    • Replacement with an alternative technical solution (outside of Automic)
    • Standardization
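
The derivation of root workflows from extracted object relationships can be illustrated with a small graph computation. The original analysis was written in Clojure against the Automic database and Java API; the sketch below is a simplified Python illustration that assumes the relationships are already available as (caller, callee) pairs. All object names are hypothetical, and the trigger types listed above are not modeled.

```python
from collections import defaultdict

# Hypothetical extracted relationships: (caller, callee).
EDGES = [
    ("WF_MONTHLY_CLOSE", "WF_EXPORT"),
    ("WF_MONTHLY_CLOSE", "JOB_SAP_POST"),
    ("WF_EXPORT", "JOB_SFTP_PUSH"),
    ("WF_PARTNER_SYNC", "JOB_SFTP_PUSH"),
]


def root_objects(edges):
    """Objects that call others but are never called themselves."""
    callers = {caller for caller, _ in edges}
    callees = {callee for _, callee in edges}
    return sorted(callers - callees)


def affected_roots(edges, changed):
    """Root objects that transitively depend on the changed object."""
    parents = defaultdict(set)
    for caller, callee in edges:
        parents[callee].add(caller)

    seen, stack = {changed}, [changed]
    while stack:  # walk upwards; the seen set also guards against cycles
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return sorted(seen & set(root_objects(edges)))


if __name__ == "__main__":
    print(root_objects(EDGES))
    print(affected_roots(EDGES, "JOB_SFTP_PUSH"))
```

The second function answers the question raised in the Challenge section: which root workflows are touched when a shared object changes. Objects without any relationship do not appear in an edge list and would have to be supplied separately.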

Platform Modernization

  • Design of a new platform architecture with self-healing mechanisms, especially through automatic restarts and a “let it crash” approach.
  • Adoption of the company standard, OpsGenie, for alerting and on-call management
  • Setup of a parallel demo environment running the new version, initially without production use
  • 100% platform automation (using Ansible), clear component separation, and a unified component blueprint
  • Static compliance checks for internal components and for contracts with third parties (via ServerSpec).
  • Gradual development of a migration guide for individual solutions (with rapid test cycles in the demo environment).
  • Splitting solutions into migration batches along ownership boundaries
  • Revising interfaces with integrated systems
    • Example: Developing a significantly simplified interface to execute COBOL jobs with correct parameters as a standard Unix shell command
    • Example: Connecting the internal data shop’s reporting API (and replacing the previous convoluted integration methods)
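
The “let it crash” approach with automatic restarts mentioned above can be sketched as a small supervisor loop. The source does not describe which mechanism the platform actually used, so the function names, the restart limit, and the simulated worker below are assumptions made purely for illustration.

```python
import time
from typing import Callable


def supervise(worker: Callable[[], None],
              max_restarts: int = 3,
              backoff_seconds: float = 0.0) -> int:
    """Run worker; on a crash, restart it up to max_restarts times.

    Returns the number of restarts that were needed. If the limit is
    exhausted, the last exception is re-raised so alerting can take over.
    """
    restarts = 0
    while True:
        try:
            worker()
            return restarts
        except Exception:
            if restarts >= max_restarts:
                raise
            restarts += 1
            time.sleep(backoff_seconds)


def make_flaky_worker(failures: int) -> Callable[[], None]:
    """Hypothetical worker that crashes a fixed number of times, then succeeds."""
    remaining = {"count": failures}

    def worker() -> None:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise RuntimeError("simulated crash")

    return worker


if __name__ == "__main__":
    print("restarts needed:", supervise(make_flaky_worker(2)))
```

The worker itself contains no recovery logic. It simply fails, and the supervisor decides whether to restart or escalate, which keeps the individual components simple.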

Delivery Processes for New Solutions

  • Portfolio cleanup for solution components
    • Reducing the available job models to a handful of standard patterns (PHP jobs, COBOL jobs, Java jobs, API calls, etc.)
    • Replacing proprietary connectors (e.g., Automic RA Agents) with simple standard scripts, saving on licensing costs and simplifying the process
    • Replacing legacy integrations (e.g., Talend DI with Salesforce) with standard low-level tooling (e.g., Salesforce Data Loader, AWS API calls, small Go helper tools), saving on licensing costs and simplifying the process
    • Development of migration guides for existing solutions
    • Result: Reduced complexity, lower costs per solution
  • Development of a declarative domain-specific language (DSL) to describe standard use cases
    • e.g., job execution plans, job dependencies, file transfers
    • Embedded in YAML as a widely used carrier language
    • Result: Drastically shorter lead time for solution specification and customer coordination
    • Result: Elimination of vendor lock-in on Automic
    • Result: Solution definitions are readable for ordinary software engineers (without Automic expertise)
  • Development of a solution generator along with an associated CI/CD pipeline
    • Input: JAMLs (Job Automation Manifest Language Files)
    • Output: Automic XMLs as artifacts in an artifact repository
    • Result: Technical change failure rate = 0 (but: garbage in, garbage out), lead time for solution implementation close to 0
  • Implementation of structured onboarding for SFTP connections
    • Support through automated infrastructure tests (e.g., for firewall permissions)
    • Decoupling of SFTP connection configuration from dependent SFTP file transfers (as a separate work item)
    • Result: Acceleration of setting up new SFTP connections from weeks to hours
    • Result: Reduction of round trips between team and integration partner to 1
    • Result: Reduction of the failure rate
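
The path from a declarative manifest to a generated XML artifact can be sketched as follows. This is an illustration of the idea only: the manifest keys, the allowed job types, and the XML element names are hypothetical and represent neither the real JAML syntax nor the Automic XML schema. To stay within the standard library, the manifest is shown as the dictionary a YAML parser would produce.

```python
import xml.etree.ElementTree as ET

# Hypothetical manifest, as it might look after parsing a YAML file such as:
#   solution: partner-export
#   owner: team-billing
#   jobs:
#     - name: build-report
#       type: php
#     - name: push-report
#       type: sftp-transfer
#       after: [build-report]
MANIFEST = {
    "solution": "partner-export",
    "owner": "team-billing",
    "jobs": [
        {"name": "build-report", "type": "php"},
        {"name": "push-report", "type": "sftp-transfer", "after": ["build-report"]},
    ],
}

# A handful of standard patterns; the exact set is an assumption.
ALLOWED_TYPES = {"php", "cobol", "java", "api-call", "sftp-transfer"}


def validate(manifest: dict) -> list[str]:
    """Return a list of human-readable errors (empty if the manifest is valid)."""
    errors = []
    names = [job["name"] for job in manifest["jobs"]]
    if len(names) != len(set(names)):
        errors.append("duplicate job names")
    for job in manifest["jobs"]:
        if job["type"] not in ALLOWED_TYPES:
            errors.append(f"{job['name']}: unknown type {job['type']!r}")
        for dep in job.get("after", []):
            if dep not in names:
                errors.append(f"{job['name']}: unknown dependency {dep!r}")
    return errors


def generate_xml(manifest: dict) -> str:
    """Validate the manifest and render it as an XML document string."""
    errors = validate(manifest)
    if errors:
        raise ValueError("; ".join(errors))
    root = ET.Element("workflow", name=manifest["solution"], owner=manifest["owner"])
    for job in manifest["jobs"]:
        node = ET.SubElement(root, "job", name=job["name"], type=job["type"])
        for dep in job.get("after", []):
            ET.SubElement(node, "dependsOn", job=dep)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(generate_xml(MANIFEST))
```

Validating before generating is what makes a technical change failure rate near zero plausible: structurally invalid manifests are rejected in the pipeline, while a manifest that is valid but describes the wrong thing still passes (garbage in, garbage out).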

Technologies

  • UC4/Automic
  • Ansible
  • ServerSpec
  • GNU make
  • Clojure
  • d3.js
  • Java
  • Jackson YAML
  • JAXB
  • XML
  • COBOL
  • PHP
  • SAP
  • Salesforce
  • Talend DI
  • Nexus
  • Team Topologies
  • SRE
  • Tribal Leadership
  • OpsGenie
  • Jira
  • Confluence
  • Slack
