Misión Imposible: migración en 4 meses
Mission Impossible: migration in 4 months

Charla para Google Developer Group

Español

Misión Imposible: migración en 4 meses

Luego de más de 20 años de funcionamiento y a raíz de los cambios suscitados por la pandemia, la empresa decidió cambiar la modalidad de trabajo para siempre, quedando en formato remoto y dejando de ocupar oficinas y con ello un data center físico.

Este data center conectaba a la empresa con sus clientes en Chile y en otros lugares del continente, para el monitoreo y supervisión, así como la mantención de sus productos y servicios. Además, se mantenían los desarrollos y plataformas internas de la compañía.

Lo que fue una idea de venta de las oficinas en el tiempo, se materializó de manera muy rápida, y el plazo original de un año (desde que se comunicó la decisión), se acortó a 4 meses, sin mayor margen de error (nos quedaríamos sin las oficinas y sin sala de servidores).

Lideré entonces un análisis de alto nivel de todo lo necesario a migrar, llegando a la conclusión de que se trataba de un trabajo de alrededor de 200 máquinas y 700 proyectos.

Con este dato duro, realicé un estudio de mercado sobre cómo enfrentar el problema y ejecutar las tareas relacionadas, que en ese momento eran totalmente desconocidas para nosotros. Ya realizado el estudio de mercado de alto nivel, prácticamente todos los vendors cloud y sus partners me transmitieron plazos muy superiores a 4 meses para recién lograr tener un assessment, por lo que trabajar con alguno de ellos era inviable.

Debía entonces reducir de alguna manera esas 200 máquinas y 700 proyectos legados identificados originalmente. Más que analizar el detalle específico, era necesario tener certeza de todas las herramientas TI en uso dentro de la empresa a la fecha, considerando la manera en que funcionan y cómo se relacionan.

Para poder tener el nivel de certeza que necesitaba era necesario construir un HLD de todas las herramientas y plataformas asociadas. Con esto en mente, ya es más sencillo identificar qué soluciones migrar a la nube, cuáles migrar a algún servicio SaaS, y cuáles montar en algún hosting. Parte de este análisis, incluía revisar escenarios híbridos, posible uso de multinube, las integraciones necesarias, el acceso de usuarios, entre otros.

Siguiente paso, el LLD, con una clara topología de la red, las máquinas virtuales en específico y todos los elementos de nube necesarios.

Con esta información, ya me fue posible cotizar los servicios en diferentes vendors cloud, como AWS, Azure, Huawei Cloud y GCP. Revisar posibles soluciones en formato SaaS, como GitLab, Influx, Grafana, Alfresco, entre otros. Así también, soluciones propietarias como la Intranet en el hosting del dominio comercial. Y tomar en cuenta el respaldar varias máquinas.

Algunas soluciones las enfrenté de manera híbrida, como el uso de GitLab, en que además de su solución SaaS, se montaron en cloud recursos para la ejecución de los runners y RPMs. Esto fue esencialmente porque hice un piloto de uso de GitLab en donde resultó que los runners que se proveen son limitados para la descarga de diferentes archivos. Este análisis me llevó a la conclusión de que convenía mantener estas partes de la solución fuera de GitLab. Fue necesario generar un GitLab Runner para garantizar la correcta descarga de todo lo necesario para cada pipeline.

Dados los cambios necesarios de conectividad, las VPN de los clientes ahora debían apuntar al cloud. Monté varias pruebas previas en que se analizaron los tiempos de conexión, latencia y pérdida de paquetes, ajustando la configuración de manera efectiva. Esto permitió definir la mejor configuración para la gran mayoría de las conectividades necesarias, para luego desarrollar esta tarea con cada uno de los clientes. Trabajé liderando el correcto uso de las VPN de los usuarios además de las LAN to LAN (L2L).

Algunos problemas con los que me enfrenté al momento de migrar GitLab fueron los siguientes:

Diferentes versiones de Gitlab on-premise y SaaS: generó problemas tener versiones diferentes en la migración de los proyectos, al no ser compatibles y no existir mucha documentación para la versión on-premise (que ya tenía varios años) fue necesario comenzar la migración de proyectos uno por uno usando prueba y error.
Cantidad de proyectos: al tener una gran cantidad de proyectos el trabajo de migrar uno por uno era imposible. Se debe analizar la criticidad de los proyectos a migrar y cuáles sólo respaldar.
La migración se debe hacer exponiendo la IP de Gitlab on-premise al exterior: es riesgoso exponer toda la información de Gitlab on-premise al exterior con riesgo a tener un agujero de seguridad.
Es necesario mantener la continuidad de los desarrollos: no es posible sacar una foto de los desarrollos, migrar y luego empezar en el sitio nuevo. Es necesario hacer todo on the fly.

Finalmente logré generar un script que permitió migrar por grupo de proyectos, pero aparece un nuevo problema asociado a los permisos de cada grupo. Para enfrentar esto, se generó una rama de directorios permitiendo acceder a los diferentes proyectos. Esto fue crucial para tener funcionando las pipeline.

Una vez que ya logré tener el escenario controlado, se migró cada grupo a su nueva URL, teniendo que reorganizar todas las pipelines para su correcto desempeño.

Creé el clúster de GitLab Runner, partiendo por una infraestructura con capacidad estimada de prueba. Con más del 70% de los proyectos y sus pipelines funcionando, realicé los ajustes necesarios a los recursos para el correcto funcionamiento de cada una de las pipelines en concurrencia (por diferentes proyectos en paralelo).

Se modificaron las capacidades de CPU y RAM, además de cambiar el disco de HDD a SSD para mejorar la velocidad de escritura y descarga de los contenedores.

Para la mantención del clúster Gitlab-Runner se generó un script que limpia las imágenes y contenedores docker que puedan quedar en el disco o memoria que ya no se utilizan.

A un año de la migración, las estimaciones de costo mensual han sido bastante certeras. No ha sido necesario hacer grandes adecuaciones a las definiciones iniciales, principalmente porque todas estas partieron de definiciones de prueba que luego se adecuaron a los escenarios de carga reales.

El análisis de qué migrar y qué respaldar funcionó perfecto. En el trabajo diario no se han necesitado los proyectos respaldados. En un plazo de 1 año, sólo se ha tenido que recurrir a un respaldo y se ha recuperado sin inconvenientes.

English

Mission Impossible: migration in 4 months

After more than 20 years of operation, and as a result of the changes brought about by the pandemic, the company decided to change its way of working permanently: it would stay fully remote and give up its offices and, with them, its physical data center.

This data center connected the company with its clients in Chile and other parts of the continent for monitoring, supervision, and maintenance of its products and services. It also hosted the company's development projects and internal platforms.

What began as an idea to sell the offices at some point materialized very quickly, and the original deadline of one year (counted from when the decision was announced) was cut to 4 months, with little margin for error (we would be left without the offices and without a server room).

I then led a high-level analysis of everything that needed to be migrated, concluding that it was a job involving around 200 machines and 700 projects.

With this hard figure in hand, I ran a market study on how to approach the problem and carry out the related tasks, which at that point were completely unknown to us. The high-level study showed that practically all the cloud vendors and their partners quoted time frames well beyond four months just to produce an assessment, which made working with any of them unfeasible.

I therefore had to find a way to reduce the 200 machines and 700 legacy projects originally identified. Rather than analyzing every detail, I needed certainty about all the IT tools in use within the company at that time, how they worked, and how they related to one another.

To reach the level of certainty I needed, I had to build an HLD (high-level design) of all the tools and their associated platforms. With that in hand, it became easier to identify which solutions to migrate to the cloud, which to move to a SaaS service, and which to place on a hosting provider. Part of this analysis included reviewing hybrid scenarios, possible multi-cloud use, the necessary integrations, and user access, among other things.

The next step was the LLD (low-level design), with a clear network topology, the specific virtual machines, and all the required cloud components.

With this information, I was able to get quotes from different cloud vendors, such as AWS, Azure, Huawei Cloud, and GCP; to review possible SaaS solutions, such as GitLab, Influx, Grafana, and Alfresco, among others; and to consider options for proprietary solutions, such as moving the Intranet to the hosting of the commercial domain. I also took into account the need to back up several machines.

I approached some solutions in a hybrid way. For GitLab, in addition to the SaaS offering, cloud resources were set up to run the runners and the RPMs. The main reason was that a pilot I ran on GitLab showed that the provided runners are limited when it comes to downloading various files. This led me to conclude that it was better to keep these parts of the solution outside GitLab. A dedicated GitLab Runner had to be set up to guarantee that everything each pipeline needed could be downloaded correctly.

Because of the required connectivity changes, customer VPNs now had to point to the cloud. I set up several preliminary tests that measured connection times, latency, and packet loss, and tuned the configuration accordingly. This made it possible to settle on the best configuration for the great majority of the required connections, and then to roll it out with each customer. I led the work on the correct use of the user VPNs as well as the LAN-to-LAN (L2L) ones.
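The comparison of candidate VPN configurations can be illustrated with a minimal Python sketch. It is not the tooling used in the project: the names `LinkStats`, `summarize_probes`, and `best_candidate` are hypothetical, and it assumes each test run yields a list of round-trip times in milliseconds, with `None` marking a lost probe.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Mapping, Optional, Sequence


@dataclass(frozen=True)
class LinkStats:
    """Summary of one test run against a candidate VPN configuration."""
    sent: int
    received: int
    loss_pct: float
    avg_ms: Optional[float]
    max_ms: Optional[float]


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> LinkStats:
    """Summarize probe results; a None entry marks a lost probe (assumption)."""
    sent = len(rtts_ms)
    answered = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (sent - len(answered)) / sent if sent else 0.0
    return LinkStats(
        sent=sent,
        received=len(answered),
        loss_pct=loss_pct,
        avg_ms=mean(answered) if answered else None,
        max_ms=max(answered) if answered else None,
    )


def best_candidate(results: Mapping[str, LinkStats]) -> str:
    """Pick the configuration with the lowest loss, then the lowest average latency."""
    def score(name: str):
        stats = results[name]
        avg = stats.avg_ms if stats.avg_ms is not None else float("inf")
        return (stats.loss_pct, avg)

    return min(results, key=score)
```

Ranking by packet loss first and latency second is an illustrative choice; a real evaluation would also weigh connection setup time, which this sketch does not model.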

Some of the problems I faced when migrating GitLab were the following:

Different GitLab versions on-premises and in SaaS: the version mismatch caused problems when migrating projects. Because the versions were not compatible and there was little documentation for the on-premises version (which was already several years old), the migration had to start project by project, by trial and error.

Number of projects: with such a large number of projects, migrating them one by one was impossible. The criticality of the projects had to be analyzed to decide which to migrate and which to only back up.

The migration requires exposing the on-premises GitLab IP to the outside: exposing all the information in the on-premises GitLab externally is risky, as it could open a security hole.

Development continuity had to be maintained: it was not possible to take a snapshot of the developments, migrate, and then start over on the new site. Everything had to be done on the fly.

I finally managed to write a script that migrated projects group by group, but a new problem appeared, related to the permissions of each group. To deal with it, a directory tree was created that gave access to the different projects. This was crucial to get the pipelines working.
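The original script is not reproduced here. As a rough illustration of the planning step behind a group-by-group migration, the following Python sketch splits an inventory into projects to migrate and projects to only back up, per group. The `Project` fields and the `critical` flag are assumptions made for the example, not the real inventory format.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Project:
    group: str      # hypothetical: group path on the on-premises server
    path: str       # hypothetical: project path inside the group
    critical: bool  # hypothetical triage flag: migrate if True, else back up only


def plan_by_group(projects: Iterable[Project]) -> Dict[str, Dict[str, List[str]]]:
    """Build a per-group plan with 'migrate' and 'backup_only' project lists."""
    plan: Dict[str, Dict[str, List[str]]] = defaultdict(
        lambda: {"migrate": [], "backup_only": []}
    )
    for project in projects:
        bucket = "migrate" if project.critical else "backup_only"
        plan[project.group][bucket].append(project.path)
    # Sort groups and paths so the plan is stable between runs.
    return {
        group: {bucket: sorted(paths) for bucket, paths in buckets.items()}
        for group, buckets in sorted(plan.items())
    }
```

Each entry under `migrate` would then be handed to whatever transfer mechanism the two GitLab versions support, one group at a time, which keeps permission problems confined to the group being moved.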

Once I had the situation under control, each group was migrated to its new URL, which meant reorganizing all the pipelines so they would run correctly.

I created the GitLab Runner cluster, starting with infrastructure sized at an estimated trial capacity. Once more than 70% of the projects and their pipelines were working, I adjusted the resources so that the pipelines would run correctly when executed concurrently (different projects in parallel).

I changed the CPU and RAM capacities and switched the disk from HDD to SSD to improve write speed and container download times.

To maintain the GitLab Runner cluster, a script was written that cleans up Docker images and containers that are no longer in use but may still be left on disk or in memory.
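A cleanup of this kind could look like the following Python sketch. It is an assumption about the approach, not the script actually used: it shells out to the standard `docker container prune` and `docker image prune` commands, and the 24-hour age threshold and the function names are illustrative.

```python
import subprocess
from typing import Callable, List


def build_cleanup_commands(max_age_hours: int = 24) -> List[List[str]]:
    """Return the Docker prune commands for resources older than max_age_hours."""
    until = f"until={max_age_hours}h"
    return [
        # Remove stopped containers created before the cutoff.
        ["docker", "container", "prune", "--force", "--filter", until],
        # Remove images created before the cutoff that no container uses.
        ["docker", "image", "prune", "--all", "--force", "--filter", until],
    ]


def run_cleanup(max_age_hours: int = 24, runner: Callable = subprocess.run) -> None:
    """Run each prune command; raises if a command exits with an error."""
    for command in build_cleanup_commands(max_age_hours):
        runner(command, check=True)


if __name__ == "__main__":
    run_cleanup()
```

Containers are pruned before images so that images held only by stopped containers become eligible for removal. Scheduling the script, for example from cron on each runner host, is left out of the sketch.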

One year after the migration, the monthly cost estimates have proved quite accurate.

It has not been necessary to make significant adjustments to the initial definitions, mainly because they were based on test definitions later adapted to real load scenarios.

The analysis of what to migrate and what to back up worked perfectly. In daily work, the backed-up projects have not been needed. In one year, we have had to resort to a backup only once, and it was recovered without problems.
