How to build a Scalable Open Source Data Platform - Sopht - New Architecture

Sopht Data Platform Implementation - New Architecture

Published on : April 9, 2025

|   Lastly edited on : April 9, 2025

|   10 minutes read

Lirav DUVSHANI

Lirav DUVSHANI


Project Details

Sopht

Sopht - A Green ITOps Solution

Sopht


Sopht is a French start-up building a unique Green ITOps data-driven solution to lead environmental and financial performance with automated decarbonization recommendations & guided actionability. Here's how the solution works :

  • Automated data collection : Automated and streamlined data collection across the entire IT value chain covering SaaS, hardware, usage, Cloud & On-Prem, networks, web, and more.
  • Granular & dynamic analysis : Real-time analysis of Scope 2 and 3 emissions with methodology based on LCA, GHG Protocol, ADEME, ISO...
  • Suggestions & simulations : Recommendations from the solution and simulation of the impact of the decarbonization levers.
  • Distributed steering : Capacity to build, assign and track actions plans and closely monitor the real-time impact of these actions on the decarbonization trajectory.

As you can imagine, Data is at the core of the software provided by Sopht.

Context

At the start of 2024, after a successful seed round, Sopht needed to ensure the scalability of their solution to ensure both the increase in the customer base (increase in number of customers) but also in the features provided to customers.


In this context, we audited the current state of the software architecture, mainly its data architecture and proposed a target architecture including key improvements. Our evaluation identified potential bottlenecks, inefficiencies, and areas for optimization to enhance scalability and performance.


Based on our findings, we provided strategic recommendations to align the architecture to the future business and technical needs.

The Sopht initial calculation core module

At the heart of Sopht, there is a calculation module which has all the standard features of a data platform : data collection, transformation and monitoring.

The calculation "core" module was built as a monolyte for the entirety of the transformation (cleaning, modelisation, transformation, preparation for reporting).

Initial Architecture

Tools of this initial version

The initial build of the calculation "core" module was created using :

  • MongoDB for data storage
  • PostgreSQL for data storage and compute of calculation
  • Kotlin for customer data collection (Bronze layer)
  • Python for data processing in the calculation layer
  • Angular for the Front End plugged to the PostgreSQL database
  • Docker for application packaging
  • Kubernetes Cosmos on Scaleway for container orchestration
  • Terraform for automating and managing the cloud infrastructure
  • Git for the SCM of the data pipelines and CI/CD process definition
  • Flyway for the deployment of PostgreSQL

Initial Architecture

The limitation of this version

ThemeLimitation
MonitoringNo dedicated scheduling tool was implemented, making the monitoring of the data pipelines very complex
DevelopmentNo standard structure of code was defined, creating complexity on the readability of the code and its maintenance
ScalabilityMany performance issues were present due to entanglement of data flow loading
Activating new customers was also complex, as no independent process was available.
TestingNo standard for unit testing was implemented, leaving tests solely to the Product team during functional testing
Only 2 environments available : Dev and Production, where only production had customer data, which implied many features to be tested directly in Production
ArchitectureData Cleaning, Business rules and internal parametrization were all done in the same layer without clear standard
Root Cause AnalysisRoot cause analysis was highly complex due to the lack of a standardized architecture

As the most important part of the value proposition of Sopht, a more robust scalable calculation module was to be built, to enable the expected increase in customers and features

A new scalable Data Platform but not only that...

The audit of the initial version revealed some of the pain points of the architecture and a new Data Platform was to be designed.


This new platform was to take into account the scalability and stability part as well as many other requirements presented below :

  • Stability of the platform : Ensure the reliability of the data process, minimizing downtime and disruptions
  • Scalability of the platform : Accommodate increase in data volume, in data compute needs without compromising performances
  • Usage of well-known and tested tools : Enhance the reliability and maintainability of the tools used, to limit the risks and simplify integration
  • Leveraging open-source whenever possible : Prioritize open-source technologies for their flexibility, strong community support and cost efficiency, also reducing dependency on proprietary vendors
  • Implementation of a data governance : Empower internal data stewards and define clear responsibilities upon the data
  • Anticipate the changes and communicate them : Provide the possibility to anticipate all changes done and communicate to customers with confidence
  • Standardisation of data pipeline development and architecture : Facilitate design, implementation and maintenance of data pipelines, reducing technical debt and accelerating development.
    • Implementation of best practices : Define standard naming conventions and clear separation between each data processing layer
    • Have customer specific scopes : Ensure the independence of each customer data and flow execution
    • Ensure the correctness of the data calculated : Ensure the extensive testing of the solution; through unit testing and functional testing
    • Enable data integration for structure and semi-structured data : Provide the adequate flexibility to the data platform to enable different types of format enabled in the data integration

The proposed Data Architecture

Due to the complexity of the different steps, having a single place where every data process occurred didn't match the requirements of flexibility, stability and scalability.

A medallion style architecture was proposed to ensure a clear structuring of the project, efficiency in development and root cause analysis, as well as reliability.

Target Architecture

The purpose of each layers is available in the dedicated architecture post.

The tools of the proposed Data Architecture

In this new data architecture, the tools proposed were (in bold the changes) :

  • S3 for data storage in the bronze and silver layer, using Parquet files
  • PostgreSQL for data storage and compute of calculation in the gold and platinum layer
  • Kotlin for customer data collection (Bronze layer)
  • Python and DuckDB for data processing in all layers, except Bronze (New Standard Framework for data processing in both Python and DuckDB)
  • Angular for the Front End plugged to the PostgreSQL database
  • Docker for application packaging
  • Kubernetes Cosmos on Scaleway for container orchestration
  • Terraform for automating and managing the cloud infrastructure (Changes to the deployments to be customer specific)
  • Kestra for orchestration and monitoring
  • Git for the SCM of the data pipelines and CI/CD process definition
  • Git for parametrization layer containing all the business internal parametrization
  • Flyway for the deployment of PostgreSQL

Target Architecture with tools


Dedicated articles will be available in the near future to detail the choice of these tools.

What about the data features lifecycle ?

A new strategy for environments was designed to maintain stability while allowing for flexible development.


The target proposed includes 3 environments : Development, Staging and Production. Each with a different purpose :

  • Development : Sandbox intended for developers to build, test and iterate on features
  • Staging : Pre-production environment that closely mirrors production, used for testing of new features and bug correction
  • Prod : Customer facing environment with all the validated features

Here is a detailed list of features for each environment:


FeatureDevStagingProd
Code Development✅ Yes❌ No❌ No
Testing (Unit)✅ Yes❌ No❌ No
Testing (Performance)❌ No✅ Yes❌ No
Testing (Non-regression)❌ No✅ Yes❌ No
Testing (Functional)❌ No✅ Yes❌ No
DataPartialCompleteComplete
Automatic Scheduling❌ No✅ Yes✅ Yes
Customer facing❌ No❌ No✅ Yes



This article is part of a series showcasing the design and implementation of a scalable open source Data Platform for Sopht, a French Green ITOps start-up.