Teams Data Pipeline

The Teams Data Pipeline manages and processes team-related data. It extracts data from the source database (MongoDB), transforms it to match the target schema, and loads it into the target database (MySQL). Validation and transformation steps keep the loaded data consistent and complete. Key tasks include:

  • Validating data fields for accuracy and ensuring required fields are present.

  • Mapping team and member fields to the target schema.

  • Handling data updates and upserts to maintain data integrity and avoid duplication.
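The upsert step in the list above can be illustrated with MySQL's INSERT ... ON DUPLICATE KEY UPDATE statement. This is a minimal sketch, not the pipeline's actual implementation: the helper build_upsert_sql and the table and column names are hypothetical, and the statement only behaves as an upsert if the key columns are covered by a PRIMARY KEY or UNIQUE index.

```python
from typing import Sequence


def build_upsert_sql(
    table: str, columns: Sequence[str], key_columns: Sequence[str]
) -> str:
    """Build a MySQL INSERT ... ON DUPLICATE KEY UPDATE statement.

    Uses %s placeholders, the parameter style expected by PyMySQL.
    """
    non_key = [c for c in columns if c not in key_columns]
    if not non_key:
        raise ValueError("at least one non-key column is needed for the update part")
    column_list = ", ".join(f"`{c}`" for c in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    # Only non-key columns are overwritten when the row already exists.
    updates = ", ".join(f"`{c}` = VALUES(`{c}`)" for c in non_key)
    return (
        f"INSERT INTO `{table}` ({column_list}) VALUES ({placeholders}) "
        f"ON DUPLICATE KEY UPDATE {updates}"
    )


# Hypothetical table and column names, for illustration only.
upsert_sql = build_upsert_sql(
    "teams", ["team_id", "team_name", "updated_at"], ["team_id"]
)
print(upsert_sql)
```

Running the same statement twice with the same team_id updates the existing row instead of inserting a duplicate, which is the behavior the upsert task describes.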

The Teams Data Pipeline consists of two parallel pipelines, each responsible for processing a specific subset of the data:

  1. Teams Pipeline:

    • Handles the transformation and processing of team-related data.

    • Ensures that team details and other relevant fields are aligned with the target schema.

    • Responsible for maintaining the integrity of team-level data.

  2. Team Members Pipeline:

    • Processes data related to individual team members.

    • Transforms fields such as team_id, username, userId, and type to match the target schema.

    • Ensures accurate mapping between team members and their respective teams.

These pipelines operate independently but are orchestrated together for cohesive data handling.
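As an illustration of the Team Members Pipeline's validate-and-map step, the sketch below checks the four source fields named above and renames them for the target table. The function map_team_member, the target column names, and the rule that all four fields are required are assumptions made for this example; the real logic lives in the pipeline's own transformation functions.

```python
from typing import Any, Dict, Optional

# Source field -> target column. The source names come from the text above;
# the target column names are hypothetical.
FIELD_MAP = {
    "team_id": "team_id",
    "username": "username",
    "userId": "user_id",
    "type": "member_type",
}


def map_team_member(doc: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return a target-schema row, or None when a required field is missing."""
    missing = [field for field in FIELD_MAP if doc.get(field) in (None, "")]
    if missing:
        # Assumption: incomplete documents are skipped rather than loaded.
        return None
    return {target: doc[source] for source, target in FIELD_MAP.items()}


valid_row = map_team_member(
    {"team_id": "t1", "username": "alice", "userId": "u1", "type": "agent"}
)
skipped_row = map_team_member({"username": "bob"})
print(valid_row)
print(skipped_row)
```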


The Teams Data Pipeline is configured through the teams_data_pipeline_config.yaml file. The values shown below are intended for typical deployments, and using them as-is is recommended.

source:
  type: "mongodb"
  connection_string: "mongodb://root:Expertflow123@mongo-mongodb.ef-external.svc:27017/?authSource=admin&tls=true&tlsAllowInvalidHostnames=true"
  # connection string for replica support
  # connection_string: "mongodb://root:Expertflow123@mongo-mongodb-0.mongo-mongodb-headless.ef-external.svc.cluster.local:27017,mongo-mongodb-1.mongo-mongodb-headless.ef-external.svc.cluster.local:27017,mongo-mongodb-2.mongo-mongodb-headless.ef-external.svc.cluster.local:27017/?authSource=admin&tls=true&tlsAllowInvalidHostnames=true"

  replica_set_enabled: false
  replica_set: "expertflow"
  read_preference: "secondaryPreferred"
  queries:
    teams:
      database: "adminPanel"
      collection_name: "teams"
      filter: {}
      replication_key: "updatedAt"
      transformation: "transform_teams_data"
      num_batches: 50
    team_members:
      database: "adminPanel"
      collection_name: "teammembers"
      filter: {}
      replication_key: "updatedAt"
      transformation: "transform_team_members"
      num_batches: 50
  # TLS/SSL Configuration
  tls: true  # Set to false if you don't want to use TLS
  tls_ca_file: "/transflux/certificates/mongo_certs/mongodb-ca-cert"
  tls_cert_key_file: "/transflux/certificates/mongo_certs/client-pem"  # Includes both client certificate and private key

batch_size: 30000 # Adjust as needed

target:
  type: "mysql"
  db_url: "mysql+pymysql://elonmusk:68i3nj7t@"
  enable_ssl: true  # Enable or disable SSL connections
  ssl_ca: "/transflux/certificates/mysql_certs/ca.pem"
  ssl_cert: "/transflux/certificates/mysql_certs/client-cert.pem"
  ssl_key: "/transflux/certificates/mysql_certs/client-key.pem"

configdb:
  type: "mysql"
  db_url: "mysql+pymysql://elonmusk:68i3nj7t@"
  enable_ssl: true  # Enable or disable SSL connections
  ssl_ca: "/transflux/certificates/mysql_certs/ca.pem"
  ssl_cert: "/transflux/certificates/mysql_certs/client-cert.pem"
  ssl_key: "/transflux/certificates/mysql_certs/client-key.pem"

schedule_interval: "0 */5 * * *"

Source Configuration

This section defines the MongoDB source settings for data extraction.

  • type: Specifies the data source type.
    Example: "mongodb" indicates that MongoDB is the source.

  • connection_string:
    The connection string for MongoDB. It includes the following components:

    • username and password for authentication.

    • MongoDB host and port.

    • Optional parameters like authSource, tls, and tlsAllowInvalidHostnames.

    Example (non-replica):

    connection_string: "mongodb://root:Expertflow123@mongo-mongodb.ef-external.svc:27017/?authSource=admin&tls=true&tlsAllowInvalidHostnames=true"

    Example (replica set support):

    connection_string: "mongodb://root:Expertflow123@mongo-mongodb-0.mongo-mongodb-headless.ef-external.svc.cluster.local:27017,mongo-mongodb-1.mongo-mongodb-headless.ef-external.svc.cluster.local:27017,mongo-mongodb-2.mongo-mongodb-headless.ef-external.svc.cluster.local:27017/?authSource=admin&tls=true&tlsAllowInvalidHostnames=true"

  • replica_set_enabled:
    Indicates whether MongoDB is accessed as a replica set. When set to true, use the replica-set form of the connection string.
    Example: false

  • replica_set:
    Specifies the name of the replica set if enabled.
    Example: "expertflow"

  • read_preference:
    Defines the read preference for MongoDB.
    Example: "secondaryPreferred" reads from secondary members when one is available and falls back to the primary otherwise.

  • queries:
    A dictionary containing query configurations for different pipelines.

    • teams: Configurations for the teams pipeline.

      • database: Name of the MongoDB database from where teams data is being extracted. Example: "adminPanel".

      • collection_name: Name of the MongoDB collection. Example: "teams".

      • filter: Query filter applied to fetch data. Example: {} (fetch all records).

      • replication_key: Field used to track updates. Example: "updatedAt".

      • transformation: Transformation function name. Example: "transform_teams_data".

      • num_batches: Number of data batches. Example: 50.

      • query_keys: Reserved for gold queries (used to load data into the gold table, if needed).

    • team_members: Configurations for the team members pipeline.

      • Same fields as the teams pipeline, applied to the teammembers collection with the transform_team_members transformation.

  • TLS/SSL Configuration:
    Enables secure communication with MongoDB.

    • tls: Set to true to enable TLS. Example: true.

    • tls_ca_file: Path to the CA certificate file.
      Example: "/transflux/certificates/mongo_certs/mongodb-ca-cert".

    • tls_cert_key_file: Path to a single file that contains both the client certificate and its private key.
      Example: "/transflux/certificates/mongo_certs/client-pem".
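To show how the source settings above might be consumed, the sketch below turns them into keyword arguments for pymongo's MongoClient and extends the configured filter with the replication key. It is a sketch under stated assumptions rather than the pipeline's code: the helper names are hypothetical, passing the replica-set options only when replica_set_enabled is true is an assumed rule, and using a strict "greater than the last synced value" comparison on the replication key is one common way to implement incremental extraction.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_client_kwargs(source: Dict[str, Any]) -> Dict[str, Any]:
    """Map the source settings onto pymongo MongoClient keyword arguments."""
    kwargs: Dict[str, Any] = {}
    # Assumption: replica-set options are only passed when the flag is true.
    if source.get("replica_set_enabled"):
        kwargs["replicaSet"] = source["replica_set"]
        kwargs["readPreference"] = source["read_preference"]
    if source.get("tls"):
        kwargs["tls"] = True
        kwargs["tlsCAFile"] = source["tls_ca_file"]
        kwargs["tlsCertificateKeyFile"] = source["tls_cert_key_file"]
    return kwargs


def build_incremental_filter(
    base_filter: Dict[str, Any], replication_key: str, last_value: Optional[Any]
) -> Dict[str, Any]:
    """Extend the configured filter so only newer documents are fetched."""
    query = dict(base_filter)  # copy, so the configured filter is not mutated
    if last_value is not None:
        query[replication_key] = {"$gt": last_value}
    return query


source = {
    "replica_set_enabled": False,
    "replica_set": "expertflow",
    "read_preference": "secondaryPreferred",
    "tls": True,
    "tls_ca_file": "/transflux/certificates/mongo_certs/mongodb-ca-cert",
    "tls_cert_key_file": "/transflux/certificates/mongo_certs/client-pem",
}
last_sync = datetime(2024, 1, 1, tzinfo=timezone.utc)  # placeholder bookmark value

client_kwargs = build_client_kwargs(source)
teams_filter = build_incremental_filter({}, "updatedAt", last_sync)
print(client_kwargs)
print(teams_filter)

# With pymongo installed, the values would be used roughly like this:
#   client = pymongo.MongoClient(connection_string, **client_kwargs)
#   cursor = client["adminPanel"]["teams"].find(teams_filter)
```

On a first run, when no previous value of the replication key is stored, passing None leaves the configured filter unchanged, so {} still fetches all records.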

Batch Size

  • batch_size:
    Number of records processed per batch.
    Example: 30000.
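The effect of batch_size can be pictured as splitting the extracted records into fixed-size chunks, with a smaller final chunk for the remainder. The generator below is an illustrative sketch; the name iter_batches is hypothetical and the pipeline's own batching code may differ.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 70,000 records with batch_size 30000 give two full batches and one partial batch.
batch_sizes = [len(batch) for batch in iter_batches(range(70000), 30000)]
print(batch_sizes)
```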

Target Configuration

This section defines the target MySQL database settings for data loading.

  • type: Specifies the target database type.
    Example: "mysql".

  • db_url: Connection string for the target MySQL database.
    Format: "mysql+pymysql://<username>:<password>@<host>:<port>/<database>".
    Example: "mysql+pymysql://elonmusk:68i3nj7t@".

  • enable_ssl: Enables SSL communication with the MySQL database.
    Example: true.

  • SSL Configuration:

    • ssl_ca: Path to the CA certificate. Example: "/transflux/certificates/mysql_certs/ca.pem".

    • ssl_cert: Path to the client certificate. Example: "/transflux/certificates/mysql_certs/client-cert.pem".

    • ssl_key: Path to the client private key. Example: "/transflux/certificates/mysql_certs/client-key.pem".
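The SSL settings above can be passed to the MySQL driver as connection arguments. The sketch below builds the ssl dictionary that PyMySQL accepts (keys ca, cert, and key). The helper name build_mysql_connect_args is hypothetical, and handing the result to SQLAlchemy through connect_args is an assumption about how the pipeline connects, based on the mysql+pymysql URL scheme.

```python
from typing import Any, Dict


def build_mysql_connect_args(target: Dict[str, Any]) -> Dict[str, Any]:
    """Return driver-level connection arguments for the target settings."""
    if not target.get("enable_ssl"):
        return {}
    return {
        "ssl": {
            "ca": target["ssl_ca"],
            "cert": target["ssl_cert"],
            "key": target["ssl_key"],
        }
    }


target = {
    "enable_ssl": True,
    "ssl_ca": "/transflux/certificates/mysql_certs/ca.pem",
    "ssl_cert": "/transflux/certificates/mysql_certs/client-cert.pem",
    "ssl_key": "/transflux/certificates/mysql_certs/client-key.pem",
}
connect_args = build_mysql_connect_args(target)
print(connect_args)

# With SQLAlchemy and PyMySQL installed, the arguments would be used roughly like this:
#   engine = sqlalchemy.create_engine(db_url, connect_args=connect_args)
```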

Config Database

The configuration database (configdb) stores metadata and operational settings for Airflow.

  • Fields are identical to the target configuration.

Schedule Interval

  • schedule_interval:
    Cron expression defining the pipeline's schedule in Airflow.
    Example: "0 */5 * * *" (runs at minute 0 of every fifth hour of the day: 00:00, 05:00, 10:00, 15:00, and 20:00).
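To make the run times of this cron expression concrete, the sketch below expands its minute and hour fields. It is a deliberately small illustration: expand_cron_field is a hypothetical helper that handles only "*", "*/n", and single numbers, not the full cron syntax that Airflow supports.

```python
from typing import List


def expand_cron_field(field: str, minimum: int, maximum: int) -> List[int]:
    """Expand '*', '*/n', or a single number into the matching values."""
    if field == "*":
        return list(range(minimum, maximum + 1))
    if field.startswith("*/"):
        step = int(field[2:])
        # A step counts from the minimum of the range, so */5 on hours starts at 0.
        return list(range(minimum, maximum + 1, step))
    return [int(field)]


minute_field, hour_field, *_ = "0 */5 * * *".split()
run_times = [
    f"{hour:02d}:{minute:02d}"
    for hour in expand_cron_field(hour_field, 0, 23)
    for minute in expand_cron_field(minute_field, 0, 59)
]
print(run_times)
```

Because the step restarts at midnight, the interval between the 20:00 run and the next 00:00 run is four hours rather than five.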

