Apache Airflow Integrations in 2026

DataHub

$75,000

DataHub is a versatile open-source metadata platform crafted to enhance data discovery, observability, and governance within various data environments. It empowers organizations to easily find reliable data, providing customized experiences for users while avoiding disruptions through precise lineage tracking at both the cross-platform and column levels. By offering a holistic view of business, operational, and technical contexts, DataHub instills trust in your data repository. The platform features automated data quality assessments along with AI-driven anomaly detection, alerting teams to emerging issues and consolidating incident management. With comprehensive lineage information, documentation, and ownership details, DataHub streamlines the resolution of problems. Furthermore, it automates governance processes by classifying evolving assets, significantly reducing manual effort with GenAI documentation, AI-based classification, and intelligent propagation mechanisms. Additionally, DataHub's flexible architecture accommodates more than 70 native integrations, making it a robust choice for organizations seeking to optimize their data ecosystems. This makes it an invaluable tool for any organization looking to enhance their data management capabilities.

Stonebranch

150 Ratings

Stonebranch’s Universal Automation Center (UAC) is a Hybrid IT automation platform, offering real-time management of tasks and processes within hybrid IT settings, encompassing both on-premises and cloud environments. As a versatile software platform, UAC streamlines and coordinates your IT and business operations, while ensuring the secure administration of file transfers and centralizing IT job scheduling and automation solutions. Powered by event-driven automation technology, UAC empowers you to achieve instantaneous automation throughout your entire hybrid IT landscape. Enjoy real-time hybrid IT automation for diverse environments, including cloud, mainframe, distributed, and hybrid setups. Experience the convenience of Managed File Transfers (MFT) automation, effortlessly managing and orchestrating file transfers between mainframes and systems, seamlessly connecting with AWS or Azure cloud services.

DataBuck

FirstEigen

6 Ratings

Big Data quality must always be verified to ensure that data is safe, accurate, and complete as it moves through multiple IT platforms or is stored in data lakes. The Big Data challenge: data often loses its trustworthiness because of (i) undiscovered errors in incoming data, (ii) multiple data sources that fall out of sync over time, (iii) structural changes to data in upstream processes that are not expected downstream, and (iv) the use of multiple IT platforms (Hadoop, DW, Cloud). Unexpected errors can occur when data moves between systems, such as from a Data Warehouse to a Hadoop environment, NoSQL database, or the Cloud. Data can change unexpectedly due to poor processes, ad-hoc data policies, poor data storage and control, and lack of control over certain data sources (e.g., external providers). DataBuck is an autonomous, self-learning Big Data quality validation and data matching tool.

Coursebox AI

Coursebox

$99 per month

79 Ratings

Empower your content transformation with Coursebox, the leading AI-driven eLearning authoring tool. Our platform streamlines the course development process, enabling you to create a well-structured course in a matter of seconds. Once the foundation is set, you can easily refine the content and add any final touches before it's ready for deployment. Whether you're looking to distribute your course privately, sell it to a broader audience, or integrate it into your existing LMS, Coursebox makes it effortless. Designed with a mobile-first approach, Coursebox ensures that your learners stay engaged and motivated through rich, interactive content—complete with videos, quizzes, and other dynamic elements. Leverage our branded learning management system, featuring native mobile apps, to deliver a seamless learning experience. With options for custom hosting and domain personalization, Coursebox offers flexibility to meet your specific needs. Ideal for both organizations and individual educators, Coursebox simplifies the management and segmentation of learners, allowing you to craft personalized learning paths and scale your training programs quickly and efficiently.

Netdata

Netdata, Inc.

Free

20 Ratings

Monitor your servers, containers, and applications in high resolution and in real time. Netdata collects metrics per second and presents them in beautiful low-latency dashboards. It is designed to run on all of your physical and virtual servers, cloud deployments, Kubernetes clusters, and edge/IoT devices, to monitor your systems, containers, and applications. It scales nicely from just a single server to thousands of servers, even in complex multi/mixed/hybrid cloud environments, and given enough disk space it can keep your metrics for years. KEY FEATURES: collects metrics from 800+ integrations; real-time, low-latency, high-resolution; unsupervised anomaly detection; powerful visualization; out-of-the-box alerts; systemd journal logs explorer; low maintenance; open and extensible. Troubleshoot slowdowns and anomalies in your infrastructure with thousands of per-second metrics, meaningful visualisations, and insightful health alarms with zero configuration. Netdata is different: real-time data collection and visualization, scalability built into its design, and a flexible, extremely modular architecture. It is immediately available for troubleshooting, requiring zero prior knowledge and preparation.

Sifflet

2 Ratings

Effortlessly monitor thousands of tables through machine learning-driven anomaly detection alongside a suite of over 50 tailored metrics. Ensure comprehensive oversight of both data and metadata while meticulously mapping all asset dependencies from ingestion to business intelligence. This solution enhances productivity and fosters collaboration between data engineers and consumers. Sifflet integrates smoothly with your existing data sources and tools, functioning on platforms like AWS, Google Cloud Platform, and Microsoft Azure. Maintain vigilance over your data's health and promptly notify your team when quality standards are not satisfied. With just a few clicks, you can establish essential coverage for all your tables. Additionally, you can customize the frequency of checks, their importance, and specific notifications simultaneously. Utilize machine learning-driven protocols to identify any data anomalies with no initial setup required. Every rule is supported by a unique model that adapts based on historical data and user input. You can also enhance automated processes by utilizing a library of over 50 templates applicable to any asset, thereby streamlining your monitoring efforts even further. This approach not only simplifies data management but also empowers teams to respond proactively to potential issues.

JAMS

JAMS Software

$833/month

JAMS serves as a comprehensive solution for workload automation and job scheduling, overseeing and managing workflows critical to business operations. This enterprise-grade software specializes in automating IT tasks, accommodating everything from basic batch jobs to intricate cross-platform workflows and scripts. JAMS seamlessly integrates with various enterprise technologies, enabling efficient, unattended job execution by allocating resources to execute jobs in a specific order, set time, or in response to specific triggers. With its centralized console, JAMS allows users to define, manage, and monitor essential batch processes effectively. Whether you’re executing straightforward command lines or orchestrating complex multi-step tasks that utilize ERPs, databases, and business intelligence tools, JAMS is designed to streamline your organization’s scheduling needs. Additionally, the software simplifies the transition of tasks from platforms like Windows Task Scheduler, SQL Agent, or Cron through built-in conversion tools, ensuring that jobs continue to run smoothly without requiring substantial effort during migration. Overall, JAMS empowers businesses to optimize their job scheduling processes efficiently and effectively.

Microsoft Purview

Microsoft

$0.342

Microsoft Purview serves as a comprehensive data governance platform that facilitates the management and oversight of your data across on-premises, multicloud, and software-as-a-service (SaaS) environments. With its capabilities in automated data discovery, sensitive data classification, and complete data lineage tracking, you can effortlessly develop a thorough and current representation of your data ecosystem. This empowers data users to access reliable and valuable data easily. The service provides automated identification of data lineage and classification across various sources, ensuring a cohesive view of your data assets and their interconnections for enhanced governance. Through semantic search, users can discover data using both business and technical terminology, providing insights into the location and flow of sensitive information within a hybrid data environment. By leveraging the Purview Data Map, you can lay the groundwork for effective data utilization and governance, while also automating and managing metadata from diverse sources. Additionally, it supports the classification of data using both predefined and custom classifiers, along with Microsoft Information Protection sensitivity labels, ensuring that your data governance framework is robust and adaptable. This combination of features positions Microsoft Purview as an essential tool for organizations seeking to optimize their data management strategies.

Ray

Anyscale

Free

You can develop on your laptop, then scale the same Python code elastically across hundreds of GPUs on any cloud. Ray translates existing Python concepts to the distributed setting, so any serial application can be parallelized with few code changes. With a strong ecosystem of distributed libraries, you can scale compute-heavy machine learning workloads such as model serving, deep learning, and hyperparameter tuning. Existing workloads (e.g. PyTorch) are easy to scale on Ray by using integrations. Native Ray libraries such as Ray Tune and Ray Serve make it easier to scale the most complex machine learning workloads, like hyperparameter tuning, deep learning model training, and reinforcement learning. In just 10 lines of code, you can get started with distributed hyperparameter tuning. Creating distributed apps is hard; Ray handles the distributed execution for you.

Dagster

Dagster Labs

$0

Dagster is the cloud-native open-source orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. It is the platform of choice for data teams responsible for the development, production, and observation of data assets. With Dagster, you can focus on running tasks, or you can identify the key assets you need to create using a declarative approach. Embrace CI/CD best practices from the get-go: build reusable components, spot data quality issues, and flag bugs early.

Oxla

$50 per CPU core / monthly

Designed specifically for optimizing compute, memory, and storage, Oxla serves as a self-hosted data warehouse that excels in handling large-scale, low-latency analytics while providing strong support for time-series data. While cloud data warehouses may suit many, they are not universally applicable; as operations expand, the ongoing costs of cloud computing can surpass initial savings on infrastructure, particularly in regulated sectors that demand comprehensive data control beyond mere VPC and BYOC setups. Oxla surpasses both traditional and cloud-based warehouses by maximizing efficiency, allowing for the scalability of expanding datasets with predictable expenses, whether on-premises or in various cloud environments. Deployment, execution, and maintenance of Oxla can be easily managed using Docker and YAML, enabling a range of workloads to thrive within a singular, self-hosted data warehouse. In this way, Oxla provides a tailored solution for organizations seeking both efficiency and control in their data management strategies.

emma

On demand

Emma gives you the ability to select the most suitable cloud providers and environments, allowing for adaptation to evolving demands while maintaining simplicity and control. It streamlines cloud management by integrating services and automating essential tasks, thereby minimizing complexity. The platform also enhances cloud resource optimization automatically, guaranteeing full utilization and lowering overhead costs. By supporting open standards, it offers flexibility that liberates businesses from dependency on specific vendors. With real-time monitoring and optimization of data traffic, it effectively prevents unexpected cost spikes through efficient resource allocation. You can establish your cloud infrastructure across various providers and environments, whether on-premises, private, hybrid, or public. Management of your consolidated cloud environment is made easy through a single, user-friendly interface. Additionally, you can gain crucial visibility to enhance infrastructure performance and reduce expenditures. By reclaiming control over your entire cloud ecosystem, you can also ensure compliance with regulatory standards while fostering innovation and growth. This comprehensive approach empowers businesses to stay competitive in an ever-changing digital landscape.

intermix.io

Intermix.io

$295 per month

Gather metadata from your data warehouse along with associated tools to monitor the workloads that are important to you, enabling a retrospective analysis of user interaction, expenses, and the efficiency of your data products. Achieve comprehensive insight into your data ecosystem, including who interacts with your data and the methods of usage. In our discussions, we highlight how various data teams successfully develop and implement data products within their organizations. We delve into technological frameworks, best practices, and valuable insights gained along the way. intermix.io offers a seamless solution for obtaining complete visibility through an intuitive SaaS dashboard. You can engage with your entire team, generate tailored reports, and access all necessary information to grasp the dynamics of your data platform, including your cloud data warehouse and its connected tools. intermix.io simplifies the process of collecting metadata from your data warehouse without requiring any coding skills. Importantly, we do not need to access any data stored within your data warehouse, ensuring your information remains secure while you focus on maximizing its potential. This approach not only enhances data governance but also empowers teams to make informed decisions based on accurate and timely data insights.

IRI FieldShield

IRI, The CoSort Company

IRI FieldShield® is a powerful and affordable data discovery and de-identification package for masking PII, PHI, PAN and other sensitive data in structured and semi-structured sources. Front-ended in a free Eclipse-based design environment, FieldShield jobs classify, profile, scan, and de-identify data at rest (static masking). Use the FieldShield SDK or proxy-based application to secure data in motion (dynamic data masking). The usual method for masking data in RDBs and flat files (CSV, Excel, LDIF, COBOL, etc.) is to classify it centrally, search for it globally, and automatically mask it in a consistent way using encryption, pseudonymization, redaction or other functions to preserve realism and referential integrity in production or test environments. Use FieldShield to make test data, nullify breaches, or comply with GDPR, HIPAA, PDPA, PCI-DSS and other laws. Audit through machine- and human-readable search reports, job logs and re-ID risk scores. Optionally mask data when you map it; FieldShield functions can also run in IRI Voracity ETL and federation, migration, replication, subsetting, and analytic jobs. To mask DB clones, run FieldShield in Windocks, Actifio or Commvault. Call it from CI/CD pipelines and apps.

Prophecy

$299 per month

Prophecy expands accessibility for a wider range of users, including visual ETL developers and data analysts, by allowing them to easily create pipelines through a user-friendly point-and-click interface combined with a few SQL expressions. While utilizing the Low-Code designer to construct workflows, you simultaneously generate high-quality, easily readable code for Spark and Airflow, which is then seamlessly integrated into your Git repository. The platform comes equipped with a gem builder, enabling rapid development and deployment of custom frameworks, such as those for data quality, encryption, and additional sources and targets that enhance the existing capabilities. Furthermore, Prophecy ensures that best practices and essential infrastructure are offered as managed services, simplifying your daily operations and overall experience. With Prophecy, you can achieve high-performance workflows that leverage the cloud's scalability and performance capabilities, ensuring that your projects run efficiently and effectively. This powerful combination of features makes it an invaluable tool for modern data workflows.

BentoML

Free

Deploy your machine learning model in the cloud within minutes using a consolidated packaging format that supports both online and offline operations across various platforms. Experience a performance boost with throughput that is 100 times greater than traditional Flask-based model servers, achieved through our micro-batching technique. Provide prediction services that align with DevOps practices and integrate with widely used infrastructure tools. The unified deployment format ensures high-performance model serving while incorporating best practices for DevOps. One example service uses a BERT model, trained with the TensorFlow framework, to gauge the sentiment of movie reviews. Our BentoML workflow eliminates the need for DevOps expertise, automating everything from prediction service registration to deployment and endpoint monitoring, all set up for your team. This creates a robust environment for managing substantial ML workloads in production. Ensure that all models, deployments, and updates are easily accessible, and maintain control over access through SSO, RBAC, client authentication, and detailed audit logs, thereby enhancing both security and transparency within your operations.

Ascend

$0.98 per DFC

Ascend provides data teams with a streamlined and automated platform that allows them to ingest, transform, and orchestrate their entire data engineering and analytics workloads at an unprecedented speed, achieving results ten times faster than before. This tool empowers teams that are often hindered by bottlenecks to effectively build, manage, and enhance the ever-growing volume of data workloads they face. With the support of DataAware intelligence, Ascend operates continuously in the background to ensure data integrity and optimize data workloads, significantly cutting down maintenance time by as much as 90%. Users can effortlessly create, refine, and execute data transformations through Ascend’s versatile flex-code interface, which supports the use of multiple programming languages such as SQL, Python, Java, and Scala interchangeably. Additionally, users can quickly access critical metrics including data lineage, data profiles, job and user logs, and system health indicators all in one view. Ascend also offers native connections to a continually expanding array of common data sources through its Flex-Code data connectors, ensuring seamless integration. This comprehensive approach not only enhances efficiency but also fosters stronger collaboration among data teams.

DQOps

$499 per month

DQOps is a data quality monitoring platform for data teams that helps detect and address quality issues before they impact your business. Track data quality KPIs on data quality dashboards and reach a 100% data quality score. DQOps helps monitor data warehouses and data lakes on the most popular data platforms. DQOps offers a built-in list of predefined data quality checks verifying key data quality dimensions. The extensibility of the platform allows you to modify existing checks or add custom, business-specific checks as needed. The DQOps platform easily integrates with DevOps environments and allows data quality definitions to be stored in a source repository along with the data pipeline code.

Decube

Decube is a comprehensive data management platform designed to help organizations manage their data observability, data catalog, and data governance needs. Our platform is designed to provide accurate, reliable, and timely data, enabling organizations to make better-informed decisions. Our data observability tools provide end-to-end visibility into data, making it easier for organizations to track data origin and flow across different systems and departments. With our real-time monitoring capabilities, organizations can detect data incidents quickly and reduce their impact on business operations. The data catalog component of our platform provides a centralized repository for all data assets, making it easier for organizations to manage and govern data usage and access. With our data classification tools, organizations can identify and manage sensitive data more effectively, ensuring compliance with data privacy regulations and policies. The data governance component of our platform provides robust access controls, enabling organizations to manage data access and usage effectively. Our tools also allow organizations to generate audit reports, track user activity, and demonstrate compliance with regulatory requirements.

ZenML

Free

Simplify your MLOps pipelines. ZenML allows you to manage, deploy and scale on any infrastructure. ZenML is open-source and free, and two simple commands are enough to try it. ZenML can be set up in minutes, and you can use all your existing tools. ZenML interfaces ensure your tools work seamlessly together. Scale up your MLOps stack gradually by changing components when your training or deployment needs change. Keep up to date with the latest developments in the MLOps industry and integrate them easily. Define simple, clear ML workflows and save time by avoiding boilerplate code or infrastructure tooling. Write portable ML code and switch from experiments to production in seconds. ZenML's plug-and-play integrations allow you to manage all your favorite MLOps software in one place. Prevent vendor lock-in by writing extensible, tooling-agnostic, and infrastructure-agnostic code.

Kedro

Free

Kedro serves as a robust framework for establishing clean data science practices. By integrating principles from software engineering, it enhances the efficiency of machine-learning initiatives. Within a Kedro project, you will find a structured approach to managing intricate data workflows and machine-learning pipelines. This allows you to minimize the time spent on cumbersome implementation tasks and concentrate on addressing innovative challenges. Kedro also standardizes the creation of data science code, fostering effective collaboration among team members in problem-solving endeavors. Transitioning smoothly from development to production becomes effortless with exploratory code that can evolve into reproducible, maintainable, and modular experiments. Additionally, Kedro features a set of lightweight data connectors designed to facilitate the saving and loading of data across various file formats and storage systems, making data management more versatile and user-friendly. Ultimately, this framework empowers data scientists to work more effectively and with greater confidence in their projects.

Secoda

$50 per user per month

With Secoda AI enhancing your metadata, you can effortlessly obtain contextual search results spanning your tables, columns, dashboards, metrics, and queries. This innovative tool also assists in generating documentation and queries from your metadata, which can save your team countless hours that would otherwise be spent on tedious tasks and repetitive data requests. You can easily conduct searches across all columns, tables, dashboards, events, and metrics with just a few clicks. The AI-driven search functionality allows you to pose any question regarding your data and receive quick, relevant answers. By integrating data discovery seamlessly into your workflow through our API, you can perform bulk updates, label PII data, manage technical debt, create custom integrations, pinpoint underutilized resources, and much more. By eliminating manual errors, you can establish complete confidence in your knowledge repository, ensuring that your team has the most accurate and reliable information at their fingertips. This transformative approach not only enhances productivity but also fosters a more informed decision-making process throughout your organization.

Yandex Data Proc

Yandex

$0.19 per hour

You determine the cluster size, node specifications, and a range of services, while Yandex Data Proc effortlessly sets up and configures Spark, Hadoop clusters, and additional components. Collaboration is enhanced through the use of Zeppelin notebooks and various web applications via a user interface proxy. You maintain complete control over your cluster with root access for every virtual machine. Moreover, you can install your own software and libraries on active clusters without needing to restart them. Yandex Data Proc employs instance groups to automatically adjust computing resources of compute subclusters in response to CPU usage metrics. Additionally, Data Proc facilitates the creation of managed Hive clusters, which helps minimize the risk of failures and data loss due to metadata issues. This service streamlines the process of constructing ETL pipelines and developing models, as well as managing other iterative operations. Furthermore, the Data Proc operator is natively integrated into Apache Airflow, allowing for seamless orchestration of data workflows. This means that users can leverage the full potential of their data processing capabilities with minimal overhead and maximum efficiency.

DoubleCloud

$0.024 per 1 GB per month

Optimize your time and reduce expenses by simplifying data pipelines using hassle-free open source solutions. Covering everything from data ingestion to visualization, all components are seamlessly integrated, fully managed, and exceptionally reliable, ensuring your engineering team enjoys working with data. You can opt for any of DoubleCloud’s managed open source services or take advantage of the entire platform's capabilities, which include data storage, orchestration, ELT, and instantaneous visualization. We offer premier open source services such as ClickHouse, Kafka, and Airflow, deployable on platforms like Amazon Web Services or Google Cloud. Our no-code ELT tool enables real-time data synchronization between various systems, providing a fast, serverless solution that integrates effortlessly with your existing setup. With our managed open-source data visualization tools, you can easily create real-time visual representations of your data through interactive charts and dashboards. Ultimately, our platform is crafted to enhance the daily operations of engineers, making their tasks more efficient and enjoyable. This focus on convenience is what sets us apart in the industry.

Tobiko

Free

Tobiko is an advanced data transformation platform designed to accelerate data delivery while enhancing efficiency and minimizing errors, all while maintaining compatibility with existing databases. It enables developers to create a development environment without the need to rebuild the entire Directed Acyclic Graph (DAG), as it smartly alters only the necessary components. When a new column is added, there's no requirement to reconstruct everything; the modifications you've made are already in place. Tobiko allows for instant promotion to production without requiring you to redo any of your previous work. It eliminates the hassle of debugging complex Jinja templates by allowing you to define your models directly in SQL. Whether at a startup or a large enterprise, Tobiko scales to meet the needs of any organization. It comprehends the SQL you create and enhances developer efficiency by identifying potential issues during the compilation process. Additionally, comprehensive audits and data comparisons offer validation, ensuring the reliability of the datasets produced. Each modification is carefully analyzed and categorized as either breaking or non-breaking, providing clarity on the impact of changes. In the event of errors, teams can conveniently roll back to previous versions, effectively minimizing production downtime and maintaining operational continuity. This seamless integration of features makes Tobiko not only a tool for data transformation but also a partner in fostering a more productive development environment.

Stackable

Free

The Stackable data platform was crafted with a focus on flexibility and openness. It offers a carefully selected range of open source data applications, including Apache Kafka, Apache Druid, Trino, and Apache Spark. Unlike many competitors that either promote their proprietary solutions or deepen vendor dependence, Stackable keeps every component open. All data applications are designed to integrate with one another and can be added or removed quickly. Built on Kubernetes, the platform can operate in any environment, whether on-premises or in the cloud. To set up your first Stackable data platform, all you require is stackablectl along with a Kubernetes cluster; within a few minutes, you can begin working with your data. Much like kubectl, stackablectl is a command line tool tailored for interaction with the Stackable Data Platform. Use it to deploy and manage Stackable data applications on Kubernetes, and to create, delete, and update components.

Ardent

Free

Ardent (available at tryardent.com) is a cutting-edge platform for AI data engineering that simplifies the building, maintenance, and scaling of data pipelines with minimal human input. Users can simply issue commands in natural language, while the system autonomously manages implementation, infers schemas, tracks lineage, and resolves errors. With its preconfigured ingestors, Ardent enables seamless connections to various data sources, including warehouses, orchestration systems, and databases, typically within 30 minutes. Additionally, it provides automated debugging capabilities by accessing web resources and documentation, having been trained on countless real engineering tasks to effectively address complex pipeline challenges without any manual intervention. Designed for production environments, Ardent adeptly manages numerous tables and pipelines at scale, executes parallel jobs, initiates self-healing workflows, and ensures data quality through monitoring, all while facilitating operations via APIs or a user interface. This unique approach not only enhances efficiency but also empowers teams to focus on strategic decision-making rather than routine technical tasks.

Apache Druid

Druid

Apache Druid is a distributed data storage solution that is open source. Its fundamental architecture merges concepts from data warehouses, time series databases, and search technologies to deliver a high-performance analytics database capable of handling a diverse array of applications. By integrating the essential features from these three types of systems, Druid optimizes its ingestion process, storage method, querying capabilities, and overall structure. Each column is stored and compressed separately, allowing the system to access only the relevant columns for a specific query, which enhances speed for scans, rankings, and groupings. Additionally, Druid constructs inverted indexes for string data to facilitate rapid searching and filtering. It also includes pre-built connectors for various platforms such as Apache Kafka, HDFS, and AWS S3, as well as stream processors and others. The system adeptly partitions data over time, making queries based on time significantly quicker than those in conventional databases. Users can easily scale resources by simply adding or removing servers, and Druid will manage the rebalancing automatically. Furthermore, its fault-tolerant design ensures resilience by effectively navigating around any server malfunctions that may occur. This combination of features makes Druid a robust choice for organizations seeking efficient and reliable real-time data analytics solutions.
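
To make the per-column storage and inverted indexes described above more concrete, here is a toy sketch in Python. It is not Druid's implementation: the `ToySegment` class, its method names, and the sample rows are all hypothetical, and it assumes every row has the same keys.

```python
from collections import defaultdict


class ToySegment:
    """Toy column store with an inverted index (hypothetical, not Druid code)."""

    def __init__(self, rows):
        # Store each column separately so a query reads only the columns it needs.
        # Assumption: every row carries the same keys, so row ids line up.
        self.columns = defaultdict(list)
        for row in rows:
            for name, value in row.items():
                self.columns[name].append(value)
        # Inverted index: for each string column, map value -> set of row ids.
        self.index = defaultdict(lambda: defaultdict(set))
        for name, values in self.columns.items():
            for row_id, value in enumerate(values):
                if isinstance(value, str):
                    self.index[name][value].add(row_id)

    def sum_where(self, metric, dimension, value):
        """Sum one metric column over the rows where dimension == value."""
        row_ids = self.index[dimension].get(value, set())
        return sum(self.columns[metric][i] for i in row_ids)


segment = ToySegment([
    {"country": "US", "page": "home", "clicks": 3},
    {"country": "DE", "page": "home", "clicks": 5},
    {"country": "US", "page": "docs", "clicks": 7},
])
print(segment.sum_where("clicks", "country", "US"))  # 10
```

The filter is answered from the index alone, and only the `clicks` column is read to compute the sum, which is the idea behind scanning just the relevant columns.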

AT&T Alien Labs Open Threat Exchange

AT&T Cybersecurity

The largest open threat intelligence community in the world fosters a collaborative defense through actionable threat data powered by its members. In the realm of cybersecurity, threat sharing often remains disorganized and casual, leading to significant gaps and challenges in response efforts. Our goal is to facilitate the rapid collection and dissemination of relevant, timely, and accurate information regarding new or ongoing cyber threats among companies and government entities, helping to avert major breaches or reduce the impact of attacks. The Alien Labs Open Threat Exchange (OTX™) transforms this ambition into reality by offering the first truly accessible threat intelligence community. OTX grants open access to a worldwide network of security professionals and threat researchers, boasting over 100,000 contributors from 140 nations who provide more than 19 million threat indicators each day. By delivering data generated by the community, OTX promotes collaborative investigations and streamlines the updating of security systems, ensuring that organizations remain resilient against evolving threats. This community-driven approach not only enhances collective knowledge but also strengthens overall cyber defense capabilities across the globe.

CrateDB

The enterprise database for time series, documents, and vectors. Store any type of data and combine the simplicity and scalability of NoSQL with SQL. CrateDB is a distributed database that runs queries in milliseconds, regardless of complexity, volume, and velocity.

Beats

Elastic

$16 per month

Beats serves as a free and accessible platform designed specifically for single-purpose data shippers that transport data from numerous machines and systems to Logstash or Elasticsearch. These open-source data shippers are installed as agents on your servers, enabling the seamless transfer of operational data to Elasticsearch. Elastic offers Beats to facilitate the collection of data and event logs efficiently. Data can be directed to Elasticsearch or routed through Logstash, allowing for additional processing and enhancement before visualization in Kibana. If you're eager to start monitoring infrastructure metrics and centralizing log analytics swiftly, the Metrics app and Logs app in Kibana are excellent resources to explore. For comprehensive guidance, refer to Analyze metrics and Monitor logs. Filebeat simplifies the process of collecting data from various sources, including security devices, cloud environments, containers, and hosts, by providing a lightweight solution to forward and centralize logs and files. This flexibility ensures that you can maintain an organized and efficient data pipeline regardless of the complexity of your infrastructure.

IRI Voracity

IRI, The CoSort Company

IRI Voracity is an end-to-end software platform for fast, affordable, and ergonomic data lifecycle management. Voracity speeds, consolidates, and often combines the key activities of data discovery, integration, migration, governance, and analytics in a single pane of glass, built on Eclipse™. Through its revolutionary convergence of capability and its wide range of job design and runtime options, Voracity bends the multi-tool cost, difficulty, and risk curves away from megavendor ETL packages, disjointed Apache projects, and specialized software. Voracity uniquely delivers the ability to perform data:

* profiling and classification
* searching and risk-scoring
* integration and federation
* migration and replication
* cleansing and enrichment
* validation and unification
* masking and encryption
* reporting and wrangling
* subsetting and testing

Voracity runs on-premise, or in the cloud, on physical or virtual machines, and its runtimes can also be containerized or called from real-time applications or batch jobs.

Datakin

$2 per month

Uncover the hidden order within your intricate data landscape and consistently know where to seek solutions. Datakin seamlessly tracks data lineage, presenting your entire data ecosystem through an engaging visual graph. This visualization effectively highlights the upstream and downstream connections associated with each dataset. The Duration tab provides an overview of a job’s performance in a Gantt-style chart, complemented by its upstream dependencies, which simplifies the identification of potential bottlenecks. When it's essential to determine the precise moment a breaking change occurs, the Compare tab allows you to observe how your jobs and datasets have evolved between different runs. Occasionally, jobs that complete successfully may yield poor output. The Quality tab reveals crucial data quality metrics and their fluctuations over time, making anomalies starkly apparent. By facilitating the swift identification of root causes for issues, Datakin also plays a vital role in preventing future complications from arising. This proactive approach ensures that your data remains reliable and efficient in supporting your business needs.
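
The upstream and downstream connections mentioned above amount to walking a directed graph in both directions. The sketch below illustrates that idea in plain Python; it is not Datakin's API, and the function names and the dataset and job names are hypothetical.

```python
from collections import defaultdict, deque


def build_lineage(edges):
    """Return (downstream, upstream) adjacency maps from (source, target) pairs."""
    downstream, upstream = defaultdict(set), defaultdict(set)
    for source, target in edges:
        downstream[source].add(target)
        upstream[target].add(source)
    return downstream, upstream


def reachable(graph, start):
    """Breadth-first walk collecting every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


# Hypothetical dataset and job names.
edges = [
    ("raw.orders", "job.clean_orders"),
    ("job.clean_orders", "staging.orders"),
    ("staging.orders", "job.daily_revenue"),
    ("job.daily_revenue", "reports.revenue"),
]
downstream, upstream = build_lineage(edges)
print(sorted(reachable(upstream, "staging.orders")))    # ['job.clean_orders', 'raw.orders']
print(sorted(reachable(downstream, "staging.orders")))  # ['job.daily_revenue', 'reports.revenue']
```

The upstream walk answers "what could have broken this dataset?", while the downstream walk answers "what will be affected if it changes?".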

Google Cloud Composer

Google

$0.074 per vCPU hour

The managed features of Cloud Composer, along with its compatibility with Apache Airflow, enable you to concentrate on crafting, scheduling, and overseeing your workflows rather than worrying about resource provisioning. Its seamless integration with various Google Cloud products such as BigQuery, Dataflow, Dataproc, Datastore, Cloud Storage, Pub/Sub, and AI Platform empowers users to orchestrate their data pipelines effectively. You can manage your workflows from a single orchestration tool, regardless of whether your pipeline operates on-premises, in multiple clouds, or entirely within Google Cloud. This solution simplifies your transition to the cloud and supports a hybrid data environment by allowing you to orchestrate workflows that span both on-premises setups and the public cloud. By creating workflows that interconnect data, processing, and services across different cloud platforms, you can establish a cohesive data ecosystem that enhances efficiency and collaboration. Additionally, this unified approach not only streamlines operations but also optimizes resource utilization across various environments.

Amazon MWAA

Amazon

$0.49 per hour

Amazon Managed Workflows for Apache Airflow (MWAA) is a service that simplifies the orchestration of Apache Airflow, allowing users to efficiently establish and manage comprehensive data pipelines in the cloud at scale. Apache Airflow itself is an open-source platform designed for the programmatic creation, scheduling, and oversight of workflows, which are sequences of various processes and tasks. By utilizing Managed Workflows, users can leverage Airflow and Python to design workflows while eliminating the need to handle the complexities of the underlying infrastructure, ensuring scalability, availability, and security. This service adapts its workflow execution capabilities automatically to align with user demands and incorporates AWS security features, facilitating swift and secure data access. Overall, MWAA empowers organizations to focus on their data processes without the burden of infrastructure management.
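
A workflow of dependent tasks, as described above, can be pictured as a dependency graph that a scheduler resolves into a run order. The following sketch uses only the Python standard library to show that ordering step; it does not use the Airflow or MWAA APIs, and the task names and the `run_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on (hypothetical task names).
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "load": {"validate"},
}


def run_order(deps):
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


print(run_order(dependencies))  # ['extract', 'transform', 'validate', 'load']
```

A real orchestrator adds scheduling, retries, and parallel execution on top of this ordering, which is the part a managed service takes off your hands.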

Telmai

A low-code, no-code strategy enhances data quality management. This software-as-a-service (SaaS) model offers flexibility, cost-effectiveness, seamless integration, and robust support options. It maintains rigorous standards for encryption, identity management, role-based access control, data governance, and compliance. Utilizing advanced machine learning algorithms, it identifies anomalies in row-value data, with the capability to evolve alongside the unique requirements of users' businesses and datasets. Users can incorporate numerous data sources, records, and attributes effortlessly, making the platform resilient to unexpected increases in data volume. It accommodates both batch and streaming processing, ensuring that data is consistently monitored to provide real-time alerts without affecting pipeline performance. The platform offers a smooth onboarding, integration, and investigation process, making it accessible to data teams aiming to proactively spot and analyze anomalies as they arise. With a no-code onboarding process, users can simply connect to their data sources and set their alerting preferences. Telmai intelligently adapts to data patterns, notifying users of any significant changes, ensuring that they remain informed and prepared for any data fluctuations.

Chalk

Free

Experience robust data engineering processes free from the challenges of infrastructure management. By utilizing straightforward, modular Python, you can define intricate streaming, scheduling, and data backfill pipelines with ease. Transition from traditional ETL methods and access your data instantly, regardless of its complexity. Seamlessly blend deep learning and large language models with structured business datasets to enhance decision-making. Improve forecasting accuracy using up-to-date information, eliminate the costs associated with vendor data pre-fetching, and conduct timely queries for online predictions. Test your ideas in Jupyter notebooks before moving them to a live environment. Avoid discrepancies between training and serving data while developing new workflows in mere milliseconds. Monitor all of your data operations in real-time to effortlessly track usage and maintain data integrity. Have full visibility into everything you've processed and the ability to replay data as needed. Easily integrate with existing tools and deploy on your infrastructure, while setting and enforcing withdrawal limits with tailored hold periods. With such capabilities, you can not only enhance productivity but also ensure streamlined operations across your data ecosystem.

Foundational

Detect and address code and optimization challenges in real-time, mitigate data incidents before deployment, and oversee data-affecting code modifications comprehensively—from the operational database to the user interface dashboard. With automated, column-level data lineage tracing the journey from the operational database to the reporting layer, every dependency is meticulously examined. Foundational automates the enforcement of data contracts by scrutinizing each repository in both upstream and downstream directions, directly from the source code. Leverage Foundational to proactively uncover code and data-related issues, prevent potential problems, and establish necessary controls and guardrails. Moreover, implementing Foundational can be achieved in mere minutes without necessitating any alterations to the existing codebase, making it an efficient solution for organizations. This streamlined setup promotes quicker response times to data governance challenges.

Orchestra

Orchestra serves as a Comprehensive Control Platform for Data and AI Operations, aimed at empowering data teams to effortlessly create, deploy, and oversee workflows. This platform provides a declarative approach that merges coding with a graphical interface, enabling users to develop workflows at a tenfold speed while cutting maintenance efforts by half. Through its real-time metadata aggregation capabilities, Orchestra ensures complete data observability, facilitating proactive alerts and swift recovery from any pipeline issues. It smoothly integrates with a variety of tools such as dbt Core, dbt Cloud, Coalesce, Airbyte, Fivetran, Snowflake, BigQuery, Databricks, and others, ensuring it fits well within existing data infrastructures. With a modular design that accommodates AWS, Azure, and GCP, Orchestra proves to be a flexible option for businesses and growing organizations looking to optimize their data processes and foster confidence in their AI ventures. Additionally, its user-friendly interface and robust connectivity options make it an essential asset for organizations striving to harness the full potential of their data ecosystems.

OpenMetadata

OpenMetadata serves as a comprehensive, open platform for unifying metadata, facilitating data discovery, observability, and governance through a single interface. By utilizing a Unified Metadata Graph alongside over 80 ready-to-use connectors, it aggregates metadata from various sources such as databases, pipelines, BI tools, and ML systems, thereby offering an extensive context for teams to effectively search, filter, and visualize assets throughout their organization. The platform is built on an API- and schema-first architecture, which provides flexible metadata entities and relationships, allowing organizations to tailor their metadata structure with precision. Comprising only four essential system components, OpenMetadata is crafted for straightforward installation and operation, ensuring scalable performance that empowers both technical and non-technical users to work together seamlessly on discovery, lineage tracking, quality assurance, observability, collaboration, and governance tasks without the need for intricate infrastructure. This versatility makes it an invaluable tool for organizations aiming to harness their data assets more effectively.

Zipher

Zipher is an innovative optimization platform that autonomously enhances the performance and cost-effectiveness of workloads on Databricks by removing the need for manual tuning and resource management, all while making real-time adjustments to clusters. Utilizing advanced proprietary machine learning algorithms, Zipher features a unique Spark-aware scaler that actively learns from and profiles workloads to determine the best resource allocations, optimize configurations for each job execution, and fine-tune various settings such as hardware, Spark configurations, and availability zones, thereby maximizing operational efficiency and minimizing waste. The platform continuously tracks changing workloads to modify configurations, refine scheduling, and distribute shared compute resources effectively to adhere to service level agreements (SLAs), while also offering comprehensive cost insights that dissect expenses related to Databricks and cloud services, enabling teams to pinpoint significant cost influencers. Furthermore, Zipher ensures smooth integration with major cloud providers like AWS, Azure, and Google Cloud, and is compatible with popular orchestration and infrastructure-as-code (IaC) tools, making it a versatile solution for various cloud environments. Its ability to adaptively respond to workload changes sets Zipher apart as a crucial tool for organizations striving to optimize their cloud operations.

Mode

Mode Analytics

Gain insights into user interactions with your product and pinpoint areas of opportunity to guide your product strategy. Mode enables a single Stitch analyst to accomplish what typically requires an entire data team by offering rapid, adaptable, and collaborative tools. Create dashboards that track annual revenue and utilize chart visualizations to quickly spot anomalies. Develop well-crafted reports suitable for investors or facilitate collaboration by sharing your analyses with different teams. Integrate your complete technology ecosystem with Mode to uncover upstream problems and enhance overall performance. Accelerate cross-team workflows using APIs and webhooks. By analyzing user engagement, you can discover opportunity areas that help refine your product decisions. Additionally, utilize insights from marketing and product data to address vulnerabilities in your sales funnel, optimize landing-page efficiency, and anticipate churn before it occurs, ensuring proactive measures are in place.

IBM Databand

IBM

Keep a close eye on your data health and the performance of your pipelines. Achieve comprehensive oversight for pipelines utilizing cloud-native technologies such as Apache Airflow, Apache Spark, Snowflake, BigQuery, and Kubernetes. This observability platform is specifically designed for Data Engineers. As the challenges in data engineering continue to escalate due to increasing demands from business stakeholders, Databand offers a solution to help you keep pace. With the rise in the number of pipelines comes greater complexity. Data engineers are now handling more intricate infrastructures than they ever have before while also aiming for quicker release cycles. This environment makes it increasingly difficult to pinpoint the reasons behind process failures, delays, and the impact of modifications on data output quality. Consequently, data consumers often find themselves frustrated by inconsistent results, subpar model performance, and slow data delivery. A lack of clarity regarding the data being provided or the origins of failures fosters ongoing distrust. Furthermore, pipeline logs, errors, and data quality metrics are often gathered and stored in separate, isolated systems, complicating the troubleshooting process. To address these issues effectively, a unified observability approach is essential for enhancing trust and performance in data operations.

MaxPatrol

Positive Technologies

MaxPatrol is designed to oversee vulnerabilities and ensure compliance within corporate information systems. Central to its functionality are penetration testing, system evaluations, and compliance oversight. These components provide a comprehensive view of security across the entire IT infrastructure while also offering detailed insights at the departmental, host, and application levels, delivering essential information that facilitates the swift identification of vulnerabilities and the prevention of potential attacks. Additionally, MaxPatrol streamlines the process of maintaining an updated inventory of IT assets. It allows users to access details regarding network resources—including network addresses, operating systems, and available applications and services—while also identifying the hardware and software in operation and tracking the status of updates. Remarkably, it monitors changes within the IT infrastructure without missing a beat, detecting new accounts and hosts as they emerge and adapting to updates in hardware and software. Data regarding the security status of the infrastructure is continuously gathered and analyzed, ensuring that organizations have the insights necessary to maintain robust security protocols. This proactive approach not only enhances security awareness but also empowers teams to respond effectively to emerging threats.

lakeFS

Treeverse

lakeFS allows you to control your data lake similarly to how you manage your source code, facilitating parallel pipelines for experimentation as well as continuous integration and deployment for your data. This platform streamlines the workflows of engineers, data scientists, and analysts who are driving innovation through data. As an open-source solution, lakeFS enhances the resilience and manageability of object-storage-based data lakes. With lakeFS, you can execute reliable, atomic, and versioned operations on your data lake, encompassing everything from intricate ETL processes to advanced data science and analytics tasks. It is compatible with major cloud storage options, including AWS S3, Azure Blob Storage, and Google Cloud Storage (GCS). Furthermore, lakeFS seamlessly integrates with a variety of modern data frameworks such as Spark, Hive, AWS Athena, and Presto, thanks to its API compatibility with S3. The platform features a Git-like model for branching and committing that can efficiently scale to handle exabytes of data while leveraging the storage capabilities of S3, GCS, or Azure Blob. In addition, lakeFS empowers teams to collaborate more effectively by allowing multiple users to work on the same dataset without conflicts, making it an invaluable tool for data-driven organizations.
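
The Git-like branching and committing model described above can be illustrated with a small in-memory toy. This is a sketch of the concept only, not the lakeFS API or its storage design; the `ToyRepo` class, its methods, and the object keys are hypothetical.

```python
class ToyRepo:
    """Toy Git-like versioning over object keys (hypothetical, not the lakeFS API)."""

    def __init__(self):
        self.commits = {0: {}}        # commit id -> snapshot of key -> value
        self.branches = {"main": 0}   # branch name -> commit id it points at
        self.staged = {"main": {}}    # branch name -> uncommitted writes

    def branch(self, name, source):
        # A new branch is only a pointer to an existing commit; no objects are copied.
        self.branches[name] = self.branches[source]
        self.staged[name] = {}

    def put(self, branch, key, value):
        # Writes stay staged and invisible to readers until committed.
        self.staged[branch][key] = value

    def commit(self, branch):
        # Publish all staged writes together as one new snapshot.
        snapshot = {**self.commits[self.branches[branch]], **self.staged[branch]}
        commit_id = len(self.commits)
        self.commits[commit_id] = snapshot
        self.branches[branch] = commit_id
        self.staged[branch] = {}
        return commit_id

    def get(self, branch, key):
        # Reads see only committed data on the given branch.
        return self.commits[self.branches[branch]].get(key)


repo = ToyRepo()
repo.put("main", "events/2024.parquet", "v1")
repo.commit("main")
repo.branch("experiment", "main")
repo.put("experiment", "events/2024.parquet", "v2")
repo.commit("experiment")
print(repo.get("main", "events/2024.parquet"))        # v1
print(repo.get("experiment", "events/2024.parquet"))  # v2
```

Changes on the `experiment` branch never touch `main`, which is what lets parallel pipelines experiment against the same data without conflicts.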

Datafold

Eliminate data outages by proactively identifying and resolving data quality problems before they enter production. Achieve full test coverage of your data pipelines in just one day, going from 0 to 100%. With automatic regression testing across billions of rows, understand the impact of each code modification. Streamline change management processes, enhance data literacy, ensure compliance, and minimize the time taken to respond to incidents. Stay ahead of potential data issues by utilizing automated anomaly detection, ensuring you're always informed. Datafold’s flexible machine learning model adjusts to seasonal variations and trends in your data, allowing for the creation of dynamic thresholds. Save significant time spent analyzing data by utilizing the Data Catalog, which simplifies the process of locating relevant datasets and fields while providing easy exploration of distributions through an intuitive user interface. Enjoy features like interactive full-text search, data profiling, and a centralized repository for metadata, all designed to enhance your data management experience. By leveraging these tools, you can transform your data processes and improve overall efficiency.
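
Regression testing a code change, as described above, boils down to comparing a table before and after the change. The sketch below shows a minimal key-based row diff in Python; it is an illustration under stated assumptions, not Datafold's algorithm or API, and the `diff_rows` function and sample rows are hypothetical.

```python
def diff_rows(before, after, key):
    """Compare two lists of dict rows by primary key (assumes keys are unique)."""
    old = {row[key]: row for row in before}
    new = {row[key]: row for row in after}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


# Hypothetical output of the same pipeline before and after a code change.
before = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 3, "amount": 30}]
after = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}, {"id": 4, "amount": 40}]
print(diff_rows(before, after, "id"))
# {'added': [4], 'removed': [3], 'changed': [2]}
```

At production scale such a comparison would run inside the warehouse rather than in application memory, but the categories reported are the same.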

Great Expectations

Great Expectations is a shared, open standard for data quality. It helps data teams reduce pipeline problems through data testing, documentation, and profiling. It is advisable to install it within a virtual environment so that its dependencies stay isolated. For those unfamiliar with pip, virtual environments, notebooks, or git, the Supporting resources are a good starting point. Many companies already use Great Expectations in their operations, and the case studies show how various organizations have integrated it into their data infrastructure. Additionally, Great Expectations Cloud is a fully managed Software as a Service (SaaS) offering that is currently welcoming new private alpha members. Alpha members get early access to new features and can provide feedback that shapes the future development of the product.

Meltano

Meltano offers unparalleled flexibility in how you can deploy your data solutions. Take complete ownership of your data infrastructure from start to finish. With an extensive library of over 300 connectors that have been successfully operating in production for several years, you have a wealth of options at your fingertips. You can execute workflows in separate environments, perform comprehensive end-to-end tests, and maintain version control over all your components. The open-source nature of Meltano empowers you to create the ideal data setup tailored to your needs. By defining your entire project as code, you can work collaboratively with your team with confidence. The Meltano CLI streamlines the project creation process, enabling quick setup for data replication. Specifically optimized for managing transformations, Meltano is the ideal platform for running dbt. Your entire data stack is encapsulated within your project, simplifying the production deployment process. Furthermore, you can validate any changes made in the development phase before progressing to continuous integration, and subsequently to staging, prior to final deployment in production. This structured approach ensures a smooth transition through each stage of your data pipeline.

Metaphor

Metaphor Data

With automated indexing of warehouses, lakes, dashboards, and various components of your data ecosystem, Metaphor enhances data visibility by integrating utilization metrics, lineage tracking, and social popularity indicators to present the most reliable data to your audience. It fosters a comprehensive view of data and facilitates discussions about it across the organization, ensuring that everyone has access to crucial information. Engage with your clients by seamlessly sharing catalog artifacts, including documentation, directly within Slack. You can also tag meaningful conversations in Slack and link them to specific data points. This promotes collaboration by enabling the organic discovery of key terms and usage patterns, breaking down silos effectively. Discovering data throughout your entire stack becomes effortless, and you can create both technical documentation and user-friendly wikis that cater to non-technical stakeholders. Furthermore, you can provide direct support to users in Slack and leverage the catalog as a Data Enablement tool, streamlining the onboarding process for a more tailored user experience. Ultimately, this approach not only enhances data accessibility but also strengthens the overall data literacy within your organization.

rudol

$0

You can unify your data catalog, reduce communication overhead, and enable quality control for any employee of your company without having to deploy or install anything. rudol is a data platform that helps companies understand all of their data sources, regardless of where they come from. It reduces back-and-forth communication and urgent requests in reporting processes, and it enables data quality diagnosis and issue prevention for all company members. Each organization can add data sources from rudol's growing list of providers and BI tools that have a standardized structure: MySQL, PostgreSQL, Redshift, Snowflake, Kafka, S3*, BigQuery*, MongoDB*, Tableau*, PowerBI*, and Looker* (*in development). No matter where the data comes from, anyone can easily understand where it is stored, read its documentation, and contact data owners via our integrations.

Apache Airflow Integrations

The Apache Software Foundation

What Integrates with Apache Airflow?

DataHub

Stonebranch

DataBuck

Coursebox AI

Netdata

Sifflet

JAMS

Microsoft Purview

Ray

Dagster

Oxla

emma

intermix.io

IRI FieldShield

Prophecy

BentoML

Ascend

DQOps

Decube

ZenML

Kedro

Secoda

Yandex Data Proc

DoubleCloud

Tobiko

Stackable

Ardent

Apache Druid

AT&T Alien Labs Open Threat Exchange

CrateDB

Beats

IRI Voracity

Datakin

Google Cloud Composer

Amazon MWAA

Telmai

Chalk

Foundational

Orchestra

OpenMetadata

Zipher

Mode

IBM Databand

MaxPatrol

lakeFS

Datafold

Great Expectations

Meltano

Metaphor

rudol
