跳转到主要内容

标签(标签)

资源精选(342) Go开发(108) Go语言(103) Go(99) angular(82) LLM(75) 大语言模型(63) 人工智能(53) 前端开发(50) LangChain(43) golang(43) 机器学习(39) Go工程师(38) Go程序员(38) Go开发者(36) React(33) Go基础(29) Python(24) Vue(22) Web开发(20) Web技术(19) 精选资源(19) 深度学习(19) Java(18) ChatGTP(17) Cookie(16) android(16) 前端框架(13) JavaScript(13) Next.js(12) 安卓(11) typescript(10) 资料精选(10) NLP(10) 第三方Cookie(9) Redwoodjs(9) LLMOps(9) Go语言中级开发(9) 自然语言处理(9) 聊天机器人(9) PostgreSQL(9) 区块链(9) mlops(9) 安全(9) 全栈开发(8) ChatGPT(8) OpenAI(8) Linux(8) AI(8) GraphQL(8) iOS(8) 软件架构(7) Go语言高级开发(7) AWS(7) C++(7) 数据科学(7) whisper(6) Prisma(6) 隐私保护(6) RAG(6) JSON(6) DevOps(6) 数据可视化(6) wasm(6) 计算机视觉(6) 算法(6) Rust(6) 微服务(6) 隐私沙盒(5) FedCM(5) 语音识别(5) Angular开发(5) 快速应用开发(5) 提示工程(5) Agent(5) LLaMA(5) 低代码开发(5) Go测试(5) gorm(5) REST API(5) 推荐系统(5) WebAssembly(5) GameDev(5) CMS(5) CSS(5) machine-learning(5) 机器人(5) 游戏开发(5) Blockchain(5) Web安全(5) Kotlin(5) 低代码平台(5) 机器学习资源(5) Go资源(5) Nodejs(5) PHP(5) Swift(5) 智能体(4) devin(4) Blitz(4) javascript框架(4) Redwood(4) GDPR(4) 生成式人工智能(4) Angular16(4) Alpaca(4) SAML(4) JWT(4) JSON处理(4) Go并发(4) kafka(4) 移动开发(4) 移动应用(4) security(4) 隐私(4) spring-boot(4) 物联网(4) nextjs(4) 网络安全(4) API(4) Ruby(4) 信息安全(4) flutter(4) 专家智能体(3) Chrome(3) CHIPS(3) 3PC(3) SSE(3) 人工智能软件工程师(3) LLM Agent(3) Remix(3) Ubuntu(3) GPT4All(3) 软件开发(3) 问答系统(3) 开发工具(3) 最佳实践(3) RxJS(3) SSR(3) Node.js(3) Dolly(3) 移动应用开发(3) 编程语言(3) 低代码(3) IAM(3) Web框架(3) CORS(3) 基准测试(3) Go语言数据库开发(3) Oauth2(3) 并发(3) 主题(3) Theme(3) earth(3) nginx(3) 软件工程(3) azure(3) keycloak(3) 生产力工具(3) gpt3(3) 工作流(3) C(3) jupyter(3) 认证(3) prometheus(3) GAN(3) Spring(3) 逆向工程(3) 应用安全(3) Docker(3) Django(3) R(3) .NET(3) 大数据(3) Hacking(3) 渗透测试(3) C++资源(3) Mac(3) 微信小程序(3) Python资源(3) JHipster(3) 大型语言模型(2) 语言模型(2) 可穿戴设备(2) JDK(2) SQL(2) Apache(2) Hashicorp Vault(2) Spring Cloud Vault(2) Go语言Web开发(2) Go测试工程师(2) WebSocket(2) 容器化(2) AES(2) 加密(2) 输入验证(2) ORM(2) Fiber(2) Postgres(2) Gorilla Mux(2) Go数据库开发(2) 模块(2) 泛型(2) 指针(2) HTTP(2) PostgreSQL开发(2) Vault(2) K8s(2) Spring boot(2) R语言(2) 深度学习资源(2) 半监督学习(2) semi-supervised-learning(2) architecture(2) 普罗米修斯(2) 嵌入模型(2) productivity(2) 编码(2) Qt(2) 前端(2) Rust语言(2) NeRF(2) 神经辐射场(2) 元宇宙(2) CPP(2) 数据分析(2) spark(2) 流处理(2) Ionic(2) 人体姿势估计(2) human-pose-estimation(2) 视频处理(2) deep-learning(2) kotlin语言(2) kotlin开发(2) burp(2) Chatbot(2) npm(2) quantum(2) OCR(2) 游戏(2) game(2) 内容管理系统(2) MySQL(2) python-books(2) pentest(2) opengl(2) IDE(2) 漏洞赏金(2) Web(2) 知识图谱(2) PyTorch(2) 数据库(2) reverse-engineering(2) 数据工程(2) swift开发(2) rest(2) robotics(2) ios-animation(2) 知识蒸馏(2) 安卓开发(2) nestjs(2) solidity(2) 爬虫(2) 面试(2) 容器(2) C++精选(2) 人工智能资源(2) Machine Learning(2) 备忘单(2) 编程书籍(2) angular资源(2) 速查表(2) cheatsheets(2) SecOps(2) mlops资源(2) R资源(2) DDD(2) 架构设计模式(2) 量化(2) Hacking资源(2) 强化学习(2) flask(2) 设计(2) 性能(2) Sysadmin(2) 系统管理员(2) Java资源(2) 机器学习精选(2) android资源(2) android-UI(2) Mac资源(2) iOS资源(2) Vue资源(2) flutter资源(2) JavaScript精选(2) JavaScript资源(2) Rust开发(2) deeplearning(2) RAD(2)
SEO Title

A curated list of notable ETL (extract, transform, load) frameworks, libraries and software.

Related Lists

Workflow Management/Engines

  • Airflow - "Use airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed."
  • Azkaban - "a batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy to use web user interface to maintain and track your workflows."
  • Dray.it - "Docker workflow engine. Allows users to separate a workflow into discrete steps each to be handled by a single container."
  • Luigi - "a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in."
  • Mara Pipelines - "A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow"
  • Pinball - "a scalable workflow management platform developed at Pinterest. It is built based on layered approach."
  • prefect - "a new workflow management system, designed for modern infrastructure and powered by the open-source Prefect Core workflow engine. Users organize Tasks into Flows, and Prefect takes care of the rest."
  • TaskFlow - "allows the creation of lightweight task objects and/or functions that are combined together into flows (aka: workflows) in a declarative manner. It includes engines for running these flows in a manner that can be stopped, resumed, and safely reverted."
  • Toil - Similar to Luigi, jobs are classes with a run method. Supports executing jobs on other machines (workers) which can include AWS spot instances.
  • Argo - Container based workflow management system for Kubernetes. Workflows are specified as a directed acyclic graph (DAG), and each step is executed on a container, and the latter is run on a Kubernetes Pod. There is also support for Airflow DAGs.
  • Dagster - "Dagster is a data orchestrator for machine learning, analytics, and ETL. It lets you define pipelines in terms of the data flow between reusable, logical components, then test locally and run anywhere. With a unified view of pipelines and the assets they produce, Dagster can schedule and orchestrate Pandas, Spark, SQL, or anything else that Python can invoke."

Job Scheduling

  • Chronos - "a distributed and fault-tolerant scheduler that runs on top of Apache Mesos that can be used for job orchestration."
  • Dagobah - "a simple dependency-based job scheduler written in Python. Dagobah allows you to schedule periodic jobs using Cron syntax. Each job then kicks off a series of tasks (subprocesses) in an order defined by a dependency graph you can easily draw with click-and-drag in the web interface."
  • Jenkins - "the leading open-source automation server. Built with Java, it provides over 1000 plugins to support automating virtually anything, so that humans can actually spend their time doing things machines cannot."

Java

  • GETL - Groovy toolbox for ETL Tasks from practicing architectures
  • JSR 352 - Java native API for batch processing
  • Scriptella - Java-XML ETL toolbox for every day use.
  • Spring Batch - ETL on Spring ecosystem

Python

Libraries

  • BeautifulSoup - Popular library used to extract data from web pages.
  • Blaze - "translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems."
  • Bonobo - Simple, modern and atomic data transformation graphs for Python 3.5+.
  • Celery - "an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well."
  • Dask - Ever tried using Pandas to process data that won't fit into memory? Dask makes it easy. Dask also has functionality to make it easy to processing continuous streams of data.
  • dataset - A wrapper around SQLAlchemy that simplifies database operations (including upserting).
  • ijson - Allows processing JSON iteratively (as a stream) without loading the whole file into memory at once.
  • Joblib - "a set of tools to provide lightweight pipelining in Python."
  • lxml - Parses XML using C libraries libxml2 and libxslt, so it's very fast. Also supports a "recover" mode that will try its best to use invalid xml or discard it. Great for large XML files and advanced functionality (like using xpaths). IBM also has a great article on high-performance parsing with lxml here: http://www.ibm.com/developerworks/library/x-hiperfparse/
  • MrJob - "lets you write MapReduce jobs in Python 2.6+ and run them on several platforms. The easiest route to writing Python programs that run on Hadoop."
  • Odo - Moves data across containers (SQL, CSV, MongoDB, Pandas, etc). Claims to be the easiest and fastest way to load a CSV into your database.
  • Pandas - Implements dataframes in Python for easier data processing and includes a number of tools that make it easier to extract data from multiple file formats.
  • parse - The opposite of Python's format(). Easier to use than regex, but more limited.
  • PETL - "a general purpose Python package for extracting, transforming and loading tables of data." Slower than Pandas and not as good for larger amounts of data, but simpler.
  • PyQuery - Extracts data from web pages with a jquery-like syntax.
  • Retrying - Allows you to add a decorator to any function/method to retry on an exception.
  • Requests-HTML - Combines PyQuery, Requests, parse, and other libraries for a pleasant and intuitive web scraping experience.
  • riko - A python stream processing engine modeled after Yahoo! Pipes.
  • Ruffus - "The Ruffus module is a lightweight way to add support for running computational pipelines."
  • SQLAlchemy - "the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL."
  • Toolz - "A functional standard library for python." Includes a pipe function that allows you to pipe a value through a sequence of functions. There's also a cython implementation here: https://github.com/pytoolz/cytoolz
  • xmltodict - Makes working with XML as easy as working with JSON. Also allows streaming so you don't run out of memory on large XML files. Great for simple operations on small XML files.

Talks/Articles

Ruby

  • Kiba - "Data processing & ETL framework for Ruby"
  • nokogiri - an excellent XML parser that "just works"
  • Sequel - "The Database Toolkit for Ruby"
  • Embulk - "a parallel bulk data loader that helps data transfer between various storages, databases, NoSQL and cloud services."

Go

  • Benthos - "The stream processor for mundane tasks."
  • Crunch - "A fast to develop, fast to run, Go based toolkit for ETL and feature extraction on Hadoop."
  • Pachyderm - A system for running processing pipeline jobs in containers and version controlling all data using a commit-based distributed filesystem.

Javascript

  • Datapumps - "Use pumps to import, export, transform or transfer data."
  • NoFlo - "a JavaScript implementation of Flow-Based Programming"

Talks/Articles

Cloud Services

  • Alteryx - Cloud ETL tool with an interface similar to GUI ETL tools.
  • AWS Data Pipeline - "a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premise data sources, at specified intervals."
  • AWS Glue - AWS Glue generates the code (using Python and Spark) to execute your data transformations and data loading processes.
  • Amazon Simple Workflow Service (SWF) - "helps developers build, run, and scale background jobs that have parallel or sequential steps. You can think of Amazon SWF as a fully-managed state tracker and task coordinator in the Cloud."
  • AWS Batch - Allows executing jobs as containerized applications running on Amazon ECS. Also includes features for dynamically bidding for Spot Instances, integration with existing workflow engines, scheduling, monitoring, dependency modeling, and dynamic scaling/provisioning based on amount of work.
  • Google Dataflow - "Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines."
  • Cloud Data Fusion - "Fully managed, cloud-native data integration platform."
  • Stitch - "Stitch is a cloud-first, open source platform for rapidly moving data. A simple, powerful ETL service, Stitch connects to all your data sources – from databases like MySQL and MongoDB, to SaaS applications like Salesforce and Zendesk – and replicates that data to a destination of your choosing."

Big Data (Hadoop Stack)

  • Pig - "a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs."
  • Spark - "a fast and general-purpose cluster computing system. It provides high-level APIs in Scala, Java, and Python that make parallel jobs easy to write, and an optimized engine that supports general computation graphs. It also supports a rich set of higher-level tools including Shark (Hive on Spark), MLlib for machine learning, GraphX for graph processing, and Spark Streaming."

ETL Tools (GUI)

Warning: If you're already familiar with a scripting language, GUI ETL tools are not a good replacement for a well structured application written with a scripting language. These tools lack flexibility and are a good example of the "inner-platform effect". With a large project, you will most likely run into instances where "the tool doesn't do that" and end up implementing something hacky with a script run by the GUI ETL tool. Also, the GUI can conceal complexity and the files these tools generate are impossible to code review. However, the GUI and out-of-the-box functionality can make some tasks simpler, especially for people not comfortable with writing code.

  • Apache NiFi - "a rich, web-based interface for designing, controlling, and monitoring a dataflow."
  • Informatica PowerCenter - "a toolset for establishing and maintaining enterprise-wide data warehouses. It has a customer base of over 5,000 companies."
  • Microsoft SSIS - "a component of the Microsoft SQL Server database software that can be used to perform a broad range of data migration tasks."
  • Pentaho Kettle - The most popular open-source graphical ETL tool.
  • Talend - "an open source application for data integration job design with a graphical development environment"
  • N8n - "Free and open fair-code licensed node based Workflow Automation Tool. Easily automate tasks across different services."
  • CDAP - "Use Cask Data Application Platform to visually build and manage data applications in hybrid and multi-cloud environments."

原文:https://github.com/pawl/awesome-etl

标签