embulk

Installing Maven-style Embulk plugins

Since Embulk v0.11.0 was released a year ago, we have pushed the new Maven-style Embulk plugins rather than the legacy RubyGems-style plugins.

See also: Embulk v0.11 is coming soon: JRuby

We recently started to provide a couple of methods to install the Maven-style Embulk plugins more easily, which was not very easy in the beginning of Maven-style plugins, indeed.

This article is a brief introduction of the new plugin installation methods.

Revisit: Embulk home

Embulk now has a concept of the “Embulk home” directory, which is a directory to contain embulk.properties and Embulk plugin installations. The Maven-style Embulk plugins will also be installed in the Embulk home directory.

See again: Embulk v0.11 is coming soon: Embulk home

#1: Embulk’s built-in subcommand install

Embulk v0.11.3 introduced a new Embulk subcommand: embulk install, instead of embulk gem install for RubyGems-style plugins. This subcommand takes a Maven artifact notation as its argument. The example below installs org.embulk:embulk-input-s3:0.6.0 from Maven Central.

$ java -jar embulk-0.11.3.jar install "org.embulk:embulk-input-s3:0.6.0"
...
...
2024-06-13 15:46:11.537 +0900 [INFO] (main): The path "/home/user/.embulk/lib/m2/repository" (m2_repo) does not exist. Creating it as a directory.
2024-06-13 15:46:11.619 +0900 [INFO] (main): No alternative remote Maven repositories are specified. Downloading artifacts from Maven Central.
2024-06-13 15:46:11.633 +0900 [INFO] (main): Downloading org.embulk:embulk-input-s3:pom:0.6.0 from https://repo.maven.apache.org/maven2
2024-06-13 15:46:12.725 +0900 [INFO] (main): Downloaded org.embulk:embulk-input-s3:pom:0.6.0 at /home/user/.embulk/lib/m2/repository/org/embulk/embulk-input-s3/0.6.0/embulk-input-s3-0.6.0.pom
2024-06-13 15:46:12.776 +0900 [INFO] (main): Downloading com.amazonaws:aws-java-sdk-s3:pom:1.11.466 from https://repo.maven.apache.org/maven2
2024-06-13 15:46:13.027 +0900 [INFO] (main): Downloaded com.amazonaws:aws-java-sdk-pom:pom:1.11.466 at /home/user/.embulk/lib/m2/repository/com/amazonaws/aws-java-sdk-pom/1.11.466/aws-java-sdk-pom-1.11.466.pom
...
...
2024-06-13 15:46:14.857 +0900 [INFO] (main): Downloading org.embulk:embulk-input-s3:jar:0.6.0 from https://repo.maven.apache.org/maven2
2024-06-13 15:46:14.857 +0900 [INFO] (main): Downloading com.amazonaws:aws-java-sdk-s3:jar:1.11.466 from https://repo.maven.apache.org/maven2
...
...
2024-06-13 15:46:15.720 +0900 [INFO] (main): Downloaded org.embulk:embulk-input-s3:jar:0.6.0 at /home/user/.embulk/lib/m2/repository/org/embulk/embulk-input-s3/0.6.0/embulk-input-s3-0.6.0.jar
2024-06-13 15:46:15.721 +0900 [INFO] (main): Downloaded com.amazonaws:aws-java-sdk-s3:jar:1.11.466 at /home/user/.embulk/lib/m2/repository/com/amazonaws/aws-java-sdk-s3/1.11.466/aws-java-sdk-s3-1.11.466.jar
...
...
2024-06-13 15:46:15.730 +0900 [INFO] (main): Installed org.embulk:embulk-input-s3:jar:0.6.0 at /home/user/.embulk/lib/m2/repository/org/embulk/embulk-input-s3/0.6.0/embulk-input-s3-0.6.0.jar
2024-06-13 15:46:15.730 +0900 [INFO] (main): Installed com.amazonaws:aws-java-sdk-s3:jar:1.11.466 at /home/user/.embulk/lib/m2/repository/com/amazonaws/aws-java-sdk-s3/1.11.466/aws-java-sdk-s3-1.11.466.jar
...
...

This subcommand downloads also the dependencies of the specified Maven artifact transitively as you can see in the example above.

Note that you can change the destination Embulk home directory by Embulk’s standard options. See the example below.

$ java -jar embulk-0.11.3.jar -Xembulk_home=/tmp/foo install "org.embulk:embulk-input-s3:0.6.0"
...

$ env EMBULK_HOME=/tmp/bar java -jar embulk-0.11.3.jar install "org.embulk:embulk-input-s3:0.6.0"
...

It now supports only Maven Central as the remote repository, unfortunately.

#2: Out-of-Embulk Embulk plugin installer

Embulk has had the mkbundle subcommand and the -b option so that users can maintain plugin installations by Gemfile, but it works only for RubyGems-style plugins, of course.

The Gradle org.embulk.runset plugin is an alternative for Maven-style Embulk plugin. It works out of the Embulk package at all.

To use this, set up an environment for Gradle at first. Gradle 8.7 is at least required. You may want to choose the Gradle wrapper in typical use-cases.

Next, write build.gradle to declare the Maven-based Embulk plugins you wanted to install.

plugins {
    id "org.embulk.runset" version "0.2.0"  // Just apply this Gradle plugin.
}

repositories {
    mavenCentral()
}

installEmbulkRunSet {
    // Set your Embulk home directory (absolute path) to install the Embulk plugins and "embulk.properties".
    embulkHome file("/home/user/my-embulk-home")

    // Specify the Maven-style Embulk plugin by the "artifact" directive.
    artifact "org.embulk:embulk-input-s3:0.6.0"

    // You can specify multiple versions of the same Embulk plugin so that you can choose the version at runtime.
    // You can also specify an artifact with the split-style notation.
    artifact group: "org.embulk", name: "embulk-input-s3", version: "0.5.3"

    // Specify this if you need JRuby.
    // It downloads jruby-complete-9.1.15.0.jar, and set the "jruby" Embulk System Property in "embulk.properties".
    jruby "org.jruby:jruby-complete:9.1.15.0"

    // Specify this if you need to set some Embulk System Properties manually.
    // It sets the "key" Embulk System Property to "value" in "embulk.properties".
    embulkSystemProperty "key", "value"
}

Then, run gradle installEmbulkRunSet (./gradlew when you use the Gradle wrapper) to set up.

$ ./gradlew installEmbulkRunSet

> Configure project :
Supplied embulkHome "/home/user/my-embulk-home" does not exist, then will be created.
Setting to copy org.embulk:embulk-input-s3:0.6.0:jar into org/embulk/embulk-input-s3/0.6.0
Setting to copy com.amazonaws:aws-java-sdk-s3:1.11.466:jar into com/amazonaws/aws-java-sdk-s3/1.11.466
...
...
Setting to copy org.embulk:embulk-input-s3:0.5.3:jar into org/embulk/embulk-input-s3/0.5.3
...
...
Setting to copy org.embulk:embulk-input-s3:0.5.3:pom into org/embulk/embulk-input-s3/0.5.3
...
...
Setting to copy org.jruby:jruby-complete:9.1.15.0:jar into org/jruby/jruby-complete/9.1.15.0

BUILD SUCCESSFUL in 2s
1 actionable task: 1 executed

The Embulk System Properties file embulk.properties is automatically generated in the specified Embulk home, too.

#Generated by the "org.embulk.embulk-runset" Gradle plugin.
#Thu Jun 13 16:53:31 JST 2024
key=value
jruby=file\:///home/user/my-embulk-home/lib/m2/repository/org/jruby/jruby-complete/9.1.15.0/jruby-complete-9.1.15.0.jar

Run!

In either style of installation, you can run Embulk with the installed Maven-style Embulk plugins.

See the example s3_with_maven.yml below.

in:
  # The full-style type notation for Maven-style Embulk plugins.
  type:
    source: maven
    group: org.embulk
    name: s3
    version: 0.6.0
  bucket: ...
  parser:
    type: csv
    ...
out:
  type: stdout

Then, run Embulk!

$ java -jar embulk-0.11.4.jar -Xembulk_home=/home/user/my-embulk-home run s3_with_maven.yml
2024-06-13 17:01:55.373 +0900 [INFO] (main): embulk_home is set from command-line: /home/user/my-embulk-home
2024-06-13 17:01:55.378 +0900 [INFO] (main): m2_repo is set as a sub directory of embulk_home: /home/user/my-embulk-home/lib/m2/repository
2024-06-13 17:01:55.378 +0900 [INFO] (main): gem_home is set as a sub directory of embulk_home: /home/user/my-embulk-home/lib/gems
2024-06-13 17:01:55.378 +0900 [INFO] (main): gem_path is set empty.
2024-06-13 17:01:55.378 +0900 [DEBUG] (main): Embulk system property "default_guess_plugin" is set to: "gzip,bzip2,json,csv"
2024-06-13 17:01:55.634 +0900 [INFO] (main): Started Embulk v0.11.4
2024-06-13 17:01:55.811 +0900 [INFO] (0001:transaction): Loaded plugin embulk-input-s3 (maven:org.embulk:s3:0.6.0)
...
...
2024-06-13 17:01:55.948 +0900 [INFO] (0001:transaction): Loaded plugin embulk-output-stdout
...
...
2024-06-13 17:01:56.052 +0900 [INFO] (0001:transaction): Loaded plugin embulk-parser-csv
...
...
2024-06-13 17:01:56.691 +0900 [INFO] (0001:transaction): Start listing file with prefix [******]
2024-06-13 17:01:57.577 +0900 [INFO] (0001:transaction): Found total [1] files
2024-06-13 17:01:57.721 +0900 [INFO] (0001:transaction): Using local thread executor with max_threads=16 / output tasks 8 = input tasks 1 * 8
2024-06-13 17:01:57.759 +0900 [INFO] (0001:transaction): {done:  0 / 1, running: 0}
...
...
1,foo
2,bar
3,baz
2024-06-13 17:01:58.602 +0900 [INFO] (0001:transaction): {done:  1 / 1, running: 0}
2024-06-13 17:01:58.603 +0900 [INFO] (0001:transaction): Incremental job, setting last_path to [******.csv]
2024-06-13 17:01:58.618 +0900 [INFO] (0001:transaction): Embulk system property "plugins.output.stdout" is not set.
2024-06-13 17:01:58.619 +0900 [INFO] (0001:transaction): Embulk system property "plugins.default.output.stdout" is not set.
2024-06-13 17:01:58.621 +0900 [INFO] (main): Committed.
2024-06-13 17:01:58.629 +0900 [INFO] (main): Next config diff: {"in":{"last_path":"******.csv"},"out":{}}

We hope those installation methods will help you.