Constructing Datasets via CLI
Prerequisites
A quick note: All examples assume you’ve checked out Aloha from GitHub and built it from source. This is so that you can reuse data in the testing code. You can download and build via:
git clone git@github.com:eHarmony/aloha.git
cd aloha
# Publish to the local Ivy repository; the `+` prefix cross-builds
# for every configured Scala version:
sbt +clean +publishLocal

# ...or publish one Scala version at a time:
sbt ++2.10.5 clean publishLocal
sbt ++2.11.8 clean publishLocal

# Publish to the local Maven repository:
sbt +clean +publishM2

# ...or, per Scala version:
sbt ++2.10.5 clean publishM2
sbt ++2.11.8 clean publishM2
Get CLI Jar and aloha-cli script
In reality, you’ll probably not want to download the source and build from scratch just to use Aloha to create a dataset. You can get the CLI jar a few different ways. For instance:
- If you have the Aloha source code, just do an mvn clean install and look for the aloha-cli-[X.Y.Z]-jar-with-dependencies.jar in $HOME/.m2/repository/com/eharmony/aloha-cli/[X.Y.Z]/aloha-cli-[X.Y.Z]-jar-with-dependencies.jar (see the glob below).
- If you are using a project that has an Aloha Maven dependency, aloha-cli-[X.Y.Z]-jar-with-dependencies.jar may already be present at $HOME/.m2/repository/com/eharmony/aloha-cli/[X.Y.Z]/aloha-cli-[X.Y.Z]-jar-with-dependencies.jar.
- If you are not currently using Aloha in a project but just want the CLI functionality, you can download the aloha-cli-[X.Y.Z]-jar-with-dependencies.jar from Maven Central: click the jar-with-dependencies.jar link there to download the latest Jar file with all of the dependencies included.
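In the first two cases, the jar sits in your local Maven repository; a shell glob locates it without having to know the exact version:

ls $HOME/.m2/repository/com/eharmony/aloha-cli/*/aloha-cli-*-jar-with-dependencies.jar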
Copy the CLI script to some place on your system path:
# Download the script and make it executable.  /usr/local/bin is only an
# example destination; any directory on your PATH works.
curl -fsSL https://raw.githubusercontent.com/eHarmony/aloha/master/aloha-cli/bin/aloha-cli \
  -o /usr/local/bin/aloha-cli
chmod +x /usr/local/bin/aloha-cli
Getting Acquainted
A Look at the CLI
Input
aloha-cli/bin/aloha-cli
Output
usage: aloha-cli -cp /path/to/some.jar:/path/to/other.jar:... [args to CLI]

So, it’s clear that some jar files need to be specified on the classpath to make the CLI work. Luckily, when Aloha is built, the aloha-cli module has a jar that includes all of the necessary dependencies. This can be found automatically with the following shell script magic, which looks in the target directory of the aloha-cli module for a jar with dependencies.
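If you’ll be running the CLI repeatedly, it may be convenient to capture that jar path once in a shell variable (the variable name here is our own):

CLI_JAR=$(find aloha-cli -name "*jar-with-dependencies.jar")
aloha-cli/bin/aloha-cli -cp "$CLI_JAR"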
Let’s try running again with the proper jar on the classpath.
Input
aloha-cli/bin/aloha-cli                                    \
  -cp $(find aloha-cli -name "*jar-with-dependencies.jar")
Output
No arguments supplied. Supply one of: '--dataset', '--modelrunner', '--vw'.

Now the CLI gets a little further.  Let’s choose the --dataset option (since we’re making a dataset).
Input
aloha-cli/bin/aloha-cli                                    \
  -cp $(find aloha-cli -name "*jar-with-dependencies.jar") \
  --dataset
Output
Error: Missing option --spec
Error: No output dataset type provided.  Provide at least one of: vw, vw_labeled, vw_cb, libsvm, libsvm_labeled, csv
dataset [ SOME ALOHA VERSION HERE ]
Usage: dataset [options]
  --cachedir <value>
        a cache directory
  --parallel <value>
        a list of Apache VFS URLs additional jars to be included on the classpath
  -s <value> | --spec <value>
        Apache VFS URL to a JSON specification file containing attributes of the dataset being created.
  -p <value> | --proto-input <value>
        canonical class name of the protocol buffer type to use.
  -c <value> | --csv-input <value>
        Apache VFS URL to JSON file specifying the structure of the CSV input.
  -i <value> | --in <value>
        Apache VFS URL to the input file.  If not supplied, STDIN will be used.
  --vw <value>
        produce an unlabeled VW dataset and place the output in the specified location.
  --vw_labeled <value>
        produce a labeled VW dataset and place the output in the specified location.
  --vw_cb <value>
        produce a contextual bandit VW dataset and place the output in the specified location.
  --libsvm <value>
        produce an unlabeled LIBSVM dataset and place the output in the specified location.
  --libsvm_labeled <value>
        produce a labeled LIBSVM dataset and place the output in the specified location.
  --csv <value>
        produce a CSV dataset and place the output in the specified location.
  --csv-headers
        Produce headers in CSV output.
  --csv-header-file <value>
        Write CSV headers to the designated file.
So, we’re getting somewhere. Now that we have the lay of the land, let’s create some small datasets for real.
Examples
CSV Input, Transformed CSV output
Command and Results
Input
# Creating an Aloha cache directory provides a speed up 
# when running multiple times.  
#
# The cache directory must exist if it's specified.
# 
mkdir -p /tmp/aloha-cache 2>/dev/null
# Read 2 rows of data from STDIN.
(
cat <<EOM
MALE,175,
FEMALE,,books|films|chinese food
EOM
) |\
aloha-cli/bin/aloha-cli                               \
  -cp $(find aloha-cli -name "*.jar" | grep dep)      \
  --dataset                                           \
  --cachedir /tmp/aloha-cache                         \
  -c $(find $PWD/aloha-core/src -name 'csv_types1.js')\
  -s $(find $PWD/aloha-core/src -name 'csv_spec2.js') \
  --csv -
Output
MALE,170,0
FEMALE,NULL,3
Files
csv_types1.js
{
  "fs": ",",
  "ifs": "|",
  "missingData": "NULL",
  "errorOnOptMissingField": false,
  "errorOnOptMissingEnum": false,
  "columns": [
    { "name": "gender", "type": "enum",
      "className": "a.b.Gender", 
      "values": [ "MALE", "FEMALE" ] },
    { "name": "weight", "type": "int", 
      "optional": true },
    { "name": "likes", "type": "string", 
      "vectorized": true }
  ]
}
csv_spec2.js
{
  "separator": ",",
  "nullValue": "NULL",
  "encoding": "regular",
  "imports":[],
  "features": [
    {"name": "gender", "spec":"${gender}" },
    {"name": "weight", "spec":"${weight} / 10 * 10" },
    {"name": "num_likes", "spec":"${likes}.size" }
  ]
}
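Before dissecting the arguments, note how MALE’s weight of 175 came out as 170 in the output: the weight spec ${weight} / 10 * 10 is evaluated as Scala code over an integer, so the division truncates and weights are bucketed into 10-unit bins. Shell arithmetic truncates the same way:

echo $(( 175 / 10 * 10 ))   # prints 170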
Let’s look at the CLI arguments and the structure of the auxiliary files we provided to examine what happened. We supplied:
- --cachedir /tmp/aloha-cache: This is useful for avoiding recompilation of Aloha features across calls to the CLI. Try to specify a cache directory when possible. The directory must exist.
- -c $(find $PWD/aloha-core/src -name 'csv_types1.js'): describes the structure of the CSV INPUT data used to construct the dataset. This isn’t necessary when using Protocol Buffer input (the -p flag).
- -s $(find $PWD/aloha-core/src -name 'csv_spec2.js'): describes the OUTPUT format of the dataset.
- --csv -: This option says that we want to create a CSV dataset. The - at the end of the option indicates that the output should be directed to standard output (STDOUT).
External CSV input format description
csv_types1.js is a JSON file describing the CSV input structure of the original data we are transforming.  These 
files have 6 fields.
- fs: The associated value is a string describing the field separator. For instance, with comma-delimited data, the value would be ","; for tab-separated data, it would be "\t".
- ifs: The intra-field separator. This is used when fields are vectorized, to separate the values of a vectorized column.
- missingData: This is the string that represents a missing value. For instance, NULL was used in the above example. If we set the second line of the input in the above example to FEMALE,NULL,books|films|chinese food, everything still works.
- errorOnOptMissingField: a Boolean telling whether to produce an error for the row when an optional field is missing. Currently, an error in an output line causes the line not to be included in the resulting dataset, but the dataset creation process is not halted. false is recommended since it’s more forgiving; it allows rows with a missing value to be included in the dataset.
- errorOnOptMissingEnum: Similar to errorOnOptMissingField, this field tells whether to produce an error when an unknown value for an enumerated type is provided in the input data.
- columns: An array of column definitions. The order in which the columns appear in the array defines the expected column order in the CSV input.
Each element of the columns array is a JSON object with the following fields:
- name: This is the name of the field. This is important not just for documentation but for use in the output specification as well.
- type: one of { boolean, double, enum, float, int, long, string }.
- optional: a Boolean value; true if the column’s value might NOT appear in the input data. The default when not provided is false.
- vectorized: a Boolean value; true if the column contains zero or more values. When this is set to true, the ifs value is used to split the column into a vector of values. When vectorized is not provided, its default value is false. (A minimal example follows this list.)
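For instance, a minimal column definition that relies on both defaults (the column name here is hypothetical):

{ "name": "age", "type": "int" }

is, per the defaults above, equivalent to:

{ "name": "age", "type": "int", "optional": false, "vectorized": false }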
Protocol Buffer input, Vowpal Wabbit output
This example will show off a bunch of features not covered in the CSV Input, Transformed CSV output example. We’ll see how to use Protocol Buffer input data and output to a different dataset type, namely Vowpal Wabbit. We’ll also see how to specify an input and an output file, rather than relying on STDIN and STDOUT.
Let’s start by looking at the fields in the data that we’ll be working with in this example. To do so, just cat the User.proto file:
Input
cat aloha-core/src/test/proto/User.proto
Output
package com.eharmony.aloha.test.proto;
option java_outer_classname="Testing";
enum GenderProto {
    MALE   = 1;
    FEMALE = 2;
}
message PhotoProto {
    required int64  id = 1;
    optional int32 height = 2;
    optional double aspect_ratio = 3;
}
message UserProto {
    required int64 id = 1;
    optional string name = 2;
    optional GenderProto gender = 3;
    optional float bmi = 4;
    repeated PhotoProto photos = 5;
}
Now let’s create some data and then process it.
Input
Create temporary input and output files:
INFILE=$(mktemp -t example_input)
OUTFILE=$(mktemp -t example_output)
Fill the input file with base64-encoded UserProto data. This is the data that will be provided to the data scientist for analysis. Aloha is used to decode and transform the serialized data.
(
cat <<EOM
CAESBEFsYW4YASUAALhBKg0IARABGQAAAAAAAPA/Kg0IAhACGQAAAAAAAABA
CAESBEthdGUYAioNCAMQAxkAAAAAAAAIQA==
EOM
) >> $INFILE
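As an optional sanity check (assuming protoc is installed; this is not needed by the CLI), you can decode one of these records by hand. protoc --decode_raw reads a binary message from STDIN and prints its fields by tag number:

head -n 1 $INFILE | base64 --decode | protoc --decode_raw
# (on macOS, older base64 builds use -D instead of --decode)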
Run the CLI.  Note we need the second entry in the -cp (classpath) flag because it is the jar that contains 
the protocol buffer definitions.
aloha-cli/bin/aloha-cli                              \
  -cp $(find aloha-cli -name "*.jar" | grep dep):\
$(find aloha-core -name "*.jar" | grep test)         \
  --dataset                                          \
  -i $INFILE                                         \
  -s $(find $PWD/aloha-core/src -name 'proto_spec1.js')   \
  -p com.eharmony.aloha.test.proto.Testing.UserProto \
  --cachedir /tmp/aloha-cache                        \
  --vw $OUTFILE
Clean up temp files and print the output.
rm -f $INFILE
cat $OUTFILE
rm -f $OUTFILE
Output
| name=Alan gender=MALE bmi:23 num_photos:2
| name=Kate gender=FEMALE bmi=UNK num_photos
Again, let’s look at the CLI arguments. We’ll only describe the parameters that differ from the CSV input example:
- -i $INFILE: This is an input file containing the data.
- -p com.eharmony.aloha.test.proto.Testing.UserProto: Tells the CLI that the input will be protocol buffer data with the structure described by the com.eharmony.aloha.test.proto.Testing.UserProto class.
- --vw $OUTFILE: output an unlabeled Vowpal Wabbit dataset and put it in the file pointed to by the $OUTFILE shell variable. Obviously, there’s no need for a variable here; we are just executing a shell command. The key point is that the value containing the output file location is an Apache VFS URL (see the sketch below).
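Because the output location is an Apache VFS URL, an explicit scheme should also work. For instance, this sketch uses the local-file scheme and should be equivalent to the bare path above:

--vw file://$OUTFILE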
(Output) specification files
All CLI-based dataset creation involves a specification file for the output type.  In the 
CSV example, we used 
csv_spec2.js.
In the Protocol Buffer example, we used
proto_spec1.js.
Note that these files have the same features array format, but some of the other fields vary.  This is because there
are innate differences in the output type of columns in each row emitted by the dataset creator.  For instance, each
column of a CSV dataset is a scalar value, so each row is effectively a dense vector.  In the case of
Vowpal Wabbit output, the covariate data is inherently sparse, so key-value pairs are emitted instead.
Common fields
- imports: an array of strings that will be imported into scope for each feature definition. Wildcard imports are specified with an underscore. For instance, a common import one should consider is "com.eharmony.aloha.feature.BasicFunctions._" (an example imports array follows this list).
- features: an array containing JSON objects that describe the features to be produced. More on this later.
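For example, an imports array might look like the following (the scala.math._ entry is a hypothetical extra, shown only to illustrate multiple wildcard imports):

"imports": [
  "com.eharmony.aloha.feature.BasicFunctions._",
  "scala.math._"
]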
CSV-specific fields
- separator: the column delimiter. "," for comma-delimited data, "\t" for tab-delimited.
- nullValue: the string value to assign to a column when its data is missing. "" or "NULL", for instance.
- encoding: how categorical variables are encoded. Currently, this can be regular or hotOne. Regular encoding just creates a one-dimensional string representation of the value. Hot-one encoding is a binarized vector representation of the feature.
features
Each feature has three values:
- name: the name of the feature.
- spec: how to compute the feature (Scala code, with the imports in the imports array in scope).
- defVal: a default value. This field is optional. If omitted, the nullValue value will be used if necessary in the CSV case, and an empty sequence of key-value pairs will be used in the case of sparse formats like VW and LIBSVM.
A word on sparse formats
Each feature in a sparse dataset type outputs a sequence of key-value pairs; there can be zero or more.  If the
sequence size for each feature is exactly zero or one, the same specification file can be used for both dense
(CSV) and sparse (VW, LIBSVM) dataset creation.  By importing com.eharmony.aloha.feature.BasicFunctions._, a
conversion mechanism is brought into scope.  So when specifying a feature like:
{ 
  "name": "some_feature_name", 
  "spec": "${some.extracted.scalar.number}"
}
and assuming a datum passed into the dataset CLI has a value of 7 associated with the field
some.extracted.scalar.number, Aloha can automatically generate a sequence of one key-value pair:
[ "some_feature_name" -> 7 ].  In VW output, this would render as some_feature_name:7, just like the bmi:23 feature seen earlier.
Multiple dataset formats
Additionally, the dataset creator ignores JSON fields in the specification file that aren’t used by the desired dataset type. This means that if we provide a union of the fields used by multiple dataset creators, we can use one specification file to generate two or more datasets of different types at once. We just need to make sure the destination files differ. For instance:
Create the files again (like above):
INFILE=$(mktemp -t example_input)
OUTFILE_CSV=$(mktemp -t example_output)
OUTFILE_VW=$(mktemp -t example_output)
Create the protocol buffer input again (like above):
(
cat <<EOM
CAESBEFsYW4YASUAALhBKg0IARABGQAAAAAAAPA/Kg0IAhACGQAAAAAAAABA
CAESBEthdGUYAioNCAMQAxkAAAAAAAAIQA==
EOM
) >> $INFILE
Run the CLI. Here we create two datasets simultaneously from the same protocol buffer input data and the same specification file, csv_AND_vw_spec.js. We give headers to the CSV file. Everything else is the same.
aloha-cli/bin/aloha-cli                                     \
  -cp $(find aloha-cli -name "*.jar" | grep dep):\
$(find aloha-core -name "*.jar" | grep test)                \
  --dataset                                                 \
  -i $INFILE                                                \
  -s $(find $PWD/aloha-core/src -name 'csv_AND_vw_spec.js') \
  -p com.eharmony.aloha.test.proto.Testing.UserProto        \
  --cachedir /tmp/aloha-cache                               \
  --vw_labeled  $OUTFILE_VW                                 \
  --csv $OUTFILE_CSV                                        \
  --csv-headers
cat $OUTFILE_CSV
Output
csv_label,gender,bmi,num_photos,avg_photo_height
1,0,23.0,2,1
1,1,NULL,1,3
cat $OUTFILE_VW
Output
1 1|ignored csv_label |personal bmi:23 |photos num_photos:2 avg_photo_height
1 1|ignored csv_label |personal gender |photos num_photos avg_photo_height:3