Getting Started (Data Scientists)
This section will walk through downloading and building Aloha as well as doing some simple tasks in the command line interface (CLI) like generating datasets, creating models and using models to make predictions.
Build Prerequisites
Aloha uses SBT for building.
Installing SBT on a Mac
The preferred way to install SBT on a Mac is by using the Homebrew package manager. If you have Homebrew installed, just install via:
brew install sbt
Installing Homebrew on a Mac
If you don’t have Homebrew installed, follow the instructions at http://brew.sh.
Installing SBT Manually
If you prefer to install SBT manually, follow the instructions on the SBT website at http://www.scala-sbt.org.
Get Aloha from Source
git clone git@github.com:eHarmony/aloha.git
cd aloha
Installing to Ivy Local Cache
This can be accomplished via one of the following. Choose the one most appropriate for you.
sbt ++2.11.8 clean publishLocal # For Scala 2.11 version
sbt ++2.10.5 clean publishLocal # For Scala 2.10 version
sbt +clean +publishLocal # For all Scala versions
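Once the artifacts are published locally, another sbt project can depend on them. Below is a minimal sketch; the organization ("com.eharmony") and the version placeholder are assumptions, so check them against the values declared in Aloha's own build before relying on them.
build.sbt
// Consuming project's build definition.
// NOTE: the organization and version below are assumptions; verify them against Aloha's build.sbt.
scalaVersion := "2.11.8"

libraryDependencies += "com.eharmony" %% "aloha-core" % "<locally-published-version>"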
Installing to Maven Local Repository
Similarly, you can publish to your local Maven repository via one of the following.
sbt ++2.11.8 clean publishM2 # For Scala 2.11 version
sbt ++2.10.5 clean publishM2 # For Scala 2.10 version
sbt +clean +publishM2 # For all Scala versions
Generating a Local Copy of the Documentation
Prerequisites
Aloha uses sbt-microsites to generate site documentation. To build the site locally, you'll need Jekyll. There are a bunch of ways to get Jekyll, but the easiest is with one of the following:
gem install jekyll # Via ruby gems installer
yum install jekyll # via Yum, typically on RedHat and variants
apt-get install jekyll # via apt-get, typically on Debian and variants
Site Generation
sbt makeMicrosite && jekyll serve -s docs/target/site -d docs/target/site/_site
Then you should be able to open a browser to http://localhost:4000/aloha/ to see the documentation. To stop the web server provided by Jekyll, just press ctrl-c.
Create a VW Dataset
NOTE: CLI examples require running sbt test:compile prior to execution. This is necessary because Aloha no longer publishes a CLI uberjar, so the Aloha class files need to be on the classpath.
eHarmony uses Vowpal Wabbit for many of its predictive tasks so Aloha provides support for VW including dataset creation and native VW model creation on the JVM.
bash script
# The classpath argument contains the protocol buffer definitions.
#
aloha-cli/bin/aloha-cli \
-cp 'aloha-io-proto/target/scala-2.11/test-classes' \
--dataset \
-s $(find $PWD/aloha-cli/src -name 'proto_spec2.json') \
-p 'com.eharmony.aloha.test.proto.Testing.UserProto' \
-i $(find $PWD/aloha-core/src -name 'fizz_buzzs.proto')\
--vw_labeled /tmp/dataset.vw
/tmp/dataset.vw
1 1| name=Alan gender=MALE bmi:23 |photos num_photos:2 avg_photo_height
1 1| name=Kate gender=FEMALE bmi=UNK |photos num_photos avg_photo_height:3
Let’s break down the preceding example. -cp was specified to provide additional classpath elements. This argument is currently required but can be '' if no additional classpath elements are needed. In this example, we included the directory that contains the necessary protocol buffer class files for com.eharmony.aloha.test.proto.Testing.UserProto.
After the -cp flag always comes the subtask. Here the subtask is creating a dataset, denoted by the --dataset flag. To see the other available subtasks, you can omit the subtask flag and the subsequent parameters, and the Aloha CLI will provide sensible, context-sensitive error messages.
-s tells where to get the Aloha feature specification file that will be used to extract data from the protocol buffer instances provided as the raw data. This file can be local or remote and is an Apache VFS URL.
-p, along with the canonical class name, indicates that the input type of the raw data is line-separated, base64-encoded protocol buffer input.
-i tells where the raw input data resides. If the -i argument is omitted, the CLI reads from standard input. This is nice when you want to pipe input to the Aloha CLI from another process. Again, if supplied, Aloha expects the argument to be an Apache VFS URL.
Finally, --vw_labeled tells Aloha to generate a labeled VW dataset and output it to /tmp/dataset.vw. As with other *nix commands, you can provide - as a value and the CLI will output to standard out.
Create a VW dataset programmatically
import java.io.File
import scala.util.Try
import scala.concurrent.ExecutionContext.Implicits.global
import com.google.protobuf.GeneratedMessage
import com.eharmony.aloha.reflect.RefInfo
import com.eharmony.aloha.dataset.{RowCreator, RowCreatorProducer, RowCreatorBuilder}
import com.eharmony.aloha.dataset.vw.labeled.VwLabelRowCreator
import com.eharmony.aloha.semantics.compiled.CompiledSemantics
import com.eharmony.aloha.semantics.compiled.compiler.TwitterEvalCompiler
import com.eharmony.aloha.semantics.compiled.plugin.proto.CompiledSemanticsProtoPlugin
import com.eharmony.aloha.test.proto.Testing.UserProto
import com.eharmony.aloha.dataset.MissingAndErroneousFeatureInfo
def getRowCreator[T <: GeneratedMessage : RefInfo, S <: RowCreator[T]](
    producers: List[RowCreatorProducer[T, S]],
    alohaJsonSpecFile: File,
    alohaCacheDir: Option[File] = None): Try[S] = {
  val plugin = CompiledSemanticsProtoPlugin[T]
  val compiler = TwitterEvalCompiler(classCacheDir = alohaCacheDir)
  val imports: Seq[String] = Nil // Imports for UDFs to use in extraction functions.
  val semantics = CompiledSemantics(compiler, plugin, imports)
  val specBuilder = RowCreatorBuilder(semantics, producers)
  // There are many other factory methods for various input types: VFS, Strings, etc.
  specBuilder.fromFile(alohaJsonSpecFile)
}
val alohaRoot = new File(new File(".").getAbsolutePath.replaceFirst("/aloha/.*$", "/aloha/"))
val myAlohaJsonSpecFile =
  new File(alohaRoot, "aloha-cli/src/test/resources/com/eharmony/aloha/cli/dataset/proto_spec2.json")
// No cache dir in this example.
val alohaCacheDir: Option[File] = None
// Try[VwLabelRowCreator[UserProto]]
val creatorTry = getRowCreator(List(new VwLabelRowCreator.Producer[UserProto]),
                               myAlohaJsonSpecFile,
                               alohaCacheDir)
// Throws if the creator wasn't produced. This might be desirable for
// short-circuiting in initialization code.
val creator = creatorTry.get
// Create a dataset row with an instance of UserProto.
// In real usage, get a real sequence.
val users: Seq[UserProto] = Nil
val results: Seq[(MissingAndErroneousFeatureInfo, CharSequence)] =
  users.map(u => creator(u))
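As a follow-up to the example above, the generated rows can then be written to a file, much like the CLI does with the --vw_labeled flag. This is just a sketch using the results value from above and standard java.io classes; the output path is arbitrary.
import java.io.PrintWriter

// Write each generated VW line to a file (one row per input instance).
val writer = new PrintWriter(new File("/tmp/dataset_programmatic.vw"))
try {
  results.foreach { case (_, row) => writer.println(row) }
} finally {
  writer.close()
}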
Dataset types
Aloha currently supports creating the following types of datasets:
vw, vw_labeled, vw_cb, libsvm, libsvm_labeled, csv
- Unlabeled VW datasets using the --vw flag
- Labeled VW datasets using the --vw_labeled flag
- Contextual Bandit VW datasets using the --vw_cb flag
- Unlabeled LIBSVM datasets using the --libsvm flag
- Labeled LIBSVM datasets using the --libsvm_labeled flag
- CSV datasets using the --csv flag
One can generate multiple compatible dataset types simultaneously from the same input by including each dataset type flag along with the file to which that dataset should be output. To output to standard out, use the filename -. Note that only one dataset can be output to a given file; outputting two or more datasets to the same output file has undefined behaviour.
Create a VW Model
Given our dataset in /tmp/dataset.vw, we can create a VW model with the normal procedure. For instance, to create a simple logistic regression model with all default parameters and one pass over the data, do:
vw -d /tmp/dataset.vw --link logistic --loss_function logistic --readable_model /tmp/model_readable.vw -f /tmp/model.vw
This creates the binary model /tmp/model.vw and a human-readable model /tmp/model_readable.vw.
Verifying the Model
vw -d /tmp/dataset.vw --loss_function logistic -i /tmp/model.vw -t
output
only testing
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = /tmp/dataset.vw
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.310707   0.310707            1         1.0   1.0000   0.7329        6
0.296281   0.281855            2         2.0   1.0000   0.7544        6

finished run
number of examples per pass = 2
passes used = 1
weighted example sum = 2.000000
weighted label sum = 2.000000
average loss = 0.296281
best constant = 1.000000
best constant's loss = 0.313262
total feature number = 12
Creating an Aloha Model
To create an Aloha model, we need two things. The first is the specification file used to create the dataset. The second is the binary VW model. To build the model, we just use the CLI again.
bash script
aloha-cli/bin/aloha-cli \
-cp '' \
--vw \
--vw-args "--quiet -t" \
--spec $(find $PWD/aloha-cli/src -name 'proto_spec2.json') \
--model /tmp/model.vw \
--name "test-model" \
--id 101 \
| tee /tmp/aloha-vw-model.json
This prints JSON similar to the following to STDOUT. The key-value pairs below have been rearranged and whitespace added for readability, but under the JSON specification the result is equivalent to the JSON printed by the above command.
/tmp/aloha-vw-model.json
{
"modelType": "VwJNI",
"modelId": { "id": 101, "name": "test-model" },
"features": {
"name": { "spec": "ind(${name})", "defVal": [["=UNK", 1]] },
"gender": { "spec": "ind(${gender})", "defVal": [["=UNK", 1]] },
"bmi": { "spec": "${bmi}", "defVal": [["=UNK", 1]] },
"num_photos": "${photos}.size",
"avg_photo_height": "{ val hs = ${photos.height}; hs.flatten.sum / hs.filter(_.nonEmpty).size }"
},
"namespaces": {
"photos": [
"num_photos",
"avg_photo_height"
]
},
"vw": {
"model": "[ base64-encoded data omitted ]",
"params": "--quiet -t"
}
}
Inline Models Cannot Exceed 2GB
Since JVM arrays are indexed by 32-bit integers, and JVM Strings are backed by arrays, the largest String available on the JVM has 2^31 - 1, or about 2 billion, characters. Consequently, large base64-encoded binary models embedded into the JSON will cause failures, and the resulting errors may be easy or hard to diagnose. For instance, one of the more subtle errors is attempting to create an array of negative size. This would most likely arise from a large model whose size in bytes is a number larger than a 32-bit integer can represent.
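To see why an oversized embedded model surfaces as a negative array size, recall that JVM array lengths are 32-bit Ints. A minimal illustration of the narrowing behavior (the byte count here is hypothetical):
// A hypothetical model size just past the 2 GiB limit.
val modelSizeInBytes: Long = Int.MaxValue.toLong + 1L   // 2147483648

// Narrowing to Int, as an array allocation requires, wraps around to a negative
// number, which is how negative array-size errors arise.
val requestedArrayLength: Int = modelSizeInBytes.toInt  // -2147483648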
To overcome this, specify the --external
flag when creating a model and provide an Apache VFS URL to the model.
When the Aloha model is parsed, the underlying machine learned model resource will be retrieved using this URL.
Aloha Model Prediction via CLI
To perform predictions using the command line interface, one needs to use the --modelrunner flag. There are many options for this. The most important ones here are:
- -A, which appends the input after the predictions. Since no separator was provided, TAB is used to separate the predictions and the input data. It's easy to change the separator using the --outsep flag.
- --output-type, which determines the output type of the model. This is important to get right because types may be coerced, and this can produce strange results. For instance, if a model is used to predict probabilities and the output type is an integral type, the values returned will likely all be 0. This is because coercion from real-valued to integrally valued numbers is done by dropping the decimal places, so any value in the interval [0, 1) is truncated to 0 (see the small illustration after this list).
- --imports, which adapts the DSL by importing JVM code.
- -p, which says that base64-encoded protocol buffers will be used as the input format.
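The truncation mentioned for --output-type is just ordinary JVM numeric narrowing. A quick illustration:
// Converting a probability to an integral type drops the decimal places,
// so every value in [0, 1) collapses to 0.
val probability = 0.7329283952713013
val asInt = probability.toInt  // 0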
bash script
cat $(find $PWD/aloha-core/src -name 'fizz_buzzs.proto') \
| aloha-cli/bin/aloha-cli \
-cp "$PWD/aloha-io-proto/target/scala-2.11/test-classes" \
--modelrunner \
--output-type Double \
-A \
--imports "scala.math._,com.eharmony.aloha.feature.BasicFunctions._" \
-p "com.eharmony.aloha.test.proto.Testing.UserProto" \
/tmp/aloha-vw-model.json
output
0.7329283952713013    CAESBEFsYW4YASUAALhBKg0IARABGQAAAAAAAPA/Kg0IAhACGQAAAAAAAABA
0.7543833255767822    CAESBEthdGUYAioNCAMQAxkAAAAAAAAIQA==
If you don’t want the input, just omit the -A
flag or pipe and process the data elsewhere.
Sanity checking the Aloha model
Notice that the outputs here are:
- 0.7329283952713013
- 0.7543833255767822
We saw in the Verifying the Model section that the predictions were:
- 0.7329
- 0.7544
These line up. It appears our model is working!
Future plans
The two main areas of future development are
- Adding broad classes of IO types such as Thrift, Scala case classes, etc.
- Adding additional ML libraries underneath the Aloha model facade.
Ways to extend to ML libraries not natively supported
One can think of this as being analogous to Hadoop Streaming. Aloha can be integrated with other platforms by using it for feature transformation and dataset production. This is an easy path for the data scientist, as it alleviates the burden of extracting and transforming features, especially when extracting values from Protocol Buffers or Avro.