Getting Started (Data Engineers)
This guide is for data engineers who may be tasked with integrating Aloha into a scoring or prediction process, such as a prediction web service.
For the purpose of constructing prediction pipelines, there are four main components:
- Semantics give meaning to features that appear in models.
- Auditors translate raw model predictions to an output type easily consumable by the model caller.
- Factories produce models. Factories are parametrized by a semantics and an auditor.
- Models make predictions.
Semantics
Models can be created directly in Aloha but the preferred way to create a model is to specify model definitions in JSON and use model factories to interpret the JSON to produce a model. A BMI feature in the JSON model definition might look like the following:
"703 * ${weight} / pow(${height}, 2)"
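This expression is just the standard imperial BMI formula (weight in pounds, height in inches); as plain Scala, it is equivalent to:

```scala
// Plain-Scala equivalent of the BMI feature expression above
// (weight in pounds, height in inches).
def bmi(weight: Double, height: Double): Double =
  703 * weight / math.pow(height, 2)

println(bmi(150, 70)) // approximately 21.52
```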
CompiledSemantics
To interpret this feature, a Semantics instance is needed. In Aloha, semantics implementations are parametrized on the model input type, typically indicated by the type parameter A in the Aloha code. The most common, and presumably the most powerful, type of semantics is the CompiledSemantics, which performs code generation and on-the-fly compilation at model creation time. The compiled semantics is responsible for parsing and emitting code for the entire feature definition (like the one for BMI above).
Notice in the feature above that there are two variables on which the BMI calculation is based: height and weight. The compiled semantics in Aloha is modularized so that most of the machinery can be reused, but the interpretation of variables in a feature varies by domain. Therefore, there are CompiledSemanticsPlugins for various input classes including CSV data, Protocol Buffer data, Avro data, etc.
Note: Compiling features into functions is a rather heavyweight process, both time- and memory-intensive. Once the model factory has created all of the models, it is recommended not to hold a reference to the factory. This disclaimer appears here because it is the CompiledSemantics that embeds the Scala compiler, making it the most expensive component.
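One way to honor that recommendation is to confine the factory to a local scope so that, once the models are built, nothing retains the embedded compiler and it becomes eligible for garbage collection. A minimal sketch of the pattern, where the `parse` function is a hypothetical stand-in for a real model factory:

```scala
// Sketch: use the (hypothetical) expensive parser only inside this
// function; afterward no reference to it survives, so the memory
// held by the compiler can be reclaimed.
def buildModels[M](jsonDefs: Seq[String], parse: String => M): Vector[M] = {
  val models = jsonDefs.map(parse).toVector // force eager evaluation
  models
}

// Usage with a stand-in parser:
val models = buildModels(Seq("{}", "{}"), (s: String) => s.length)
```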
Creating CompiledSemanticsPlugin for Avro
Compiling Avro definitions to Java classes via Avro's avro-tools generates classes that implement both SpecificRecord and GenericRecord. Aloha's CompiledSemanticsAvroPlugin is based on GenericRecord. While GenericRecord's interface is untyped, Aloha uses a provided Avro Schema to provide a typed interface. So as long as the data adheres to the schema, Aloha will have type information it can use to ensure features are well-typed.
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import com.eharmony.aloha.semantics.compiled.plugin.avro.CompiledSemanticsAvroPlugin
// Get an Avro Schema from somewhere.
val schema: Schema = getSchema()
// Here we use GenericRecord but a more specific class can be used if desired.
val plugin = CompiledSemanticsAvroPlugin[GenericRecord](schema)
Creating CompiledSemanticsPlugin for Protocol Buffers
Compiling Protocol Buffers via protoc (2.4.1) yields Java classes that extend GeneratedMessage. Aloha accepts types extending GeneratedMessage and can extract the type information without needing an external schema. This can be accomplished via:
import com.eharmony.aloha.semantics.compiled.plugin.proto.CompiledSemanticsProtoPlugin
import com.eharmony.aloha.test.proto.TestProtoBuffs.TestProto // Some protobuf class
val plugin = CompiledSemanticsProtoPlugin[TestProto]
Creating CompiledSemanticsPlugin for CSV Data
To define a compiled semantics plugin for CSV data, we need to specify a field name to type mapping. Field order isn't specified here because it's unnecessary at this point; it is defined by the data itself. For more information, see the In-Depth Walkthrough.
import com.eharmony.aloha.semantics.compiled.plugin.csv.CompiledSemanticsCsvPlugin
import com.eharmony.aloha.semantics.compiled.plugin.csv.CsvTypes.{IntType, FloatOptionType}
val plugin = CompiledSemanticsCsvPlugin(
"height" -> IntType, // Required integer field
"weight" -> FloatOptionType // Optional float field
)
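The declared types dictate what the generated feature functions see: IntType surfaces a required Int, while FloatOptionType surfaces an Option[Float] that is None when the field is empty. Conceptually, the situation looks like the sketch below (the case class is purely illustrative, not Aloha's API):

```scala
// Illustrative only: how the declared CSV types map onto Scala types.
case class ProfileRow(height: Int, weight: Option[Float])

val withWeight    = ProfileRow(70, Some(150.5f))
val missingWeight = ProfileRow(70, None) // optional field absent

// Features over optional fields must handle the None case:
def bmiOpt(r: ProfileRow): Option[Double] =
  r.weight.map(w => 703.0 * w / math.pow(r.height, 2))
```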
Creating CompiledSemantics from a CompiledSemanticsPlugin
Once a compiled semantics plugin is created, we pass it to CompiledSemantics. This is rather easy; only a few additional things are needed:
- a class cache directory (optional, but suggested)
- a compiler instance
- imports to be injected into features: These will typically be determined by data scientists.
import com.eharmony.aloha.semantics.compiled.CompiledSemantics
import com.eharmony.aloha.semantics.compiled.compiler.TwitterEvalCompiler
// If provided, needs to be an existing directory.
val cacheDir: Option[java.io.File] = getCacheDir()
val compiler = TwitterEvalCompiler(classCacheDir = cacheDir)
// BasicFunctions is not necessary but it's like Aloha's Predef.
// It is advisable to import this. Additional import statements
// can be included to provide additional functionality, UDFs, etc.
val imports = Seq(
"com.eharmony.aloha.feature.BasicFunctions._"
)
// An implicit execution context is needed by the semantics.
import concurrent.ExecutionContext.Implicits.global
val semantics = CompiledSemantics(compiler, plugin, imports)
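Since the class cache directory must already exist when provided, a small helper (hypothetical, not part of Aloha) can create it on demand before handing it to the compiler:

```scala
import java.io.File

// Hypothetical helper: return Some(dir) only if the directory exists
// or could be created; otherwise None, so no cache is used.
def ensureCacheDir(path: String): Option[File] = {
  val f = new File(path)
  if (f.isDirectory || f.mkdirs()) Some(f) else None
}
```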
Auditor
Models are designed to return predictions in a data structure that is easy for the calling environment to consume. Internally, models may produce predictions in some primitive format like 32-bit floats, but when the predictions are returned by the model, they are boxed into some kind of container. This is because models may fail, in which case a prediction may be impossible. To differentiate between valid predictions and failures, Aloha DOES NOT use sentinel or default values for primitive types.
Auditors are the component responsible for boxing scores into some kind of container. Since models in Aloha can be nested to form hierarchical models, auditors need to handle this recursive structure. Therefore, auditors deal with trees of data including the prediction itself as well as additional information about missing features, error messages and other diagnostic information.
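To make that recursive shape concrete, an audited score can be pictured as a tree node carrying the (possibly absent) value plus diagnostics. The case class below is a simplified illustration, not Aloha's actual audit types:

```scala
// Simplified illustration of the tree of data an auditor manages.
case class AuditedScore[A](
  value: Option[A],                 // None when the (sub)model failed
  missingFeatures: Seq[String],     // variables that couldn't be resolved
  errors: Seq[String],              // diagnostic messages
  subScores: Seq[AuditedScore[A]]   // scores of nested submodels
)

val leaf = AuditedScore(Some(0.7), Nil, Nil, Nil)
val root = AuditedScore[Double](None, Seq("height"), Seq("missing feature"), Seq(leaf))
```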
The three main auditors are the AvroScoreAuditor, the Protocol Buffer ScoreAuditor, and the OptionAuditor, each covered below. Many more are possible; for instance, there's a TreeAuditor in aloha-core's src/test/scala directory that encodes scores in trees.
Creating an AvroScoreAuditor
import com.eharmony.aloha.audit.impl.avro.AvroScoreAuditor
// Specify the type parameter. It could be something other than Double.
val optAuditor: Option[AvroScoreAuditor[Double]] = AvroScoreAuditor[Double]
val auditor = optAuditor.get
Creating a Protocol Buffer ScoreAuditor
import com.eharmony.aloha.audit.impl.proto.ScoreAuditor
// Specify the type parameter. It could be something other than Double.
val optAuditor: Option[ScoreAuditor[Double]] = ScoreAuditor[Double]
val auditor = optAuditor.get
Creating an OptionAuditor
import com.eharmony.aloha.audit.impl.OptionAuditor
// Specify the type parameter. It could be something other than Double.
val auditor = OptionAuditor[Double]()
ModelFactory
Model factories provide the mechanism to parse an external JSON model definition. To create one, we need a semantics and an auditor, as explained above. Once we have these, creating a model factory is easy.
ModelFactory.defaultFactory finds all model definitions on the class path (via class path scanning) and allows these types of models to be parsed.
import com.eharmony.aloha.factory.ModelFactory
val modelFactory = ModelFactory.defaultFactory(semantics, auditor)
Models
Models are the components that make predictions. They are easy to create, given a model factory:
import com.eharmony.aloha.audit.impl.OptionAuditor
import com.eharmony.aloha.factory.ModelFactory
import com.eharmony.aloha.semantics.compiled.CompiledSemantics
import com.eharmony.aloha.semantics.compiled.compiler.TwitterEvalCompiler
import concurrent.ExecutionContext.Implicits.global
import com.eharmony.aloha.semantics.compiled.plugin.csv.CompiledSemanticsCsvPlugin
import com.eharmony.aloha.semantics.compiled.plugin.csv.CsvTypes.IntType
import com.eharmony.aloha.semantics.compiled.plugin.csv.CsvLines
val modelFile = {
import java.io.File
val alohaRoot = new File(new File(".").getAbsolutePath.replaceFirst("/aloha/.*$", "/aloha/"))
new File(alohaRoot, "aloha-core/src/test/resources/fizzbuzz.json")
}
val cacheDir: Option[java.io.File] = None // or an existing directory, as above
val compiler = TwitterEvalCompiler(classCacheDir = cacheDir)
val imports = Seq("com.eharmony.aloha.feature.BasicFunctions._", "scala.math._")
val plugin = CompiledSemanticsCsvPlugin("profile.user_id" -> IntType)
val semantics = CompiledSemantics(compiler, plugin, imports)
val auditor = OptionAuditor[Double]()
val modelFactory = ModelFactory.defaultFactory(semantics, auditor)
Get Model from ModelFactory
scala> // There are many other methods that retrieve models from InputStreams, Apache VFS, etc.
| val modelTry = modelFactory.fromFile(modelFile) // wrapped in a scala.util.Try
modelTry: scala.util.Try[com.eharmony.aloha.models.Model[com.eharmony.aloha.semantics.compiled.plugin.csv.CsvLine,Option[Double]]] = Success(<function1>)
scala> val model = modelTry.get
model: com.eharmony.aloha.models.Model[com.eharmony.aloha.semantics.compiled.plugin.csv.CsvLine,Option[Double]] = <function1>
Predict with Model
Models extend Scala’s Function1 trait, so they have an apply method. Since .apply can be dropped in Scala code, we can call the prediction function very easily:
val Seq(x1, x2) = CsvLines(Map("profile.user_id" -> 0))(Seq("1", "2"))
scala> val y1 = model(x1)
y1: Option[Double] = Some(1.0)
scala> val y2 = model(x2)
y2: Option[Double] = Some(2.0)
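Because this model was built with an OptionAuditor, a failed prediction (for example, due to missing features) comes back as None rather than a sentinel value, so the caller should pattern match on the result. A minimal sketch of the caller's side:

```scala
// Handling the Option-boxed prediction on the caller's side.
def describe(prediction: Option[Double]): String = prediction match {
  case Some(score) => s"score: $score"
  case None        => "no prediction (model failure or missing data)"
}

describe(Some(1.0)) // "score: 1.0"
describe(None)      // "no prediction (model failure or missing data)"
```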