Model Formats

Prerequisites

Common Required JSON Fields

  • modelType is a string defining the type of model the JSON in the same object block represents. These values are predefined within the model definitions and won’t change without a major version change since Aloha uses semantic versioning. See a model description below to determine the associated modelType field.
  • modelId is a JSON object that acts as an identifier for models. It is represented by an object with two fields id, a 64-bit integer and name, a string.

Models Should be Immutable

NOTE: While Aloha doesn’t place any special importance of the uniqueness of these model IDs and doesn’t assign any special semantic meaning to these fields, Aloha users SHOULD place special importance on the uniqueness of the numeric id.

This is just a suggestion but it will pay huge dividends in terms of cleaner data collection:

Models should for practical purposes be immutable. This means that whenever a model is edited in such a way that the model functionally changes, the model’s modelId’s id field should be changed And all ancestor models’ modelId’s id field should change.

Model Types

Categorical distribution model

A Categorical distribution model returns pseudo-random values based on a designated Categorical distribution. It can be used for collecting random data on which future models can be trained. Its sampling is constant time, regardless of the number of values in the distribution.

(CD) JSON Fields

Field Name JSON Type JSON Subtype Required Default
modelType String N / A true N / A
modelId Object N / A true N / A
features Array String true N / A
probabilities Array Number true N / A
labels Array OUTPUT true N / A
missingOk Boolean N / A false false

(CD) JSON Field Descriptions

(CD) modelType

modelType field must be CategoricalDistribution.

(CD) features

features is an array of strings, where each string contains an Aloha feature specification. Each of these strings should just contain a variable definition. These features are used to create a hash on which the pseudo-randomness is based. It can be thought of as the statistical unit for the experiment being run.

An example value might be:

[ "${profile.user_id}", "1" ] 

where the second element, 1, acts as a hash salt. When using unique hash salts for every hash, one can avoid interaction in multiple hashes. For more information on salting including an example of issues that can arise, see Salts and Seeds in Testing in an Unsure World by Ryan Deak.

(CD) probabilities

probabilities is an array of non-negative numbers. These values, once normalized, will provide the probability of selecting one of the associated labels at the same index in the parallel labels array. While these values needn’t be normalized in the JSON, it may help to supply normalized values if exact normalized probabilities can be provided. If exact values cannot be provided, it’s OK to provide approximations, so long as the ratios are the same as the final normalized probability vector. For instance, the following are both perfectly fine, but I would opt for the second:

[ 1, 1, 1 ]
[ 0.33, 0.33, 0.33 ]

(CD) labels

labels are the actual values returned when a value is sampled from the distribution. The type of value in labels is dependent on the model’s output type. For instance, if the model output type is a string, then the JSON value type in labels will be strings. If the output type is a 64-bit float, the JSON type in the labels array will be Number.

(CD) missingOk

missingOk determines whether the model should return a value when at least one of the values in features could not be determined. If false, no score will be returned. If true, then a constant will be used in place of the feature value and will be incorporated into the hash.

(CD) JSON Examples

This example samples values { 1, 2, 3, 4 } based on a user ID and the number of days since Jan 1, 1970. This means that for any day, regardless of how many times the model is called, the same user will always get the same value. This value can however change from day to day.

{
  "modelType": "CategoricalDistribution",
  "modelId": { "id": 3, "name": "model with E[X] = 3" },
  "features": [ "${profile.id}", "${calculated_values.days_since_epoch}", "1" ],
  "probabilities": [ 0.1, 0.2, 0.3, 0.4 ] ,
  "labels": [ 1, 2, 3, 4 ],
  "missingOk": false
}

Imagine a situation where Alice and Bob are talking to each other and each of them is equally likely to start a conversation with the other person. But with a one percent probability Manny will intercept the message and will forward his own message to the person from whom the message did not originate. This model can probabilistically model the sender of the message that was received by either Alice or Bob.

{
  "modelType": "CategoricalDistribution",
  "modelId": { "id": 1, "name": "man-in-the-middle-model" },
  "features": [ "${conversation.id}", "2" ],
  "probabilities": [ 0.495, 0.495, 0.01 ] ,
  "labels": [ "Bob", "Alice", "Manny" ],
  "missingOk": true
}

Constant model

A constant model always returns the same value.

(C) JSON Fields

Field Name JSON Type JSON Subtype Required Default
modelType String N / A true N / A
modelId Object N / A true N / A
value OUTPUT N / A true N / A

(C) JSON Field Descriptions

(C) modelType

modelType field must be Constant.

(C) value

value field’s type is dependent on the model’s output type.

(C) JSON Examples

Always return 1. Whether 1 is a 32-bit integer, 64-bit integer, 32-bit or 64-bit float is dependent on the output type of the model.

{
  "modelType": "Constant",
  "modelId": { "id": 1, "name": "model that always returns 1" },
  "value": 1
}

Always return "awesome". This model will parse assuming the output type of the model is String.

{
  "modelType": "Constant",
  "modelId": { "id": 1, "name": "model that always returns 'awesome'" },
  "value": "awesome"
}

Decision tree model

Decision trees are what one might expect from a decision tree: they encode the ability to return different values based on predicates that hold for a given input datum. Aloha encodes trees as a graph-like structure so that the code could be extended to DAGs in a straightforward way. The first node in the nodes array is considered to be the root node.

(DT) JSON Fields

Field Name JSON Type JSON Subtype Required Default
modelType String N / A true N / A
modelId Object N / A true N / A
returnBest Boolean N / A true N / A
missingDataOk Boolean N / A true N / A
nodes Array Node* true N / A

(DT) JSON Field Descriptions

(DT) modelType

modelType field must be DecisionTree.

(DT) returnBest

returnBest is a Boolean that when true let’s the decision tree return a value associated with an internal node when no further progress down the tree can be made.

When set to false, the model returns an error rather than a score.

Usually, in a decision tree, the inability to produce a path from the tree root to a leaf is indicative of a problem. The most common problem being that the branching logic at a given node is not exhaustive. This will happen when the tree algorithm checks if it can proceed to any of its children and the predicates associated with each child return false.

(DT) missingDataOk

missingDataOk is a Boolean telling whether missing data is OK. In the event that missingDataOk is set to false, when a variable in a node selector is missing, the node selector should stop and report to the decision tree that it can’t make a decision on which child to traverse. The subsequent behavior is dictated by returnBest.

If missingDataOk is true, then when a node selector encounters missing data, it can still recover. For instance, when a linear node selector encounters missing data in one of the predicates, it will assume the predicate evaluates to false and will continue on to the next predicate. This behavior may vary across node selectors of different types.

(DT) nodes

nodes contains the list of nodes in the decision tree. As was mentioned above, trees are encoded in a tabular way to provide future extensibility to decision DAGs. The first node in the nodes array is considered to be the root node. The structure of a node is as follows:

Field Name JSON Type JSON Subtype Required Default
id Number N / A true N / A
value OUTPUT N / A true N / A
selector Node Selector* N / A false N / A
(Node) id

id is the node id and is used by node selectors to determine which child to traverse. Note that this is independent of the model’s id.

(Node) value

value in nodes are the values potentially return by the decision tree when a given node is selected.

(Node) selector

selector is the data type responsible for determining which child to traverse. It is required for internal nodes but should not be included for leaf nodes. For the different type of node selectors, see the Node Selectors section.

Node Selectors

Node selectors are responsible for determining which child in the tree to traverse. Currently, the following types of node selectors exist:

Linear Node Selector

Linear node selectors work like an IF-THEN-ELSE statement. Children are ordered and a predicate is associated with each child. A linear search is performed to find the first predicate that yields a value of true. If no predicate yields true, then the decision tree split is said to be not exhaustive. In this case, the returnBest field of the decision tree determines the subsequent behavior. This selector algorithm acts in O(N) time where N is node’s the number of children.

Field Name JSON Type JSON Subtype Required Default
selectorType String N / A true N / A
children Array Number true N / A
predicates Array String true N / A
(linear ns) selectorType

selectorType must be linear.

(linear ns) children

children is a list of node id values. This array is parallel to the predicates array. If index i in predicates is the first predicate yielding true, then the value at index i in children will contain the id of the node that will be visited.

(linear ns) predicates

The predicates array contains strings where each string is an Aloha feature specification representing a Boolean function. This array is parallel to the children array. If index i in predicates is the first predicate yielding true, then the value at index i in children will contain the id of the node that will be visited.

Random Node Selector

A random node selector provides a way of pseudo-randomly splitting traffic between one or more outcomes. This selector algorithm acts in O(1) time regardless of the number of children.

Field Name JSON Type JSON Subtype Required Default
selectorType String N / A true N / A
children Array Number true N / A
probabilities Array Number true N / A
features Array String true N / A
(random ns) selectorType

selectorType must be random.

(random ns) children

children is a list of node id values. This array is parallel to the probabilities array. Samples drawn from the Categorical distribution backed by probabilities yield an index into the children array. The id at the chosen index in the children array is used to select the child to traverse.

Note: There is a special case when the length of children can be one less than the length of the probabilities array. In such a case, if returnBest is set to true then when the sampling from the categorical distribution selects last index in probabilities, the current node (and not a child node) is selected. See JSON examples for how this is useful.

(random ns) probabilities

probabilities is a (possibly unnormalized) probability vector. This array is parallel to the children array. Samples drawn from the Categorical distribution backed by probabilities yield an index into the children array. The id at the chosen index in the children array is used to select the child to traverse.

While probabilities will be normalized and it’s not strictly required to provide a normalized probabilities in probabilities, it is most likely more readable to provide normalized probabilities as long as the ratios can be encoded exactly. For instance

[ 0.33, 0.33, 0.33 ]

may be preferable to

[ 1, 1, 1 ]

What should not be done is to compromise the ratios for readability. For instance, if we want to encode probabilities of 1/3, we should NOT encode them as:

[ 0.33, 0.34, 0.33]
(random ns) features

features is an array of strings, where each string contains an Aloha feature specification. Each of these strings should just contain a variable definition. These features are used to create a hash on which the pseudo-randomness is based. It can be thought of as the statistical unit for the experiment being run.

An example value might be:

[ "${profile.user_id}", "1" ] 

where the second element, 1, acts as a hash salt. When using unique hash salts for every hash, one can avoid interaction in multiple hashes. For more information on salting including an example of issues that can arise, see Salts and Seeds in Testing in an Unsure World by Ryan Deak.

(DT) JSON Examples

(DT) One Node Tree Example

The simplest tree would be a one node tree that is equivalent to a Constant model:

{
  "modelType": "DecisionTree",
  "modelId": {"id": 0, "name": "one node tree"},
  "returnBest": false,
  "missingDataOk": false,
  "nodes": [ { "id": 1, "value": 1 } ]
}

(DT) Three Node Tree Example

A basic tree with one split can be encoded as follows (deeper trees can just repeat the pattern). It will output the string “short” when the height is less than 66 (presumably inches). If height is greater than or equal to 66, “tall” is returned. Note that the value field is related to the model output type so this model will only parse when the model output type is String.

{
  "modelType": "DecisionTree",
  "modelId": {"id": 0, "name": "height decision tree"},
  "returnBest": false,
  "missingDataOk": false,
  "nodes": [ 
    { 
      "id": 1, 
      "value": "This value won't be returned b/c returnBest and missingDataOk are both false.",
      "selector": { 
        "selectorType": "linear",
        "predicates": [ "${profile.height} < 66", "true" ],
        "children": [ 2, 3 ]
      }
    },
    { "id": 2, "value": "short" },
    { "id": 3, "value": "tall" }
  ]
}

(DT) Random Branching Example

In the following example, we have a randomized split based on user id where 90% of the time, the value 1 is returned and 10% of the time, -1 is returned. If the user ID is missing, the output value will be 0. This is just for illustration purposes. Usually, you’ll want to add a constant salt to features like [ "${profile.user_id}", "1" ].

{
  "modelType": "DecisionTree",
  "modelId": {"id": 0, "name": "90/10 random split"},
  "returnBest": true,
  "missingDataOk": false,
  "nodes": [ 
    { 
      "id": 1, 
      "value": 0,
      "selector": { 
        "selectorType": "random",
        "features": [ "${profile.user_id}" ],
        "children": [ 2, 3 ],
        "probabilities": [ 0.9, 0.1 ]
      }
    },
    { "id": 2, "value": 1 },
    { "id": 3, "value": -1 }
  ]
}

(DT) Short-Circuit Random Branching Example

As was mentioned above, you could have a situation where you want to do something randomly with some probability by making the children array 1 element shorter than the probabilities array. Let’s say that we want to do the same thing as in the previous example but are willing to give users missing a user ID a negative one. This allows us to essentially reuse the root node for randomization and missing data purposes. We can write that model as follows:

{
  "modelType": "DecisionTree",
  "modelId": {"id": 0, "name": "90/10 random split, user missing id get -1."},
  "returnBest": true,
  "missingDataOk": false,
  "nodes": [ 
    { 
      "id": 1, 
      "value": -1,
      "selector": { 
        "selectorType": "random",
        "features": [ "${profile.user_id}" ],
        "children": [ 2 ],
        "probabilities": [ 0.9, 0.1 ]
      }
    },
    { "id": 2, "value": 1 }
  ]
}

Double-to-Long model

The Double-to-Long model provides an affine transformation from Double values to Long values. It works by applying the following transformation:

v = scale × submodel value + translation
v′ = IF round THEN round(v) ELSE floor(v)
output = max(clampLower, min(v′, clampUpper))

This model is useful for eHarmony-specific purposes, but others may find it useful as well.

(DtL) JSON Fields

Field Name JSON Type JSON Subtype Required Default
modelType String N / A true N / A
modelId Object N / A true N / A
submodel Model* N / A true N / A
scale Number N / A false 1
translation Number N / A false 0
clampLower Number N / A false -∞
clampUpper Number N / A false
round Boolean N / A false false

(DtL) JSON Field Descriptions

(DtL) modelType

modelType field must be DoubleToLong.

(DtL) submodel

submodel must either be a JSON object containing an Aloha model or it can be a JSON object with exactly one field, import whose associated value is a JSON string containing an Apache VFS URL. For instance:

{ "import": "/path/to/aloha/model.json" }

Either way, the submodel is expected to be a model that takes the same input type as the Double-to-Long model itself and the submodel should have an output type of Double. The output to of the Double-to-Long model is obviously expected to be a 64-bit long-valued integer.

(DtL) scale

scale is the multiplier in the affine transformation. It can be any 64-bit float. When not supplied, the default value will be 1.

(DtL) translation

translation is the additive term in the affine transformation. It can be any 64-bit float. When not supplied, the default value will be 0.

(DtL) clampLower

clampLower is the lower clamp value. It is a 64-bit long-valued int. When not supplied, the default value will be -9,223,372,036,854,775,808.

(DtL) clampUpper

clampUpper is the upper clamp value. It is a 64-bit long-valued int. When not supplied, the default value will be 9,223,372,036,854,775,807.

(DtL) round

round is a Boolean that determines whether the affine-transformed value should be rounded (true) or floored (false).

(DtL) JSON Examples

(DtL) Minimal Example

A double-to-long model whose submodel is a constant model returning 5.5. The outer model floors 5.5 to the 64-bit long value 5.

{
  "modelType": "DoubleToLong",
  "modelId": { "id": 0, "name": "" },
  "submodel": {
    "modelType": "Constant",
    "modelId": { "id": 1, "name": "Constant model returning 5.5" },
    "value": 5.5
  }
}

(DtL) Full Example

A double-to-long model whose submodel is a constant model returning -13. The outer model multiplies by -0.5, adds 2, rounds, then takes the min of clampUpper (8) and 9, which comes out to 8.

{
  "modelType": "DoubleToLong",
  "modelId": { "id": 0, "name": "" },
  "clampLower": 6,
  "clampUpper": 8,
  "scale": -0.5,
  "translation": 2,
  "round": true,
  "submodel": {
    "modelType": "Constant",
    "modelId": { "id": 1, "name": "Constant model returning -13" },
    "value": -13
  }
}

Error model

An error model returns an error. This is not the same as throwing an exception. Error models return values.

(E) JSON Fields

Field Name JSON Type JSON Subtype Required Default
modelType String N / A true N / A
modelId Object N / A true N / A
errors Array String true Array[ "Error with unspecified reason." ]

(E) JSON Field Descriptions

(E) modelType

modelType field must be Error.

(E) errors

errors is an array of error strings to provide. This is an optional parameter. An empty array is permitted.

(E) JSON Examples

{
  "modelType": "Error",
  "modelId": { "id": 0, "name": "" },
}
{
  "modelType": "Error",
  "modelId": { "id": 0, "name": "" },
  "errors": [ "error 1", "error 2" ]
}

Error-swallowing model

Error swallowing model makes an attempt to trap exceptions and return them as errors instead. While this is not foolproof, it was tested against SchrodingerException, an exception specifically designed to be as harmful as possible. The implementation uses scala.util.Try which means errors that are caught are dictated by scala.util.control.NonFatal.

(ES) JSON Fields

Field Name JSON Type JSON Subtype Required Default
modelType String N / A true N / A
modelId Object N / A true N / A
submodel Model* N / A true N / A
recordErrorStackTraces Boolean N / A false true

(ES) JSON Field Descriptions

(ES) modelType

modelType field must be ErrorSwallowingModel.

(ES) submodel

submodel must either be a JSON object containing an Aloha model or it can be a JSON object with exactly one field, import whose associated value is a JSON string containing an Apache VFS URL. For instance:

{ "import": "/path/to/aloha/model.json" }

Either way, the submodel is expected to be a model with the same input and output type as the Error-swallowing model itself.

(ES) recordErrorStackTraces

recordErrorStackTraces is a boolean indicating whether to record stack traces in the returned error.

(ES) JSON Examples

{
  "modelType": "ErrorSwallowingModel",
  "modelId": { "id": 0, "name": "0" },
  "recordErrorStackTraces": true,
  "submodel": {
    "modelType": "Regression",
    "modelId": { "id": 1, "name": "1" },
    "features": {
      "evil_feature": "throw new RuntimeException"
    },
    "weights": {}
  }
}

Model decision tree model

Model decision tree models are just like Decision tree models exception that the value field must either be a JSON object containing an Aloha model or it can be a JSON object representing a model import. For instance:

{ "import": "/path/to/aloha/model.json" }

where import is a JSON string containing an Apache VFS URL.

The submodels, input, and output types are expected to be the same as the input and output type of the model decision tree model. The model works by using the decision tree algorithm to select a node containing a model, and then it applies the submodel it finds in the node to the same input data that was passed to the decision tree.

For more information, see the section on Decision tree models.

Regression model

The regression model implementation in Aloha is a sparse polynomial regression model. This is a superset of and therefore subsumes linear regression models.

(R) JSON Fields

Field Name JSON Type JSON Subtype Required Default
modelType String N / A true N / A
modelId Object N / A true N / A
features Object N / A true N / A
weights Object N / A true N / A
higherOrderFeatures Array Object false Empty Array
spline Array Object false Empty Array
numMissingThreshold Number N / A false

(R) JSON Field Descriptions

(R) modelType

modelType field must be Regression.

(R) features

features is the map of features that are included in the model. The keys in the features represent the feature names; the values can take on one of two forms. They can either be a String or a JSON object. If they are a JSON object, they must have the following format:

Field Name JSON Type JSON Subtype Required Default
spec String N / A true N / A
defVal Array Array (2 elements. 0: (String) feature name, 1: (Number) feature value) false Empty Array
(R) feature spec

spec contains an Aloha feature specification. These feature specifications produce functions whose input type is the same as the regression model’s input type and the output type is an iterable collection of (String, Double) key-value pairs. Specifically, the Scala type is scala.collection.Iterable[(String, Double)]. An example of such a spec string is:

"Seq((\"\", ${profile.height}))"

This creates one key-value pair whose key is the empty string and whose value is the height extract from a profile. One might ask, “why would the key be the empty string?” The answer is that the keys are prepended with the feature name. So, for instance, the above spec string might be included in a feature as follows:

{
  "modelType": "Regression",
  ...
  "features": {
    "height": { "spec": "Seq((\"\", ${profile.height}))" }
  }
  ...
}

Given the above model, let’s say the profile.height yields a value of 66. Then the feature key-value pairs produced would be: { ("height", 66) }.

There is a lot of supporting code available to the Aloha user when importing com.eharmony.aloha.feature.BasicFunctions._. Among these is an implicit conversion from numeric types to sequences of key-value pairs. This allows the user to write the previously mentioned Aloha function specification as:

"${profile.height}"

Aloha will take care of converting it to:

"Seq((\"\", ${profile.height}))"

assuming profile.height is an appropriate numeric type.

(R) feature defVal

NOTE: Ryan Deak, author of Aloha has expressed a personal distaste for the use of this feature. It can however create better models at the cost of readability and error reporting.

defVal is the value that is returned if the spec references a variable whose presence is optional and whose value is not present for the current model input. As with the Iterable of key-value pairs produced by the function created by spec, keys will have the feature name prepended in the final feature value (see below for an example).

defVal is a JSON Array of Arrays. The inner Array should contain a String followed by a Number. For instance, you may want to use something like the following:

[["=UNKNOWN", 1]]

eHarmony uses this in a lot of models. Let’s put this into context:

{
  "modelType": "Regression",
  ...
  "features": {
    "height": { "spec": "Seq((\"\", ${profile.height}))", "defVal": [["=UNKNOWN", 1]] }
  }
  ...
}

and imagine profile.height is an optional variable that is not present in the current model input. Then the model can’t produce the key-value for height feature and defaults to using defVal to produce: { ("height=UNKNOWN", 1) }. The reason we use the value 1 is that this is exactly the form of an indicator variable. That way, the regression model can have an associated weight in the β vector.

There is a consequence of providing a defVal. When the defVal field is provided and its value is the non-empty Array, this will affect the reporting of the number of missing features. For more information, see the numMissingThreshold section.

(R) spec (as a String)

As was mentioned above, features values in the regression model JSON can either be represented as a String or Object, the Object case is shown above. If you don’t provide a defVal in the above object, the specification Object can be replaced with just the spec value. For instance, the example from above:

{
  "modelType": "Regression",
  ...
  "features": {
    "height": { "spec": "Seq((\"\", ${profile.height}))" }
  }
  ...
}

could be rewritten as:

{
  "modelType": "Regression",
  ...
  "features": {
    "height": "Seq((\"\", ${profile.height}))"
  }
  ...
}

and with what we learned about imports, if com.eharmony.aloha.feature.BasicFunctions._ is imported, the JSON can be simplified further to:

{
  "modelType": "Regression",
  ...
  "features": {
    "height": "${profile.height}"
  }
  ...
}

This is easier to read and write and more aesthetically pleasing in general.

(R) weights

weights is a representation of the first-order feature weights in the regression β vector. By first-order, we mean terms of degree 1 in the polynomial expression which is being regressed in the model.

weights is a JSON Object where the keys are the same as the keys produced in the feature map. The values are the weights associated with the each feature value.

NOTE: Only if the feature values (the numbers in the feature key-value pairs) are standardized, can one infer feature importance from the magnitude of the weight values. This is because if the weights are not standardized, the scales of the feature values will differ and the weights will adapt to these scales.

An example of weights is:

{
  "modelType": "Regression",
  ...
  "weights": {
    "height": 1.23,
    "weight": -2.34
  }
  ...
}

(R) higherOrderFeatures

higherOrderFeatures encodes the weights associated with terms in the polynomial expression with degree greater than 1. This is optional and is used for polynomial regression models but is not necessary for linear regression models. It is an optional field. An example looks like:

{
  "modelType": "Regression",
  "features": {
    "m_ht_lt_63": "ind(${male.height} < 63)",
    "f_ht_gt_66": "ind(66 < ${female.height})",
  },
  ...
  "higherOrderFeatures": [
    { 
      "wt": -5.1, 
      "features": { 
        "m_ht_lt_63": ["m_ht_lt_63=true"], 
        "f_ht_gt_66": ["f_ht_gt_66=true"] 
      }
    },
    { 
      "wt": 1.2,
      "features": {
        "m_ht_lt_63": ["m_ht_lt_63=false"],
        "f_ht_gt_66": ["f_ht_gt_66=false"]
      }
    }
  ]
}
Field Name JSON Type JSON Subtype Required Default
wt Number N / A true N / A
features Object N / A true N / A

Given features with feature names “m_ht_lt_63” and “f_ht_gt_66”, where “m_ht_lt_63” is an indicator variable meaning male in a dyad has a height of less than 63 inches and “f_ht_gt_66” means the female’s height is greater than 66 inches, we see that when the male is on the shorter side and the female is taller, this has a disproportionally large negative effect versus when the male is either not that much shorter (≤ 3 inches) or is taller than the the female.

(R) higherOrderFeatures wt

We can tell this because the wt field in each higher-order features. This is the weight associated with the β for the feature. Within the features

(R) higherOrderFeatures features

features is a map where the keys are the feature names from the top-level features map in the regression model. The value associated with a key is a non-empty Array containing the feature keys from the features created by the top-level features map. That is, when a feature function produces the Iterable of (String, Double) key-value pairs, the String in the first index of the tuple is the String that goes in this Array. It is possible to have Arrays of length larger than two. This will happen most commonly occurs when the term of interest in the polynomial is some variable raised to a power.

For instance, imagine someone throwing a ball 80 mph (35.7632 m/s) at a 30° angle with a release point 8.25ft (2.5146 m) off the ground. The vertical velocity component is 17.8816 m/s = 0.5 × 35.7632. Using the equation from classical mechanics:

h(t) = 1/2gt2 + v0t + h0 Where g = -9.8 m/s

the height of the ball in meters could be encoded as:

{
  "modelType": "Regression",
  "modelId": { "id": 0, "name": "80mph throw at 30 degree angle" },
  "features": {
    "intercept": "intercept",
    "time": "${time}"
  },
  "weights": {
    "intercept": 2.5146,
    "time": 17.8816
  },
  "higherOrderFeatures": [
    { "wt": -4.9, "features": { "time": [ "time", "time" ] } } 
  ]
}

(R) spline

spline is an optional field representing a spline function. The spline is applied to the inner product of the feature vector, X and the &betal vector (); and acts like an inverse link function in GLMs. The reason a spline is provided rather than an explicit link is that oftentimes we want to calibrate our classifiers to ensure that the class probabilities yield accurate estimates. For more information see Charles Elkan’s and Bianca Zadrozny’s papers on model calibration or checkout Jan Hendrik Metzen’s blog post on calibration for a nice explanation.

Field Name JSON Type JSON Subtype Required Default
min Number N / A true N / A
max Number N / A true N / A
knots Array Number true N / A
(R) spline min

min is the minimum value in the domain of the spline. Values less than min will be clamped.

(R) spline max

max is the minimum value in the domain of the spline. Values greater than max will be clamped.

(R) spline knots

knots defines the piecewise linear spline. The knots are the values of the codmain. The associated domain values can be calculated from the min, max, and size of the knots Array. The size of knots must be at least two, unless in the special case min equals max, in which case knots is expected to contain exactly one element.

(R) numMissingThreshold

numMissingThreshold is an optional integral value. When supplied, if the number of features that yield empty Iterables (size zero) exceeds numMissingThreshold, then an error rather than a score will be returned.

(R) JSON Examples

(R) Basic Example

This is the most basic regression model there is, just an intercept term:

0.5 = βX = [0.5]T[1]T

{
  "modelType": "Regression",
  "modelId": { "id": 0, "name": "basic example" },
  "features": { "intercept": "intercept" },
  "weights": { "intercept": 0.5 }
}

There is a little bit going on here behind the scenes. The intercept function that appears as the sole value in the features map is a special function that is included when importing com.eharmony.aloha.feature.BasicFuncions._. It is in the com.eharmony.aloha.feature.Intercept trait that is mixed into BasicFuncions. What this function does is create one key-value pair: ("", 1.0). Then, as has been mentioned before, the feature name is prepended to the key, so the final key-value pair is ("intercept", 1.0). The 1.0 represents the bias term and the associated intercept in the weights map represents the model’s output when all covariate data is omitted.

(R) higherOrderFeatures Example

This is repeated from the higherOrderFeatures features section. It illusrates modeling the equation h(t) = 1/2gt2 + v0t + h0. See that section for more information.

{
  "modelType": "Regression",
  "modelId": { "id": 0, "name": "80mph throw at 30 degree angle" },
  "features": {
    "intercept": "intercept",
    "time": "${time}"
  },
  "weights": {
    "intercept": 2.5146,
    "time": 17.8816
  },
  "higherOrderFeatures": [
    { "wt": -4.9, "features": { "time": [ "time", "time" ] } } 
  ]
}

Segmentation model

Segmentation models take a submodel that returns a value with an imposed total ordering and then create a partition on the set of values the submodel produces. The segmentation model maps elements in the same induced equivalence class to a label unique to each class.

(S) JSON Fields

Field Name JSON Type JSON Subtype Required Default
modelType String N / A true N / A
modelId Object N / A true N / A
subModel Model* N / A true N / A
subModelOutputType String N / A true N / A
thresholds subModel Output subModel Output true N / A
labels Array Output true N / A

(S) JSON Field Descriptions

(S) modelType

modelType field must be Segmentation.

(S) subModel

submodel must either be a JSON object containing an Aloha model or it can be a JSON object that represents the import of an external model. The import field has an associated value which is a JSON string containing an Apache VFS URL. For instance:

{ "import": "/path/to/aloha/model.json" }

Either way, the submodel is expected to be a model that takes the same input type as the Segmentation model itself and the submodel should have an output type dictated by the subModelOutputType value.

(S) subModelOutputType

subModelOutputType determines the output type of the submodel. subModelOutputType must be one of:

  • Byte
  • Short
  • Int
  • Long
  • Float
  • Double
  • String

(S) thresholds

thresholds is the set of values that partitions the space into intervals. It should be sorted according to the natural total ordering associated with the output type of the submodel.
This Array must have one less element than the labels Array. If, for instance, the submodel produces a value “less than” the first value in thresholds, then the first element in labels is returned. If that’s not the case, the search continues until an element in thresholds is greater than the submodel output. If there is no such value, the last element in labels is returned.

(S) labels

labels contains the labels associated with each equivalence class induced by thresholds. The type of each label must be the same as the Segmentation model output type. labels must have one more element that thresholds. See thresholds section for more details.

(S) JSON Examples

The following shows a model whose output type is String, with a submodel that returns a value in { 1, 4, 5, 7 }.
The mapping from submodel value to Segmentation model value is:

subModel Segmentation
1 smallest
4 second smallest
5 second largest
7 largest
{
  "modelType": "Segmentation",
  "modelId": { "id": 0, "name": "" },
  "thresholds": [ 2, 5, 6 ],
  "labels": [ "smallest", "second smallest", "second largest", "largest" ],
  "subModelOutputType": "Int",
  "subModel": {
    "modelType": "CategoricalDistribution",
    "modelId": { "id": 1, "name": "" },
    "features": [ "${profile.id}", "1" ],
    "probabilities": [ 0.25, 0.25, 0.25, 0.25 ] ,
    "labels": [ 1, 4, 5, 7 ],
    "missingOk": true
  }
}

Vowpal Wabbit model

The Vowpal Wabbit model provides the feature extraction capabilities of an Aloha native Regression model, but delegates to VW’s JNI layer for prediction, which is much more powerful. Note that this shares a fair amount of code with Regression model so be sure to look at the docs there.

NOTE: This model is in module aloha-vw-jni, not aloha-core. To use, be sure to include the proper maven dependency:

<dependency>
  <groupId>com.eharmony</groupId>
  <artifactId>aloha-vw-jni</artifactId>
  <version>...</version>
</dependency>

VW does not need to be installed on the computer where the Aloha VW model is used. The VW JNI library contains operating system-specific versions of the VW library, so it will be included transitively when the aloha-vw-jni maven dependency is pulled in. Since VW is not binary compatible across versions, the trained VW binary model needs to be trained on the same VW version that is included in the aloha-vw-jni POM. Make sure to look for this dependency.

<dependency>
 <groupId>com.github.johnlangford</groupId>
 <artifactId>vw-jni</artifactId>
 <version>[THIS VERSION NEEDS TO MATCH THE VW USED TO TRAIN THE MODEL]</version>
</dependency>

(VW) JSON Fields

Field Name JSON Type JSON Subtype Required Default
modelType String N / A true N / A
modelId Object N / A true N / A
features Object N / A true N / A
vw Object N / A true N / A
namespaces Object N / A true N / A
numMissingThreshold Number N / A false N / A
notes Array String false N / A
spline Object N / A false N / A

(VW) JSON Field Descriptions

(VW) modelType

modelType field must be VwJNI.

(VW) features

See Regression model features for details.

(VW) vw

The vw Object has two main components. The first component is a representation of the binary VW model, or “initial regressor” in VW terminology. This is the model parameterization. The second component is the set of parameters used by VW. This includes things like the link function, prediction ranges, skip grams, namespace interactions, regularization, etc. To see descriptions of these parameters, see the VW wiki or run:

vw -h
Field Name JSON Type JSON Subtype Required Default
creationTime Number N / A false number of milliseconds since epoch for current time.
model String N / A false N / A
modelUrl String N / A false N / A
via String with one of: "file", "vfs1", "vfs2" N / A false N / A
params String or Array of String N / A false Empty String
(VW) vw creationTime

creationTime is optional and is used when copying the content of the binary VW model to the local disk on the computer where the Aloha model is run. When no value is supplied, the value will be set to java.lang.System.currentTimeMillis().

(VW) vw model

Exactly one of model or modelUrl is required. When model is provided, it should be a string with the base64-encoded content of the binary VW model. For instance, this may be something like the string:

BgAAADguMC4wAG0AAAAAAACAPxIAAAAAAAAAAAAAAAAAAAABAAAAAAA=

which might come from:

echo "" | vw -f model.vw 2>/dev/null && cat model.vw | base64 && rm -f model.vw
(VW) vw modelUrl

Exactly one of model or modelUrl is required. When modelUrl is provided, it should be a string contain an Apache VFS URL. This is the location from which the content will be copied to the local disk.

(VW) vw via

via is optional. It is a string and can be one of: file, vfs1, vfs2. It provides a way for Aloha users to use different versions of Apache VFS. This is useful because VFS provides a common interface to different file systems. Since VFS provides a plugin architecture, different plugins might be available to different versions of VFS. For instance, the HDFS plugin eHarmony uses (at the time of this writing) is a VFS 1 plugin.

via is only to be used in conjunction with the modelUrl field. When supplied with the model field, the value associated with the via field is ignored. File operations used to copy to the local disk the contents associated with the model field are java.io.File based and don’t make use of Apache VFS.

(VW) vw params

params is not required. It can either be specified as a string or an Array of strings (which are joined with a space in between each item). This is how VW parameters are specified.

(VW) namespaces

A JSON object representing a map from namespace names to an Array of feature names to be placed in that namespace. For example:

  "features": {
    "height_mm": "${profile.height_cm} * 10"
  },
  ...
  "namespaces": {
    "personal_features": [ "height_mm" ]
  }

(VW) numMissingThreshold

See Regression model numMissingThreshold for details.

(VW) notes

This is just an optional Array of Strings for documentation purposes.

(VW) spline

See Regression model spline for details.

(VW) JSON Examples

(VW) Base64 Encoded Model JSON Example

{
  "modelType": "VwJNI",
  "modelId": { "id": 0, "name": "" },
  "features": {
    "height_mm": "Seq((\"1800\", 1.0))"
  },
  "namespaces": {
    "personal_features": [ "height_mm" ]
  },
  "vw": {
    "params": "--quiet -t",
    "model": "[base64 encoded model string here]"
  }
}

(VW) Model URL JSON Example

{
  "modelType": "VwJNI",
  "modelId": { "id": 0, "name": "model name" },
  "features": {
    "height_mm": "Seq((\"1800\", 1.0))"
  },
  "notes": [
    "This is a note"
  ],
  "namespaces": {
    "personal_features": [ "height_mm" ]
  },
  "vw": {
    "params": ["--quiet", "-t"],
    "model": "file:///Users/xyz/model.vw"
  }
}

(VW) Model URL via ‘vfs1’ JSON Example

{
  "modelType": "VwJNI",
  "modelId": { "id": 0, "name": "model name" },
  "features": {
    "height_mm": "Seq((\"1800\", 1.0))"
  },
  "notes": [
    "This is a note"
  ],
  "namespaces": {
    "personal_features": [ "height_mm" ]
  },
  "vw": {
    "params": ["--quiet", "-t"],
    "model": "hdfs://some/path/to/data",
    "via": "vfs1"
  }
}

H2O model

H2O model is a wrapper around the H2O library that provides feature extraction and delegates to H2O for predictions. H2O provides many types of models. Users will most likely find the biggest benefit in using the non-linear model offerings.

NOTE: This model is in module aloha-h2o, not aloha-core. To use, be sure to include the proper maven dependency:

<dependency>
  <groupId>com.eharmony</groupId>
  <artifactId>aloha-h2o</artifactId>
  <version>...</version>
</dependency>

H2O does not need to be installed on the computer where the Aloha H2O model is used. Since H2O is a java library and the model is embedded in an Aloha model and is compiled on the fly, ensure that you are producing a model with an H2O version compatible with the one found in the aloha-h2o/pom.xml. Search for h2o.version in the pom.

(H) JSON Fields

Field Name JSON Type JSON Subtype Required Default
modelType String N / A true N / A
modelId Object N / A true N / A
features Object N / A true N / A
numMissingThreshold Number N / A false
model String N / A false N / A
modelUrl String N / A false N / A
via String with one of: "file", "vfs1", "vfs2" N / A false "vfs2"

(H) JSON Field Descriptions

(H) modelType

modelType field must be H2o.

(H) features

features is similar to features found in the Regression model. Like in the Regression model, the features JSON object represents a map from feature name to a extraction function. There are however differences between the two. Since the variable types in H2O can be either Double or String, the JSON has to allow for this. So the JSON for a feature can take on one of two forms. It can either be a JSON string or an object.

  1. If the feature is encoded as a string, it is assumed that the feature is a Double-based feature and that no default will be supplied to the H2O if any of the data used to compute the feature is missing.
  2. If the feature is encoded as an object, it must supply a type field with a value of either string or double. It must also have a spec field whose value is the feature specification. Additionally, an optional defVal field may be provided that determines the feature’s return value in the case that data required in the feature computation is missing. The default must be the same type as the one indicated in the type field. For instance, in the following features map, weight_1 and weight_2 have an equivalent meaning.
{
  "weight_1": "${u.wt}",
  "weight_2": { "type": "double",
                "spec": "${u.wt}" },
  "wt_w_def": { "type": "double",
                "spec": "${u.wt}",
                "defVal": 0 },
  "height":   { "type": "string",
                "spec": "if (${u.ht} < 84) \"UNDER_7\" else \"OVER_7\"" },
  "ht_w_def": { "type": "string",
                "spec": "if (${u.ht} < 84) \"UNDER_7\" else \"OVER_7\"",
                "defVal": "UNDER_7" }
}

(H) model

Exactly one of model or modelUrl is required. When model is provided, it should be a string with the base64-encoded content a generated H2O model.

(H) modelUrl

Exactly one of model or modelUrl is required. When modelUrl is provided, it should be a string contain an Apache VFS URL. This is the location from which the content will be copied to the local disk.

(H) via

via is optional. It is a string and can be one of: file, vfs1, vfs2. It provides a way for Aloha users to use different versions of Apache VFS. This is useful because VFS provides a common interface to different file systems. Since VFS provides a plugin architecture, different plugins might be available to different versions of VFS. For instance, the HDFS plugin eHarmony uses (at the time of this writing) is a VFS 1 plugin.

via is only to be used in conjunction with the modelUrl field. When supplied with the model field, the value associated with the via field is ignored. File operations used to copy to the local disk the contents associated with the model field are java.io.File based and don’t make use of Apache VFS.

(H) numMissingThreshold

numMissingThreshold is an optional integral value. When supplied, if the number of features no value exceeds numMissingThreshold, then an error rather than a score will be returned. Note that when a feature default is provided and that default is returned, it is not counted as a missing feature.

(H) JSON Examples

{
  "modelType": "H2o",
  "modelId": { "id": 0, "name": "proto model" },
  "notes": [
    "The features are all calculated in different varieties of an identity function to show off some of the power of Aloha"
  ],
  "features": {
    "Sex":            { "type": "string", "spec": "${sex}.name.substring(0,1)" },
    "Length":         "1d + ${length} - 1L",
    "Diameter":       "${diameter} * 1f",
    "Height":         "identity(${height})",
    "Whole weight":   "${weight.whole} * ${height} / ${height}",
    "Shucked weight": "pow(${weight.shucked}, 1)",
    "Viscera weight": "${weight.viscera} * (pow(sin(${diameter}), 2) + pow(cos(${diameter}), 2))",
    "Shell weight":   "${weight.shell} + log((${length} + ${height}) / (${height} + ${length}))",
    "Circumference (unused)":  "Pi * ${diameter}"
  },
  "modelUrl": "res:com/eharmony/aloha/models/h2o/glm_afa04e31_17ad_4ca6_9bd1_8ab80005ce38.java"
}

Epsilon greedy exploration model

An epsilon greedy exploration model is a model used for exploring over a set of actions in bandit and reinforcement learning style problems. Epsilon greedy makes use of a parameter called epsilon and a default policy. The model will choose a random action with probability epsilon and an action from the default policy with probability 1 - epsilon.

(EG) JSON Fields

Field Name JSON Type JSON Subtype Required Default
modelType String N / A true N / A
modelId Object N / A true N / A
salt String N / A true N / A
epsilon Number N / A true N / A
defaultPolicy Object N / A true N / A
classLabels Array N / A true N / A

(EG) JSON Field Descriptions

(EG) modelType

modelType field must be EpsilonGreedyExploration.

(EG) salt

salt is an Aloha specification that will produce a Long that will be used as the salt for the randomization. This allows the randomization to be repeatable.

(EG) epsilon

epsilon is a floating point number which signifies the tradeoff between exploration and exploitation. Actions are chosen at random with probability epsilon and from the default policy with probability 1 - epsilon.

(EG) default policy

defaultPolicy is the model to be used to return the learned action. This model must evaluate to an Int as it is used within the MWT Epsilon Greedy Explorer.

(EG) class labels

classLabels is an array of the output type of the model. The result from either the default policy or the random action is used as a lookup into the classLabels.

(EG) JSON Examples

{
  "modelType": "EpsilonGreedyExploration",
  "modelId": {
    "id": 1,
    "name": "Epsilon Greedy Model"
  },
  "epsilon": 0.1,
  "salt": "${profile.user_id}",
  "defaultPolicy": {
    "import": "file:///x/y/z.json"
  },
  "classLabels": [1, 2, 3]
}

Bootstrap exploration model

A bootstrap exploration model is a model used for exploring over a set of actions in bandit and reinforcement learning style problems. Bootstrap works over a set of policies. The algorithm chooses one policy at random and uses that policy’s action as the chosen action.

(B) JSON Fields

Field Name JSON Type JSON Subtype Required Default
modelType String N / A true N / A
modelId Object N / A true N / A
salt String N / A true N / A
policies Array N / A true N / A
classLabels Array N / A true N / A

(B) JSON Field Descriptions

(B) modelType

modelType field must be BootstrapExploration.

(B) salt

salt is an Aloha specification that will produce a Long that will be used as the salt for the randomization. This allows the randomization to be repeatable.

(B) policies

policies are the models to be used to return the action. All the models must evaluate to an Int as it is used within the MWT Bootstrap Explorer.

(B) class labels

classLabels is an array of the output type of the model. The result from either the default policy or the random action is used as a lookup into the classLabels.

(B) JSON Examples

{
  "modelType": "BootstrapExploration",
  "modelId": {
    "id": 1,
    "name": "Bootstrap Exploration Model"
  },
  "salt": "${profile.user_id}",
  "policies": [
    {
      "import": "file:///x/y/z.json"
    },
    {
      "import": "file:///x/y/a.json"
    }
  ],
  "classLabels": [1, 2, 3]
}