# Preprocess Inputs using ElasticDL Preprocessing Layers
This document is a tutorial for ElasticDL preprocessing layers.
## ElasticDL Preprocessing Layers
ElasticDL preprocessing is a library for preprocessing input data with TensorFlow. It provides a number of Keras layers to preprocess data directly in Keras models. For example, with ElasticDL preprocessing layers you can:
- Normalize an input value by using the mean and standard deviation.
- Convert floats to integers by assigning them to buckets and rounding.
- Convert strings to integers by looking up a vocabulary or by hashing.
## Normalize input values

For numeric inputs, ElasticDL provides `Normalizer` to scale numeric data into a range, and `Discretization`, `LogRound`, and `RoundIdentity` to map numeric data to integer values.
### Normalizer Layer

The `Normalizer` layer normalizes numeric values by `(x - subtractor) / divisor`. For example, we can set the subtractor to the minimum and the divisor to the range size to implement min-max normalization.
```python
import tensorflow as tf

from elasticdl_preprocessing.layers import Normalizer

minimum = 3.0
maximum = 7.0
layer = Normalizer(subtractor=minimum, divisor=(maximum - minimum))
input_data = tf.constant([[3.0], [5.0], [7.0]])
result = layer(input_data)
```
If we want to implement standardization, we can set the subtractor and divisor to the mean and standard deviation.
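To make the formula concrete, here is the equivalent computation in plain NumPy (a sketch of the math only, not the ElasticDL implementation; the mean and standard deviation are computed from the sample column):

```python
import numpy as np

# Sample numeric column.
x = np.array([[3.0], [5.0], [7.0]])

# Standardization: subtractor = mean, divisor = standard deviation,
# i.e. the same (x - subtractor) / divisor rule as the Normalizer layer.
mean = x.mean()                 # 5.0
std = x.std()                   # ~1.633
standardized = (x - mean) / std
print(standardized)             # approximately [[-1.225], [0.], [1.225]]
```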
## Convert floats to integers
### Discretization Layer

The `Discretization` layer bucketizes numeric data into discrete ranges according to boundaries and returns integer bucket indices. For example, if the numeric data is `[19, 42, 55]` and the boundaries are `[30, 45]`, then the outputs are `[0, 1, 2]`.
```python
age_values = tf.constant([[34, 45, 23, 67], [15, 37, 52, 47]])
bins = [20, 30, 40, 50]
layer = Discretization(bins=bins)
result = layer(age_values)
```

The outputs are `[[2, 3, 1, 4], [0, 2, 4, 3]]`.
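The same bucket indices can be reproduced with `np.digitize`, which returns the index `i` such that `bins[i-1] <= x < bins[i]` (a plain-NumPy sketch, not the ElasticDL implementation):

```python
import numpy as np

age_values = np.array([[34, 45, 23, 67], [15, 37, 52, 47]])
bins = [20, 30, 40, 50]

# Each value is mapped to the index of the half-open interval it falls
# into, matching the bucket indices of the Discretization layer.
result = np.digitize(age_values, bins)
print(result.tolist())  # [[2, 3, 1, 4], [0, 2, 4, 3]]
```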
### LogRound Layer

The `LogRound` layer is a special case of `Discretization` with fixed boundaries. It casts a numeric value into a discrete integer value by `round(log(x))`. The `base` of `LogRound` is the base of the `log` operator and `num_bins` is the maximum output value. If the input value is bigger than `base^num_bins`, the output is `num_bins`.
```python
import numpy as np

layer = LogRound(num_bins=16, base=2)
input_data = np.asarray([[1.2], [1.6], [0.2], [3.1], [100]])
result = layer(input_data)
```

The output is `[[0], [1], [0], [2], [7]]`.
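The mapping can be sketched in plain NumPy as `round(log_base(x))` clipped into `[0, num_bins]` (the clipping of negative logs to 0 is assumed from the example output above):

```python
import numpy as np

num_bins, base = 16, 2
input_data = np.asarray([[1.2], [1.6], [0.2], [3.1], [100]])

# round(log_base(x)), clipped into [0, num_bins]; log2(0.2) rounds to -2,
# which is clipped to 0, and values above base**num_bins clip to num_bins.
result = np.clip(np.round(np.log(input_data) / np.log(base)),
                 0, num_bins).astype(int)
print(result.tolist())  # [[0], [1], [0], [2], [7]]
```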
### RoundIdentity Layer

The `RoundIdentity` layer casts a float value to an integer value by `round(x)`, so that we can feed the integer values into `tf.keras.layers.Embedding`. If the input is bigger than `max_value`, the output is `max_value`.
```python
layer = RoundIdentity(max_value=5)
input_data = np.asarray([[1.2], [1.6], [0.2], [3.1], [4.9]])
result = layer(input_data)
```

The output is `[[1], [2], [0], [3], [5]]`.
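The same result in plain NumPy, as a sketch of the `round(x)` rule with the `max_value` cap:

```python
import numpy as np

max_value = 5
input_data = np.asarray([[1.2], [1.6], [0.2], [3.1], [4.9]])

# round(x), clipped to [0, max_value] -- inputs above max_value
# would be capped at max_value.
result = np.clip(np.round(input_data), 0, max_value).astype(int)
print(result.tolist())  # [[1], [2], [0], [3], [5]]
```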
## Convert strings to integers

ElasticDL provides the `Hashing` and `IndexLookup` layers to map strings to integer values.
### Hashing Layer

The `Hashing` layer distributes string values into a finite number of buckets by `hash(x) % num_bins`.
```python
layer = Hashing(num_bins=3)
input_data = np.asarray([['A'], ['B'], ['C'], ['D'], ['E']])
result = layer(input_data)
```

The output is `[[1], [0], [1], [1], [2]]`.
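The exact bucket a string lands in depends on the hash function the layer uses, so the numbers above are specific to it. A deterministic sketch of the `hash(x) % num_bins` idea using the standard library's `hashlib` (a stand-in hash, so the bucket numbers will not match the example output):

```python
import hashlib

def hash_bucket(value, num_bins):
    # Deterministic stand-in hash (MD5); the real layer uses its own
    # hash function, so bucket numbers differ from the example above.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_bins

# Every bucket index falls in [0, num_bins), and equal strings
# always land in the same bucket.
buckets = [[hash_bucket(v, 3)] for v in ["A", "B", "C", "D", "E"]]
print(buckets)
```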
### IndexLookup Layer

The `IndexLookup` layer maps strings to integer indices by looking them up in a vocabulary.
```python
layer = IndexLookup(vocabulary=['A', 'B', 'C'])
input_data = np.array([['A'], ['B'], ['C'], ['D'], ['E']])
result = layer(input_data)
```

The output is `[[0], [1], [2], [3], [3]]`. Out-of-vocabulary strings are mapped to the vocabulary size.
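The lookup itself is a dictionary mapping; a minimal Python sketch, with out-of-vocabulary tokens mapped to the vocabulary size, reproduces the output above:

```python
vocabulary = ['A', 'B', 'C']
# Token -> index table; out-of-vocabulary tokens get index len(vocabulary).
table = {token: index for index, token in enumerate(vocabulary)}
oov_index = len(vocabulary)

input_data = [['A'], ['B'], ['C'], ['D'], ['E']]
result = [[table.get(row[0], oov_index)] for row in input_data]
print(result)  # [[0], [1], [2], [3], [3]]
```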
## Embedding for Preprocessing Results

The preprocessing layers output numeric tensors, which can be fed into NN layers. Here are some examples of using preprocessing layers to provide inputs for embedding layers.
### Embedding for a Feature Group

Sometimes we may divide input features into groups and use the same embedding layer for each group. First, we convert the inputs to zero-based integer values using the preprocessing layers above. Then, we can concatenate those outputs into one tensor. For example, the data set is

education | marital-status |
---|---|
Master | Divorced |
Doctor | Never-married |
Bachelor | Never-married |
Then, we use preprocessing layers to convert the input data to zero-based integer values.

```python
education = tf.keras.layers.Input(shape=(1, ), dtype=tf.string, name="education")
marital_status = tf.keras.layers.Input(shape=(1, ), dtype=tf.string, name="marital_status")

education_lookup = IndexLookup(vocabulary=['Master', 'Doctor', 'Bachelor'])
education_result = education_lookup(education)
marital_status_lookup = IndexLookup(vocabulary=['Divorced', 'Never-married'])
marital_status_result = marital_status_lookup(marital_status)
```
The outputs are

```
education_result = [[0], [1], [2]]
marital_status_result = [[0], [1], [1]]
```
Then, we may want to look up embeddings to map those integer values to embedding vectors for the different features. What's more, we want to put "education" and "marital-status" into one group and look up embeddings for both results in the same embedding table. If we directly concatenate the two results into a tensor `[[0, 0], [1, 1], [2, 1]]` and look up an embedding table, the embedding results are the same for the same integer values of different features, which loses information. So, we need to shift the integer results of different features into different ranges.

For example, we can add the vocabulary size of `education_lookup` to `marital_status_result` and concatenate them into one tensor to look up embeddings. In this example, the vocabulary size of `education_lookup` is 3, the shifted `marital_status_result` is `[[3], [4], [4]]`, and the concatenated result is `[[0, 3], [1, 4], [2, 4]]`. So, we can map the feature values to different embedding vectors using one embedding table.
ElasticDL provides the `ConcatenateWithOffset` layer to concatenate the features in a group and shift their integer values into different ranges.

```python
offsets = [0, education_lookup.vocab_size()]
concat_result = ConcatenateWithOffset(offsets=offsets, axis=1)(
    [education_result, marital_status_result])
```
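The shift-and-concatenate itself is simple; a plain-NumPy sketch of what the layer computes for this example (the offsets `[0, 3]` follow from the education vocabulary size of 3):

```python
import numpy as np

education_result = np.array([[0], [1], [2]])
marital_status_result = np.array([[0], [1], [1]])
offsets = [0, 3]  # 3 = vocabulary size of education_lookup

# Add each feature's offset, then concatenate along axis 1 so the two
# features occupy disjoint id ranges in the shared embedding table.
shifted = [ids + offset for ids, offset in
           zip([education_result, marital_status_result], offsets)]
concat_result = np.concatenate(shifted, axis=1)
print(concat_result.tolist())  # [[0, 3], [1, 4], [2, 4]]
```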
After concatenating the features in a group into one tensor, we can feed the tensor into `tf.keras.layers.Embedding`. However, we need to set the `input_dim` for the `Embedding` layer, and `input_dim` should be bigger than the maximum integer value in the tensor. We can get the maximum from the preprocessing layers like:

```python
max_value = education_lookup.vocab_size() + marital_status_lookup.vocab_size()
embedding_result = tf.keras.layers.Embedding(max_value, 1)(concat_result)
embedding_sum = tf.keras.backend.sum(embedding_result, axis=1)
```
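An embedding lookup is just row indexing into a table; a NumPy sketch of the lookup-and-sum above (the table here is randomly initialized, as a Keras `Embedding` layer's would be):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
max_value = 5                        # education (3) + marital-status (2)
embedding_table = rng.normal(size=(max_value, 1))  # (input_dim, output_dim)

concat_result = np.array([[0, 3], [1, 4], [2, 4]])
# Row indexing performs the lookup; summing over axis 1 pools the two
# feature embeddings of each example into one vector.
embedding_result = embedding_table[concat_result]  # shape (3, 2, 1)
embedding_sum = embedding_result.sum(axis=1)       # shape (3, 1)
print(embedding_sum.shape)  # (3, 1)
```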
### Embedding for a Feature Group with Missing Values

Generally, there are missing values in the data set; a missing value may be an empty string for a string feature or -1 for a numeric feature. For example, there are missing values for "education" and "marital-status":

education | marital-status |
---|---|
Master | Divorced |
 | Never-married |
Bachelor | |
We may not want to look up embeddings for those missing values, so we need to filter them out before converting the values to zero-based integer values with the preprocessing layers. After filtering the missing values, we need a `tf.SparseTensor` to hold the result. ElasticDL provides the `ToSparse` layer to filter missing values and return a `tf.SparseTensor`.
```python
education = tf.keras.layers.Input(shape=(1, ), dtype=tf.string, name="education")
marital_status = tf.keras.layers.Input(shape=(1, ), dtype=tf.string, name="marital_status")

to_sparse = ToSparse(ignore_value='')
education_sparse = to_sparse(education)
marital_status_sparse = to_sparse(marital_status)
```
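What the filtering produces can be sketched as the (indices, values) pair of a sparse tensor. Here is the bookkeeping in plain NumPy for the "education" column of the table above (a sketch of the idea only, not the actual layer):

```python
import numpy as np

ignore_value = ''
dense = np.array([['Master'], [''], ['Bachelor']])  # row 1 is missing

# Keep only the non-missing entries, recording their (row, col)
# positions -- the same information a tf.SparseTensor carries
# as its indices and values.
mask = dense != ignore_value
indices = np.argwhere(mask)
values = dense[mask]
print(indices.tolist())  # [[0, 0], [2, 0]]
print(values.tolist())   # ['Master', 'Bachelor']
```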
Then, we can use the `IndexLookup` layer to convert the sparse string tensors to sparse integer tensors and concatenate them into one tensor like the example above. However, `tf.keras.layers.Embedding` does not support `tf.SparseTensor`, so we need to use `elasticdl_preprocessing.layers.Embedding` to look up embeddings with a `tf.SparseTensor`.
```python
from elasticdl_preprocessing.layers import Embedding

education_lookup = IndexLookup(vocabulary=['Master', 'Doctor', 'Bachelor'])
education_result = education_lookup(education_sparse)
marital_status_lookup = IndexLookup(vocabulary=['Divorced', 'Never-married'])
marital_status_result = marital_status_lookup(marital_status_sparse)

offsets = [0, education_lookup.vocab_size()]
concat_result = ConcatenateWithOffset(offsets=offsets, axis=1)(
    [education_result, marital_status_result])

max_value = education_lookup.vocab_size() + marital_status_lookup.vocab_size()
embedding_result = Embedding(max_value, 1, combiner="sum")(concat_result)
```
Another solution is to fill the missing values with a default value and convert the default value to a fixed integer value. However, that solution has some negative effects, which we have discussed in the issue.