Unlike an ordinary data mining process, embedding sits in the overlap between data mining and machine learning. It extracts knowledge to feed into machine learning, and the extraction itself is trained by the network. The mining works like a pipe through which knowledge flows and is refined over time by training. In other words, the network has to learn from the knowledge and also adjust the source of that knowledge.
Embedded work has two effects:
We consider the passage
“The quick brown fox jumps over the lazy dog”
Each word is represented by a vector in two-dimensional space; in the general case the space is n-dimensional. These vectors are learned, so they move through the space during training.
These word vectors are the rows of an embedded matrix with n columns and m rows, where m = vocabulary_size is the number of words in the vocabulary. Each sample is a paragraph that belongs to a class. A dictionary maps each word of the sample to its index, and the index is used to access the corresponding row vector in the matrix.
The row vectors of the words in a sample are averaged into a single vector of the same size n, which becomes the actual input vector of the network.
The elements of the embedded matrix are initialized from a uniform distribution over the range [-1.0, 1.0). The size n is also called the embedded size.
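The lookup and averaging described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the original implementation: the names word_to_index, embedding and network_input are hypothetical, words are lowercased so that "The" and "the" share one row, and the vocabulary is built from the example sentence alone.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

sentence = "The quick brown fox jumps over the lazy dog"
words = sentence.lower().split()

# Dictionary: each distinct word gets one row index in the embedded matrix.
word_to_index = {}
for word in words:
    if word not in word_to_index:
        word_to_index[word] = len(word_to_index)

vocabulary_size = len(word_to_index)  # m rows
embedding_size = 2                    # n columns (two-dimensional, as in the text)

# m x n matrix; random.random() is in [0.0, 1.0), so entries lie in [-1.0, 1.0).
embedding = [
    [2.0 * random.random() - 1.0 for _ in range(embedding_size)]
    for _ in range(vocabulary_size)
]

# Map the sample to row indexes, then look up one row vector per word.
indexes = [word_to_index[word] for word in words]
vectors = [embedding[i] for i in indexes]

# Average column by column: the result has length n whatever the sample length.
network_input = [
    sum(vector[j] for vector in vectors) / len(vectors)
    for j in range(embedding_size)
]

print(vocabulary_size)     # 8
print(indexes)             # [0, 1, 2, 3, 4, 5, 0, 6, 7]
print(len(network_input))  # 2
```

The sample has nine words but only eight distinct ones, so row 0 is looked up twice and counts twice in the average.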
So the embedding step turns a sequence whose length is not fixed (and may be large) into an embedded vector of fixed (small) length.
We process samples in batches of batch_size samples each. In tensor terms, the lookup first produces a 3D tensor of shape (batch_size, sequence, embedding). After averaging we get a 2D tensor of shape (batch_size, embedding), which is fed to the network as its usual input.
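The change of shape can be traced with nested lists standing in for tensors. The sketch assumes that every sample in a batch has the same sequence length (for instance after padding), which the text does not state. The sizes and the shape helper are hypothetical, chosen only for illustration.

```python
import random

random.seed(1)

batch_size, sequence, embedding_size = 4, 9, 16  # illustrative sizes
vocabulary_size = 100

embedding = [[2.0 * random.random() - 1.0 for _ in range(embedding_size)]
             for _ in range(vocabulary_size)]

# One batch of word indexes: shape (batch_size, sequence).
batch_indexes = [[random.randrange(vocabulary_size) for _ in range(sequence)]
                 for _ in range(batch_size)]

# The lookup gives a 3D tensor: (batch_size, sequence, embedding).
looked_up = [[embedding[i] for i in sample] for sample in batch_indexes]

# Averaging over the sequence axis gives a 2D tensor: (batch_size, embedding).
averaged = [
    [sum(vec[j] for vec in sample) / len(sample) for j in range(embedding_size)]
    for sample in looked_up
]

def shape(tensor):
    """Return the dimensions of a rectangular nested list."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]
    return tuple(dims)

print(shape(looked_up))  # (4, 9, 16)
print(shape(averaged))   # (4, 16)
```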
The hidden node hid1 has n + 1 weights as usual, namely a10, a11, ..., a1n. In addition, the network must learn the m × n weights (parameters) of the embedded matrix.
For example, take a network designed with 16 inputs, so n = 16, and a vocabulary of m = vocabulary_size = 10000 words. The number of embedded parameters is then 10000 × 16 = 160000.
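The arithmetic is easy to check in a few lines. The two function names below are hypothetical helpers, not part of any library.

```python
def embedded_parameter_count(vocabulary_size, embedding_size):
    """Number of weights in an m x n embedded matrix."""
    return vocabulary_size * embedding_size

def hidden_node_weight_count(n_inputs):
    """Weights of one hidden node: one per input plus the bias a_k0."""
    return n_inputs + 1

n = 16      # embedded size, equal to the number of network inputs
m = 10000   # vocabulary_size

print(embedded_parameter_count(m, n))  # 160000
print(hidden_node_weight_count(n))     # 17
```

With these sizes the embedded matrix holds far more parameters than a single hidden node, which has only 17 weights.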
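For a given sample, the partial derivatives of the error function with respect to the weights in the same column are the same. Write w_ij for the embedded weight in row i and column j, x_j for the jth input of the network (the average of column j over the s words of the sample), and a_kj for the weight from input j to the kth hidden node, in the same notation as a10, a11, ..., a1n. For a word that appears once in the sample, the chain rule gives

$$\frac{\partial E}{\partial w_{ij}} = \sum_k \frac{\partial E}{\partial y_k}\,\frac{\partial y_k}{\partial u_k}\,\frac{\partial u_k}{\partial x_j}\,\frac{\partial x_j}{\partial w_{ij}}, \qquad \frac{\partial u_k}{\partial x_j} = a_{kj}, \qquad \frac{\partial x_j}{\partial w_{ij}} = \frac{1}{s}$$

The transfer function activates the hidden node: its input is u_k and its output is y_k. If we use the sigmoid function then

$$y_k = \frac{1}{1 + e^{-u_k}}, \qquad \frac{\partial y_k}{\partial u_k} = y_k\,(1 - y_k)$$

So the final derivative formula for an embedded weight is

$$\frac{\partial E}{\partial w_{ij}} = \frac{1}{s} \sum_k \frac{\partial E}{\partial y_k}\, y_k\,(1 - y_k)\, a_{kj}$$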
Here ∂E / ∂yk is the partial derivative of the error function with respect to the output of the kth hidden node, computed as in the ordinary problem without embedding.
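The gradient of an embedded weight can be checked numerically. The sketch below rests on several assumptions that are not in the text: the error is a squared error measured directly on the hidden outputs (in a full network ∂E / ∂yk would come from the layers above), the sizes and targets are made up, and all names are hypothetical. It compares the chain-rule gradient, with sigmoid derivative yk(1 - yk) and a factor 1/s from the averaging, against central finite differences.

```python
import math
import random

random.seed(2)

m, n, hidden = 5, 3, 2   # vocabulary size, embedded size, hidden nodes
sample = [0, 3, 3, 1]    # word indexes of one sample (word 3 appears twice)
target = [0.2, 0.9]      # hypothetical targets for the hidden outputs

W = [[2.0 * random.random() - 1.0 for _ in range(n)] for _ in range(m)]
# a[k][0] is the bias a_k0; a[k][1..n] multiply the n averaged inputs.
a = [[2.0 * random.random() - 1.0 for _ in range(n + 1)] for _ in range(hidden)]

def forward(W):
    """Average the looked-up rows, then apply the sigmoid hidden nodes."""
    s = len(sample)
    x = [sum(W[i][j] for i in sample) / s for j in range(n)]
    u = [a[k][0] + sum(a[k][j + 1] * x[j] for j in range(n)) for k in range(hidden)]
    return [1.0 / (1.0 + math.exp(-uk)) for uk in u]

def error(W):
    """Assumed squared error on the hidden outputs."""
    y = forward(W)
    return 0.5 * sum((y[k] - target[k]) ** 2 for k in range(hidden))

def analytic_gradient(W):
    y = forward(W)
    s = len(sample)
    # Term shared by every word of the sample for column j:
    # (1/s) * sum_k dE/dy_k * y_k (1 - y_k) * a_kj
    column = [
        sum((y[k] - target[k]) * y[k] * (1.0 - y[k]) * a[k][j + 1]
            for k in range(hidden)) / s
        for j in range(n)
    ]
    grad = [[0.0] * n for _ in range(m)]
    for i in sample:  # a repeated word accumulates once per occurrence
        for j in range(n):
            grad[i][j] += column[j]
    return grad

def numeric_gradient(W, eps=1e-6):
    """Central finite differences, one embedded weight at a time."""
    grad = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            original = W[i][j]
            W[i][j] = original + eps
            plus = error(W)
            W[i][j] = original - eps
            minus = error(W)
            W[i][j] = original
            grad[i][j] = (plus - minus) / (2.0 * eps)
    return grad

ga = analytic_gradient(W)
gn = numeric_gradient(W)
max_diff = max(abs(ga[i][j] - gn[i][j]) for i in range(m) for j in range(n))
print(max_diff)
```

Rows 0 and 1 of ga are identical because both words appear once in the sample, row 3 is twice as large because its word appears twice, and rows 2 and 4 stay at zero because their words are absent.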