Unlike a normal data mining process, the embedded operation sits in the overlap between data mining and machine learning: it extracts knowledge, feeds it into machine learning, and is in turn trained by the network. The mining is like a pipe through which knowledge flows over time during training. That means the network must both learn the knowledge and adjust the source of that knowledge.
The embedded step has two effects:
We consider the passage
“The quick brown fox jumps over the lazy dog”
Each word is represented by a vector, here in two-dimensional space; the general case is n-dimensional. These vectors are learned and move through the space during training.
These word vectors belong to an embedded matrix with n columns and m rows, where m = vocabulary_size is the number of words in the vocabulary. Each sample is a paragraph belonging to a class; its words are mapped through a dictionary to their vector indexes, which are then used to access the row vectors of the matrix by index.
A vector of the same size n, the average of the word vectors of the sample, becomes the real input vector of the network.
The elements of the embedded matrix are initialized from a uniform distribution in the range [-1.0, 1.0). The size n is also called the embedded size, embedding.
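As a sketch of the steps above (using NumPy; the toy vocabulary and variable names are illustrative, not from the text), the embedded matrix, the dictionary lookup, and the averaging might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

vocabulary_size = 8   # m: number of words in the vocabulary (toy value)
embedding = 2         # n: embedding size, two-dimensional as in the example

# m x n embedded matrix, uniform in [-1.0, 1.0)
embedding_matrix = rng.uniform(-1.0, 1.0, size=(vocabulary_size, embedding))

# dictionary mapping each word to its row index in the matrix
words = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
word_to_index = {w: i for i, w in enumerate(words)}

sample = "the quick brown fox jumps over the lazy dog".split()
indexes = [word_to_index[w] for w in sample]

# look up the row vectors and average them into one n-sized input vector
input_vector = embedding_matrix[indexes].mean(axis=0)
print(input_vector.shape)  # (2,)
```

During training these rows would be adjusted like any other weights; here they stay at their random initial values.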
So through the embedded step, a sequence of unfixed (possibly large) length becomes a fixed-length (small) embedded vector.
We process samples in batches of batch_size samples each. In tensor language, the initial output is a 3D tensor with dimension vector (batch_size, sequence, embedding). After averaging we get a 2D tensor with dimension vector (batch_size, embedding), which is the input of the network as usual.
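A minimal shape sketch of that batch step (NumPy, illustrative values; a fixed sequence length is assumed here only to show the shapes, while in practice each sample is averaged over its own word count):

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, sequence, embedding = 4, 9, 2

# toy batch of already-looked-up word vectors: a 3D tensor
batch = rng.uniform(-1.0, 1.0, size=(batch_size, sequence, embedding))
print(batch.shape)   # (4, 9, 2)

# average over the sequence axis -> 2D tensor, the real network input
inputs = batch.mean(axis=1)
print(inputs.shape)  # (4, 2)
```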
The hidden node hid_{1} has n + 1 weights as usual: a_{10}, a_{11}, ..., a_{1n}. On top of that, the network must learn m × n more weights, the parameters of the embedded matrix.
For example, a network is designed with 16 inputs, so n = 16. The vocabulary size is m = vocabulary_size = 10000. Then the number of embedded parameters is 10000 × 16 = 160000.
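The count can be checked directly:

```python
n = 16        # embedding size = number of network inputs
m = 10000     # vocabulary_size
print(m * n)  # 160000 embedded parameters
```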
Because the network input is the average of the s word vectors of a sample, x_{j} = (1/s) · Σ_{i ∈ sample} w_{ij}, where w_{ij} is the element at row i, column j of the embedded matrix, the partial derivatives of the error function with respect to the embedded weights in the same column j are the same for that sample. We have

∂E / ∂w_{ij} = (1/s) · ∂E / ∂x_{j}

Where

∂E / ∂x_{j} = Σ_{k} (∂E / ∂y_{k}) · (∂y_{k} / ∂u_{k}) · a_{kj}
The transfer function activates the hidden node: the input is u_{k}, the output is y_{k}. If we use the sigmoid function then

y_{k} = 1 / (1 + e^{-u_{k}}), so ∂y_{k} / ∂u_{k} = y_{k} · (1 − y_{k})
So the final derivative formula for an embedded weight w_{ij} (row i, column j of the embedded matrix) is

∂E / ∂w_{ij} = (1/s) · Σ_{k} (∂E / ∂y_{k}) · y_{k} · (1 − y_{k}) · a_{kj}
In which ∂E / ∂y_{k} is the partial derivative of the error function with respect to the output of the k^{th} hidden node, as in the normal problem.
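A sketch of that backward step for one sample, assuming a single sigmoid hidden layer, a squared-error ∂E / ∂y, and the averaged input described above (all names and sizes are illustrative, biases omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
n, hidden, s = 2, 3, 9   # embedding size, hidden nodes, words in the sample

A = rng.uniform(-1.0, 1.0, size=(hidden, n))  # weights a_{kj}
x = rng.uniform(-1.0, 1.0, size=n)            # averaged input vector

u = A @ x                      # u_k: hidden node inputs
y = 1.0 / (1.0 + np.exp(-u))   # y_k: sigmoid outputs

t = rng.integers(0, 2, size=hidden).astype(float)
dE_dy = y - t                  # dE/dy_k for squared error (placeholder target)

# dE/dx_j = sum_k dE/dy_k * y_k * (1 - y_k) * a_{kj}
dE_dx = A.T @ (dE_dy * y * (1.0 - y))

# every word vector in the sample gets the same column gradient, scaled by 1/s
dE_dw = dE_dx / s
print(dE_dw.shape)  # (2,)
```

Each of the s row vectors looked up for this sample would then be updated with the same gradient dE_dw along its columns.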