Neural network: Using genetic algorithms to train and deploy neural networks: Embedding


Unlike ordinary data mining, embedding sits at the intersection of data mining and machine learning. It extracts knowledge from the data to feed the learning process, and the extraction itself is trained by the network. The mining works like a pipe through which knowledge flows and is refined over time by training. In other words, the network has to learn from the knowledge and also adjust the source of that knowledge.

Embedding has two effects:

  • Convert sparse data into a uniformly distributed representation of knowledge.
  • Convert large, variable-size inputs, such as the number of words in a paragraph, into the small, fixed-size real input of the network.

Consider the sentence

“The quick brown fox jumps over the lazy dog”

Each word is represented by a vector, here in two-dimensional space; in the general case the space is n-dimensional. These vectors are learned, so they move through the space during training.

These word vectors are the rows of an embedding matrix with n columns and m rows, where m = vocabulary_size is the number of words in the vocabulary. Each sample is a paragraph that belongs to a class. Its words are mapped through a dictionary to row indexes, and each index selects the corresponding row vector of the matrix.
The word vectors of the sample are then averaged into a single vector of the same size n, which becomes the real input vector of the network.
The elements of the embedding matrix are initialized from a uniform distribution over the range [-1.0, 1.0). The size n is also called the embedding size.
So the embedding step gives us:

  • The sample sentence is converted to values within [-1.0, 1.0).
  • The variable (and possibly large) length of the sample paragraph becomes the small, fixed length of the embedding vector.
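
The embedding step described above can be sketched in NumPy as follows. This is only a minimal illustration, not the author's implementation: lowercasing the words, building the dictionary from the sentence itself, the random seed, and the names `vocabulary`, `embedding_matrix` and `network_input` are all assumptions made for the example.

```python
import numpy as np

sentence = "The quick brown fox jumps over the lazy dog"
words = sentence.lower().split()  # assumption: case is ignored

# Dictionary: each distinct word gets a row index in the embedding matrix.
vocabulary = {}
for word in words:
    vocabulary.setdefault(word, len(vocabulary))

vocabulary_size = len(vocabulary)  # m (8 here, because "the" occurs twice)
embedding_size = 2                 # n

# Elements drawn from a uniform distribution over [-1.0, 1.0).
rng = np.random.default_rng(0)     # assumption: fixed seed for repeatability
embedding_matrix = rng.uniform(-1.0, 1.0, size=(vocabulary_size, embedding_size))

# Map words to row indexes, then look up the row vectors.
indexes = [vocabulary[word] for word in words]
word_vectors = embedding_matrix[indexes]   # shape (9, 2): one row per word

# Average over the words to get the fixed-size real input of the network.
network_input = word_vectors.mean(axis=0)  # shape (2,)
print(network_input.shape)
```

Whatever the length of the sentence, `network_input` always has `embedding_size` elements, which is the second effect listed above.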

We process samples in batches of batch_size samples each. In tensor terms, the output of the lookup is a 3D tensor of shape (batch_size, sequence, embedding). After averaging over the sequence axis we get a 2D tensor of shape (batch_size, embedding), which is fed to the network as usual.
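
The tensor shapes in the batched case can be checked with a short NumPy sketch. The concrete sizes, the random word indexes, and the assumption that every sample in the batch has the same number of words (in practice shorter samples would need padding and a masked average) are illustrative choices, not something the text prescribes.

```python
import numpy as np

batch_size, sequence, embedding = 4, 6, 16  # hypothetical sizes
vocabulary_size = 100                       # hypothetical vocabulary

rng = np.random.default_rng(0)
embedding_matrix = rng.uniform(-1.0, 1.0, size=(vocabulary_size, embedding))

# One row of word indexes per sample; all samples have `sequence` words here.
batch_indexes = rng.integers(0, vocabulary_size, size=(batch_size, sequence))

# Lookup: indexing with a 2D integer array yields a 3D tensor.
embedded = embedding_matrix[batch_indexes]  # (batch_size, sequence, embedding)

# Averaging over the sequence axis yields the 2D network input.
network_input = embedded.mean(axis=1)       # (batch_size, embedding)

print(embedded.shape, network_input.shape)
```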

The hidden node hid1 has n + 1 weights as usual: a10, a11, ..., a1n. In addition, the network must learn the m × n weights (parameters) of the embedding matrix.

For example, suppose the network is designed with 16 inputs, so n = 16, and the vocabulary has m = vocabulary_size = 10000 words. The number of embedding parameters is then 10000 × 16 = 160000.
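
The arithmetic above is easy to reproduce. The two helper functions below are hypothetical names introduced only for this sketch.

```python
def embedding_parameter_count(vocabulary_size, embedding_size):
    """Number of learnable elements in an m x n embedding matrix."""
    return vocabulary_size * embedding_size


def hidden_node_weight_count(embedding_size):
    """Weights of one hidden node: n input weights plus one bias."""
    return embedding_size + 1


print(embedding_parameter_count(10000, 16))  # 160000
print(hidden_node_weight_count(16))          # 17
```

The embedding matrix therefore dominates the parameter count: each hidden node adds only 17 weights, while the matrix adds 160000.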

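The partial derivatives of the error function with respect to the embedding weights in the same column are the same for every word of a sample. Let $w_{ij}$ be the element in row $i$, column $j$ of the embedding matrix, $L$ the number of words in the sample, $x_j$ the $j$-th input of the network (the average of column $j$ over the words of the sample), and $u_k = a_{k0} + \sum_j a_{kj} x_j$ the input of the $k$-th hidden node. For a word $i$ that occurs once in the sample we have

$$\frac{\partial E}{\partial w_{ij}} = \frac{1}{L}\,\frac{\partial E}{\partial x_j} = \frac{1}{L}\sum_k \frac{\partial E}{\partial u_k}\, a_{kj}$$

The transfer function activates the hidden node: its input is $u_k$ and its output is $y_k$. If we use the sigmoid function, $y_k = 1 / (1 + e^{-u_k})$, then

$$\frac{\partial y_k}{\partial u_k} = y_k\,(1 - y_k)$$

So the final derivative formula for an embedding weight is

$$\frac{\partial E}{\partial w_{ij}} = \frac{1}{L}\sum_k \frac{\partial E}{\partial y_k}\, y_k\,(1 - y_k)\, a_{kj}$$

where $\partial E / \partial y_k$ is the partial derivative of the error function with respect to the output of the k-th hidden node, computed as in an ordinary network.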
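
As a sanity check, the chain rule for an embedding weight can be compared with a numerical derivative. The sketch below rests on several assumptions that are not in the text: a single sigmoid hidden layer whose outputs are compared directly with a target through a squared error (so that ∂E/∂yk = yk − tk), small made-up sizes, and a sample in which word 0 occurs twice. In a full network, ∂E/∂yk would come from the layers above instead.

```python
import numpy as np


def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))


def forward(emb, idx, A, target):
    x = emb[idx].mean(axis=0)          # averaged embedding = network input
    u = A[:, 0] + A[:, 1:] @ x         # a_k0 + sum_j a_kj * x_j
    y = sigmoid(u)
    error = 0.5 * np.sum((y - target) ** 2)  # assumed error function
    return y, error


def embedding_gradient(emb, idx, A, target):
    y, _ = forward(emb, idx, A, target)
    dE_dy = y - target                 # holds only for the assumed error
    dE_du = dE_dy * y * (1.0 - y)      # sigmoid derivative y_k (1 - y_k)
    dE_dx = A[:, 1:].T @ dE_du         # sum_k dE/du_k * a_kj, shape (n,)
    grad = np.zeros_like(emb)
    # Each occurrence of a word adds dE/dx / L to that word's row.
    np.add.at(grad, idx, dE_dx / len(idx))
    return grad


rng = np.random.default_rng(0)
m, n, hidden = 10, 2, 3
emb = rng.uniform(-1.0, 1.0, size=(m, n))
A = rng.uniform(-1.0, 1.0, size=(hidden, n + 1))
target = np.array([0.0, 1.0, 0.0])
idx = np.array([0, 1, 2, 3, 4, 5, 0, 6, 7])  # word 0 appears twice

grad = embedding_gradient(emb, idx, A, target)

# Central-difference check for one embedding weight (row 1, column 0).
eps = 1e-6
plus, minus = emb.copy(), emb.copy()
plus[1, 0] += eps
minus[1, 0] -= eps
numeric = (forward(plus, idx, A, target)[1]
           - forward(minus, idx, A, target)[1]) / (2 * eps)

print(grad[1, 0], numeric)
```

Rows of words that occur once in the sample receive identical gradients, the row of the repeated word receives twice that amount, and rows of words absent from the sample stay at zero.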
