Text Analogies

Moving words inside a vector space

One of the main obstacles to modern text-based search is the diversity of languages spoken around the globe. Different linguistic heritages, alphabets used only within a single language family, and words whose meanings differ from one language to the next make it hard to find results without prior translation. To this day, we usually restrict our searches to the languages we understand and ignore a huge body of content simply because we cannot read it. Although English is the modern lingua franca and the scientific community expects publications in English, everything written in other languages lies dormant. Machine learning methods offer a new way to approach this problem and to make that hidden knowledge accessible.

To gain new insights from text, words need to be translated into the language of mathematics, using numbers to describe their meaning. Each word can carry a variety of information, such as its density within a document, its relation to other words, its frequency across the data pool, and so on. Together these values give the word a unique location within a vector space, from which a computer can start its calculations. Humans, who are used to three dimensions (length, width and depth), have a hard time picturing any more. Words processed with machine learning methods, however, live in a vector space with hundreds of dimensions.
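To make the step from words to numbers concrete, here is a minimal sketch that assumes the simplest possible description of a word: how often each other word appears near it. The two-sentence corpus, the function name `cooccurrence_vectors` and the `window` parameter are invented for this illustration. A trained model learns far richer features than these raw counts.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, invented for illustration only.
corpus = [
    "berlin is the capital of germany",
    "paris is the capital of france",
]

def cooccurrence_vectors(sentences, window=2):
    """Describe each word by how often every other word appears near it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    vocab = sorted(counts)
    # One dimension per vocabulary word; each value is a neighbour count.
    vectors = {w: [counts[w][v] for v in vocab] for w in vocab}
    return vocab, vectors

vocab, vectors = cooccurrence_vectors(corpus)
print(vocab)
print(vectors["berlin"])
print(vectors["paris"])
```

In this tiny corpus "berlin" and "paris" appear in the same contexts, so they end up with identical vectors, while "germany" gets a different one. That is the basic intuition: words used in similar ways land close to each other.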

Within this space, and with the words converted into vectors, you can establish relations between entries. Say you are looking for a relation between words and start with a reference such as "Germany is to Berlin as France is to ...", which should result in "Paris". The machine looks up the word "Germany" and calculates the distance and direction from there to "Berlin". It then applies the same offset to the word "France". At the spot where it lands, the word "Paris" should appear as the correct result. The image below shows only a two-dimensional space (the machine learning space has far more dimensions) but gives an impression of how it works: the angle and distance between "Germany" and "Berlin" are mirrored between "France" and "Paris".
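The offset calculation described here can be sketched in a few lines. The two-dimensional coordinates below are chosen by hand so that the example works out. They are not taken from any trained model, and the helper name `analogy` is hypothetical. The sketch measures closeness with plain Euclidean distance, whereas real systems often use cosine similarity instead.

```python
import math

# Hypothetical 2-D toy vectors, picked by hand for illustration only.
# A trained model would use hundreds of learned dimensions.
vectors = {
    "Germany": [1.0, 1.0],
    "Berlin":  [2.0, 3.0],
    "France":  [4.0, 1.5],
    "Paris":   [5.0, 3.5],
    "Spain":   [6.0, 1.0],
    "Madrid":  [7.0, 3.0],
}

def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' by moving c along the offset from a to b."""
    offset = [vb - va for va, vb in zip(vectors[a], vectors[b])]
    target = [vc + o for vc, o in zip(vectors[c], offset)]
    # The three input words are excluded so they cannot be returned as the answer.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return min(candidates, key=lambda w: math.dist(vectors[w], target))

result = analogy("Germany", "Berlin", "France", vectors)
print(result)
```

With these toy coordinates the offset from "Germany" to "Berlin" is (1, 2). Adding it to "France" lands exactly on "Paris". In a real model the landing point rarely coincides with a word, which is why the nearest neighbour is taken.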

What's the gain?

Aside from the technical aspect, one could question the relevance of this feature. The beauty of the idea lies in the conversion across vector spaces. Using machine learning methods for search is a very different approach from common search technologies. Instead of describing a word only by its occurrences, the machine learning approach tries to mathematically "understand" its meaning by adding numerous aspects to describe it, which results in the multi-dimensional vector space. Interestingly, the resulting locations and directions are similar in nearly every language. The word "house" has roughly the same location and direction inside the English language space as its German equivalent "Haus" or its Arabic counterpart "منزل". So if you use text analogies in your native language, this technique also lets you search through foreign-language documents.
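As a rough sketch of what searching across languages could look like, assume the English and German words already sit in one shared space, so that equivalent words lie close together. All coordinates below are invented, and the helper names `cosine` and `nearest_foreign_word` are hypothetical. How the two language spaces are brought into agreement in the first place is not shown here.

```python
import math

# Hypothetical vectors in one shared space; values invented for illustration.
english = {
    "house": [0.90, 0.10, 0.20],
    "dog":   [0.10, 0.80, 0.30],
    "water": [0.20, 0.20, 0.90],
}
german = {
    "Haus":   [0.88, 0.12, 0.18],
    "Hund":   [0.12, 0.79, 0.33],
    "Wasser": [0.21, 0.18, 0.92],
}

def cosine(u, v):
    """Cosine similarity: close to 1.0 when two vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_foreign_word(word, source, target):
    """Find the target-language word whose vector is most similar to the query."""
    query = source[word]
    return max(target, key=lambda w: cosine(query, target[w]))

match = nearest_foreign_word("house", english, german)
print(match)
```

A query for "house" finds "Haus" without any dictionary, purely because the two vectors point in almost the same direction.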

Or say you are trying to retrieve connections between documents to gain new insights. Since the machine has built a model with assumptions and predictions, relations may surface that were previously unknown. While you look for a certain keyword and its relation to another, the machine could suggest similar relations in a subsequent look-up. A search for a certain drug in relation to a symptom, for example, might point to another drug, simply through mathematical calculation.

A short demonstration

We trained a machine for the following demonstration to show how this works. The input requires a sample relation between two words, plus a third word for which to find the analogy. The machine then calculates the direction and location of the result. Interestingly, it also returns findings that at first glance seem irrelevant or plain wrong. This calls for either additional training or a closer look at why the machine produced these results.