
Structured vs. Unstructured Data
Traditionally, data science is performed on what’s known as “structured” data. Structured data has a standardized format, typically tabular: tables with rows and columns. Tables store simple data types (strings, integers, floats, booleans, etc.) and are easy to aggregate and manipulate. In the age of Excel, this organized structure, the foundation of relational databases, is intuitive to us. However, it has significant limitations, particularly when it comes to unstructured data.
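To make this concrete, here is a minimal sketch of structured data in code. The table, column names, and values are made up for illustration, and pandas is just one convenient way to show a typed, tabular layout:

```python
# A tiny illustrative table: every row follows the same schema of simple,
# typed columns, which makes aggregation trivial.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],              # integer
    "customer": ["Ana", "Ben", "Ana"],  # string
    "total":    [19.99, 5.50, 42.00],   # float
    "shipped":  [True, False, True],    # boolean
})

# Aggregating is easy because the structure is standardized.
print(orders.groupby("customer")["total"].sum())
```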
At the risk of sounding glib, unstructured data is everything structured data isn’t. Think of images, videos, audio, and text files. How could these data types be stored in a standardized format? For a while, the best answer was blob (binary large object) storage. Blobs are essentially links within relational databases that point to a file. You can attach metadata to blobs, storing information about the file source or date of creation, for example. While this technically stores our unstructured data, it’s not very helpful for understanding the content of a file. This is where vectors come into play.
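A rough sketch of the “blob plus metadata” pattern looks like this. The table, column names, and file path are invented for illustration; the point is that the row tells you where the file lives and a little about it, but nothing about what it contains:

```python
# The database row stores a pointer to the file plus descriptive metadata,
# but the actual content of the image remains opaque to the database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE media_files (
        id INTEGER PRIMARY KEY,
        blob_uri TEXT,        -- link to the file in object storage
        source TEXT,          -- metadata: where the file came from
        created_at TEXT       -- metadata: date of creation
    )
""")
conn.execute(
    "INSERT INTO media_files (blob_uri, source, created_at) VALUES (?, ?, ?)",
    ("s3://my-bucket/photos/cat_001.jpg", "mobile upload", "2023-05-14"),
)

# We can query the metadata, but not the content of the image itself.
print(conn.execute("SELECT blob_uri, source FROM media_files").fetchall())
```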
Why represent data as a vector?
In school, we’re taught that a vector is an array of numbers representing a point in space, or equivalently an arrow from the origin to that point, with both direction and magnitude. But what does this mean practically? In computer science, a vector can be understood as a one-dimensional array onto which unstructured data can be mapped. An embedding model converts the features of the data into embeddings, which are stored as float values in our vector. If you look at the embeddings in a vector, it seems that no useful information is being stored; it’s just an array of floats. However, vectors are actually a uniquely rich data type, because they can preserve the context of your data. How?
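Here is roughly what that mapping looks like in practice, using the sentence-transformers library. The specific model name is just an assumption for the sketch:

```python
# A minimal sketch: map a piece of unstructured data (a sentence) onto a
# one-dimensional array of floats. Assumes the sentence-transformers
# library and the all-MiniLM-L6-v2 model are available.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vector = model.encode("The quick brown fox jumps over the lazy dog.")

print(vector.shape)   # e.g. (384,) -- one flat array of floats
print(vector[:5])     # the raw values look like meaningless noise...
```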
Static and Contextual Embeddings
Embeddings are numerical representations of your data. An embedding model converts your unstructured data into these embeddings. The mechanics of embedding models are beyond the scope of this article, but there are two types of embedding models that are important to understand. The first is static embeddings.
Static embedding models (Word2Vec, GloVe) use different techniques to assign a vector of weights to each word. Neural network-based models like Word2Vec rely on backpropagation, while matrix factorization-based models like GloVe minimize a cost function related to word co-occurrence. This process ensures that the vectors of similar words (for example “good” and “great”) are closer than the vectors of unrelated words (for example “treadmill” and “candle”). While static embeddings were a major breakthrough, they struggle with polysemous words and phrases (terms with multiple potential meanings). The classic example is the word “apple.” “Apple” can refer to a type of fruit or to the major technology company. In static embedding models, both meanings are collapsed into the same vector, making it difficult to distinguish which one a given sentence refers to. This is where contextual embeddings come into play.
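A quick sketch of static embeddings in action, assuming gensim and its downloadable pretrained GloVe vectors are available (the word pairs are the same illustrative examples as above):

```python
# Static embeddings: each word has exactly one fixed vector, and similar
# words sit closer together. Assumes gensim's downloadable
# "glove-wiki-gigaword-50" vectors are available.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")

print(glove.similarity("good", "great"))        # relatively high
print(glove.similarity("treadmill", "candle"))  # relatively low

# The limitation: "apple" gets a single vector, whether you mean the
# fruit or the company.
print(glove["apple"][:5])
```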
Contextual embedding models (BERT, ELMo) don’t store one fixed vector for each word. Instead, they use attention mechanisms to dynamically generate new, context-rich vectors by weighting the influence of the surrounding words or phrases. This allows the same word to be interpreted in different ways. Let’s return to the “apple” example. In a normal conversation, people understand whether you’re referring to a fruit or a technology company based on the context of what you’re saying. Are you talking about orchards and pies or laptops and iPhones? Contextual embeddings work the same way. In this case, “apple” the fruit and “Apple” the company would be mapped onto very different vectors, despite on the surface being the same word.
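Here is a rough sketch of that behavior using Hugging Face’s transformers library. The bert-base-uncased checkpoint is just one model choice, and the two sentences are made up for illustration:

```python
# Contextual embeddings: the vector for "apple" depends on the sentence
# it appears in. Assumes the transformers library and the
# bert-base-uncased checkpoint are available.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_apple(sentence):
    """Return the contextual vector for the token 'apple' in a sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("apple")]

fruit = embed_apple("She picked an apple from the orchard for her pie.")
tech = embed_apple("Apple announced a new laptop and an updated iPhone.")

# Same surface word, but the two context-dependent vectors differ.
print(torch.nn.functional.cosine_similarity(fruit, tech, dim=0).item())
```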
Information Retrieval
Now that we have a foundational understanding of embeddings and their ability to capture context, let’s return to our discussion of vectors. Storing embeddings in a vector allows us to perform mathematical processes on the data, such as distance calculations to determine how closely related two vectors are and nearest neighbors search to identify similar vectors. These are the principles behind semantic similarity search, a key information retrieval process.
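In their simplest form, those two operations boil down to a few lines of linear algebra. This toy sketch uses made-up three-dimensional vectors standing in for real embeddings:

```python
# A toy sketch of the two core operations: a similarity measure between
# vectors, and a brute-force nearest-neighbor search over a small set of
# (made-up) embedding vectors.
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

stored = np.array([
    [0.9, 0.1, 0.0],   # pretend these are document embeddings
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.7],
])
query = np.array([0.85, 0.15, 0.05])

scores = [cosine_similarity(query, v) for v in stored]
print("similarities:", np.round(scores, 3))
print("nearest neighbor:", int(np.argmax(scores)))
```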
For example, suppose I have a knowledge base of documents on a variety of subjects, and I want to examine documents that discuss lakes and streams. A traditional keyword search could get us part of the way there: we could ask for all documents containing the word “stream.” However, this would exclude documents containing synonyms for streams, such as “creek” or “brook.” It could also return documents on television streaming services, which are irrelevant to our search. This is where vectors (and vector databases) come in handy. I can embed my knowledge base into a collection of vectors and store them in a vector database. Then, I can embed my query (“lakes and streams”). The query vector is then compared to the vectors in my database, and the k most similar documents are returned. So, vectors not only allow us to incorporate context, but they also provide an efficient means of storing and searching information.
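Putting the pieces together, here is a rough end-to-end sketch of that search using sentence-transformers and FAISS. The documents, model name, and choice of k are all illustrative assumptions:

```python
# Semantic similarity search: embed a tiny "knowledge base", index it with
# FAISS, then embed a query and retrieve the most similar documents.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "Trout thrive in cold mountain streams and clear lakes.",
    "The brook behind the cabin feeds into a small pond.",
    "The new series is available on every major streaming service.",
    "Our quarterly report covers cloud infrastructure costs.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = model.encode(docs, normalize_embeddings=True)

# With normalized vectors, inner product equals cosine similarity.
index = faiss.IndexFlatIP(doc_vectors.shape[1])
index.add(np.asarray(doc_vectors, dtype="float32"))

query = model.encode(["lakes and streams"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), 2)  # top-2

for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {docs[i]}")
```

Note that the documents about streams and brooks should score highest here, while the one about streaming services falls away, which is exactly the behavior keyword search struggles to deliver.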
Embedding models and vector databases offer a world of possibilities for handling unstructured data. Whether you’re building smarter search engines, chatbots, or recommendation systems, understanding how vectors preserve context is key. If you’re interested in diving deeper, I recommend experimenting with libraries like FAISS or Hugging Face’s embedding models to see these concepts in action.