The Unreasonable Power of Vectors

Evan Radkoff · October 26, 2024

Let’s say you have a dataset of Things. Things could be people, cities, countries, dinosaurs, episodes of The Simpsons, whatever really. You’d like to better understand the dataset as quickly as you can, use it to improve your understanding of Things, and maybe even build software features on top of it. Basically, you want the dataset “at your fingertips”.

In this situation, a really useful intermediate goal is figuring out the best way to turn your dataset into vectors, essentially just lists of numbers, each denoting a dimension with meaning. Vectors represent a vital stepping stone in the standard data science process – done right, they immediately unlock several paradigms for understanding, and building on top of, your data.

There’s a science to designing the right vector space for your use case. Sometimes working directly with tabular features is fine, usually with some scaling/preprocessing. Other times you might want to employ a training paradigm like word2vec, neural networks, or manifold learning. Admittedly, this post should really be called “The unreasonable power that comes with representing your dataset as well-formed vectors of mostly-independent normalized features” – that just didn’t sound as snazzy.

But this is not a blog post about how to design vector spaces. Rather, I’ll cover the things you can do once you’ve embedded your dataset as a bag of vectors. Whether these are as simple as 3-dimensional vectors describing demographics, or as complex as 4096-dimensional embeddings from the latest LLM, all of the methods below should be applicable.

Scatterplots

One of the most basic things you can do is visualize your dataset in 2D space. This is especially useful in the exploratory phase of a project to get your bearings, understand the dataset’s overall structure, and identify outliers.

Some datasets might have two features that work great as X and Y dimensions, but not all do. The key step that makes this a universally applicable approach is to automatically reduce your vectors’ dimensionality, all the way down to two, in a way that preserves their global structure. That is to say, the distances between vectors in the original high-dimensional space should correlate with the distances in 2D. There are a ton of methods for doing this, each with their pros and cons. I recommend t-SNE or UMAP as your go-tos.
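As a rough sketch of what that dimensionality reduction step can look like with umap-learn (the random array here is just a stand-in for your real embeddings):

```python
# A minimal sketch: project high-dimensional vectors down to 2D with UMAP.
# Assumes `vectors` is a NumPy array of shape (n_samples, n_dims).
import numpy as np
import umap  # pip install umap-learn

rng = np.random.default_rng(0)
vectors = rng.normal(size=(500, 128))  # stand-in for real embeddings

reducer = umap.UMAP(n_components=2, metric="cosine", random_state=42)
coords_2d = reducer.fit_transform(vectors)  # shape (500, 2), ready to scatterplot
```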

Obviously, dots alone without context are not useful, but you can decorate them with a visualization framework. Point size (for numeric features), shape (for categorical features), and color (for either) offer three ways of highlighting features that help you navigate. You can also add hover tooltips, letting you see as many interpretable features as you’d like.

Throughout this post I’ll be demonstrating with a toy dataset of popular music. The interactive scatterplot below is derived from “genre embeddings”, 128-dimensional vectors that come from the last hidden layer of a genre classification neural network. On desktop, hover over a dot for track metadata (I couldn’t get clicking on mobile to work). Code available here.

Each dot represents a track from a random selection of popular music artists.

Generally we see that each artist’s tracks end up near each other, which is a good sign. (As a side note, it’s also satisfying to see the dots representing Girl Talk land right between the hip-hop cluster and the other genres, because he makes mash-up music featuring hip-hop vocals over a backdrop of other genres.)

For a much more impressive scatterplot, check out the entirety of English Wikipedia on Atlas. Enjoy the rabbit holes.

For those working in the Python ecosystem like myself, I can recommend streamlit as a frontend platform for plotting. Jupyter notebooks will work fine too, and there’s something to be said for code living right next to the plots it generates, but I find streamlit easier to work with for many use cases, including this blog post. Several plotting libraries are supported: Matplotlib, Altair, Plotly, Bokeh, and more.
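To give a rough idea of how these pieces fit together, here’s a minimal streamlit + Plotly sketch; the DataFrame and its columns are made-up stand-ins for your projected coordinates plus whatever metadata you want surfaced:

```python
# Sketch of an interactive scatterplot with streamlit + plotly express.
# All data below is randomly generated, standing in for real 2D projections and metadata.
import numpy as np
import pandas as pd
import plotly.express as px
import streamlit as st

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.normal(size=200),                    # 2D coordinates from t-SNE/UMAP
    "y": rng.normal(size=200),
    "genre": rng.choice(["hip-hop", "rock", "electronic"], size=200),
    "duration": rng.uniform(120, 360, size=200),  # track length in seconds
    "title": [f"track_{i}" for i in range(200)],
})

fig = px.scatter(
    df,
    x="x",
    y="y",
    color="genre",                     # categorical feature -> color
    size="duration",                   # numeric feature -> point size
    hover_data=["title", "duration"],  # shown in the hover tooltip
)
st.plotly_chart(fig)
```

Save that as app.py, run streamlit run app.py, and you get pan, zoom, and tooltips for free.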

Clustering

Once your dataset is in a vector space, you can always compute the distance between any two points. This could be Euclidean distance, cosine distance, or something else, but in any case this simple ability unlocks a few go-to data science paradigms for free.

One is clustering – using an algorithm to find logical groupings. This can help you understand segments of your data, and can even work as a classifier for new data points. The scikit-learn library documentation offers a nice overview of common approaches. You’ll notice most of the APIs take a distance measure of your choosing as an input.
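For instance, a minimal clustering sketch using the HDBSCAN implementation that ships with scikit-learn 1.3+ (the array is a stand-in for real embeddings):

```python
# Sketch of unsupervised clustering over embedding vectors.
# Requires scikit-learn >= 1.3 for sklearn.cluster.HDBSCAN.
import numpy as np
from sklearn.cluster import HDBSCAN

rng = np.random.default_rng(0)
vectors = rng.normal(size=(500, 128))  # stand-in for real embeddings

clusterer = HDBSCAN(min_cluster_size=10, metric="euclidean")
labels = clusterer.fit_predict(vectors)  # -1 marks points left unclustered ("noise")
print(f"Found {labels.max() + 1} clusters")
```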

The shapes of the points below indicate membership in unsupervised clusters. This means they were not designed to delineate any existing groupings, like genre labels; they were assigned by HDBSCAN, a hands-off unsupervised algorithm. And yet, the groups do resemble genres and would be useful for downstream analysis. (I’ll admit I’m cheating a little in this example: the embeddings come from a model optimized to recognize genre, so they have a head start in ending up that way.)

Like before, each point represents a track, and the shape of the point is assigned according to the unsupervised clustering algorithm HDBSCAN.

Another thing you can do after choosing a distance metric is find the most similar entities to some query – a paradigm called kNN (k nearest neighbors), or similarity search. This can even power user-facing search features or recommendation engines. See the example below.

The most similar tracks, according to the smallest Euclidean distance between 128-dimensional genre embeddings.
Click the dropdown to query a different track.
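Under the hood, a similarity search like this one can be as simple as the following sketch with scikit-learn’s NearestNeighbors – the vectors and track names are stand-ins, not the actual dataset:

```python
# Sketch of k-nearest-neighbor similarity search over embeddings.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
vectors = rng.normal(size=(500, 128))             # stand-in genre embeddings
track_names = [f"track_{i}" for i in range(500)]  # stand-in metadata

index = NearestNeighbors(n_neighbors=6, metric="euclidean").fit(vectors)

query = vectors[42:43]                            # the track we want neighbors for
distances, indices = index.kneighbors(query)
for dist, idx in zip(distances[0][1:], indices[0][1:]):  # skip the query itself
    print(f"{track_names[idx]}  (distance {dist:.3f})")
```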

Aggregating

Hierarchies come up all the time with structured data. Documents are made up of paragraphs, which are made up of sentences, which are made up of words. Countries are made up of states. A customer’s activity is made up of individual actions they took.

Often you’ll find you want to navigate up these hierarchies, working with higher-level entities even though you have features/vectors for their components. Vectors allow for an elegant solution: just average each dimension, independently, across all components within a higher-level entity. A big advantage here is that it works with any number of components. For example, let’s say you’d like to measure the similarity between two documents, one with three paragraphs and one much longer with ten. Assuming you had an embedding vector for each paragraph, simply averaging the embeddings of document A’s paragraphs, and separately averaging those of document B’s paragraphs, would give you two vectors of equal length that you can measure the distance between.
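In code, the whole trick is a single mean along the component axis – a sketch with made-up paragraph embeddings:

```python
# Sketch of mean-pooling component vectors into one vector per higher-level entity.
# The paragraph embeddings below are random stand-ins of equal dimensionality.
import numpy as np
from scipy.spatial.distance import cosine

rng = np.random.default_rng(0)
doc_a_paragraphs = rng.normal(size=(3, 128))   # document A: 3 paragraph embeddings
doc_b_paragraphs = rng.normal(size=(10, 128))  # document B: 10 paragraph embeddings

doc_a_vec = doc_a_paragraphs.mean(axis=0)      # average each dimension independently
doc_b_vec = doc_b_paragraphs.mean(axis=0)

similarity = 1 - cosine(doc_a_vec, doc_b_vec)  # cosine similarity between the documents
print(f"Document similarity: {similarity:.3f}")
```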

Picking up on our example of music from above, we can average together the vectors of tracks within each album to get album vectors, seen below.

After aggregation, each dot now represents an album.

Training ML models

There are many “black box” ML paradigms that do well with inputs of arbitrary tabular data, and learn statistical patterns as needed for downstream tasks. If you’ve already done the work to represent your dataset as well-formed vectors, these downstream tasks can generally be prototyped very quickly. For example, a decent enough supervised classifier might be trainable with just an hour’s worth of labeling from a domain expert.
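A prototype along those lines might be as short as this sketch (random vectors and labels standing in for real embeddings and expert annotations):

```python
# Sketch of prototyping a supervised classifier on top of existing vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
vectors = rng.normal(size=(300, 128))  # stand-in embeddings
labels = rng.integers(0, 2, size=300)  # stand-in binary labels from a domain expert

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, vectors, labels, cv=5)  # quick sanity check before investing more
print(f"Mean cross-validated accuracy: {scores.mean():.2f}")
```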

Another example of where this ease-of-application could come in handy is automated data imputation, or filling in missing values with substitutes. Imagine if a few dozen dimensions had missing values, and your goal was to generate reasonable guesses. Coming up with a unique process for each dimension could be a lot of work. However, if the vectors are in good enough shape, you can automatically train a supervised regression model to predict each dimension of interest, iteratively holding each one out as the target label and using the remaining dimensions as inputs.
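scikit-learn’s IterativeImputer packages up essentially that idea – each feature with missing values is modeled as a function of the others. A sketch with stand-in data:

```python
# Sketch of automated imputation: each dimension with missing values is predicted
# from the remaining dimensions. Data below is randomly generated.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 16))
vectors[rng.random(vectors.shape) < 0.05] = np.nan  # knock out ~5% of values

imputer = IterativeImputer(random_state=0)
filled = imputer.fit_transform(vectors)  # NaNs replaced by per-dimension regression estimates
```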

Bridging data sources

It’s not uncommon to construct datasets of entities from multiple data sources. For example, maybe each entity is a customer, and you’d like to combine demographic data with purchase history. You might have already finished an analysis, only to discover there is more data about these customers on the way.

The easiest way to combine such data sources is to simply concatenate their vectors. This is a good choice if your downstream application is an ML model, which ideally will learn patterns across each source.

In theory, all of the other tools I’ve described can also work after concatenation. One thing to be mindful of, however, is unintentionally allowing data sources with a large dimensionality to have outsized influence. Many of the methods work by computing pairwise distances between entities, and with most distance metrics every dimension contributes equally, so a source with more dimensions contributes more to the overall distance. One way around this is, instead of concatenating vectors, to compute the pairwise distance matrix of each data source separately, then average the matrices together so that each source has equal weight. Scikit-learn APIs generally accept distance matrices as inputs instead of raw vectors by specifying metric='precomputed', as does umap-learn. This trick also offers an opportunity to customize the influence of each source with a weight factor of your choosing.
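Here’s a sketch of that distance-matrix version, with stand-in arrays for the two sources and equal weights:

```python
# Sketch: blend two data sources at the distance-matrix stage instead of concatenating.
# `genre_vecs` and `mood_vecs` are random stand-ins, row-aligned to the same tracks.
import numpy as np
import umap
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
genre_vecs = rng.normal(size=(500, 128))  # stand-in genre embeddings
mood_vecs = rng.normal(size=(500, 14))    # stand-in acoustic mood descriptors

d_genre = pairwise_distances(genre_vecs, metric="cosine")
d_mood = pairwise_distances(mood_vecs, metric="cosine")

# Equal weights here; adjust to give one source more or less influence.
d_combined = 0.5 * d_genre + 0.5 * d_mood

reducer = umap.UMAP(n_components=2, metric="precomputed", random_state=42)
coords_2d = reducer.fit_transform(d_combined)  # 2D projection driven by both sources equally
```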

The scatterplot projections below are based on the genre-describing vectors used above, in addition to high-level audio features that describe mood, provided by the AcousticBrainz API, which itself uses the essentia library.

Each dot represents a track, as represented by two different data sources – 128-dimensional genre embeddings, and 14-dimensional acoustic mood descriptors provided by AcousticBrainz/essentia.

Bridging modalities

Going beyond data sources, high-level entities might be represented by multiple modes of data entirely. Text, images, videos, audio, etc. might all look very different from each other, but as long as they can be boiled down to vectors of (mostly independent) features, you can work with them together. Connecting modalities is a hot topic in modern ML (see CLIP, used by DALL-E). But the bag-of-vectors paradigm is, I think, the easiest way to do it for limited use cases.

So far the example music vectors above have been made up of features extracted from raw audio. Looking for complementary data, I came across a nice dataset of descriptors from rateyourmusic.com – albums were labeled by human listeners according to a taxonomy of atmosphere, form, lyrics, mood, style, and technique. For example, Nirvana’s Nevermind is labeled [energetic, rebellious, angry, malevocals, apathetic, sarcastic, alienation, passionate, anxious, self-hatred]. First, to get the labels into a more useful semantic vector space, I used UMAP to project the term-document matrix down to 16 dimensions (I also tried Truncated SVD and PCA). The resulting vectors are then combined with the audio-based features from before – either at the distance-matrix stage, or via concatenation in the PCA case – and projected down even further to two dimensions for the scatterplots below.

Each dot represents an album, as represented by up to three different data sources – aggregated 128-dimensional genre embeddings, aggregated 14-dimensional acoustic mood descriptors, and 16-dimensional vectors based on human labels from rateyourmusic.com.
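For the curious, the descriptor-to-vector step might look roughly like the sketch below – the term-document matrix is randomly generated here rather than built from the real rateyourmusic data:

```python
# Sketch: project a sparse term-document matrix of album descriptors into 16 dense dimensions.
import numpy as np
import umap

rng = np.random.default_rng(0)
# Stand-in term-document matrix: one row per album, one column per descriptor,
# with a 1 wherever a human listener applied that label to the album.
term_doc = (rng.random((300, 400)) < 0.03).astype(float)

reducer = umap.UMAP(n_components=16, metric="cosine", random_state=42)
tag_vecs = reducer.fit_transform(term_doc)  # 16-dimensional "semantic" vectors per album
```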

Conclusion

If you employ these paradigms enough times for enough datasets, you’ll start to think in vectors. “My company needs to do X with a bunch of Ys… how would I design vectors to represent Ys?” Or even outside of work, “I should buy a new car soon… hmm, I wonder what car model vectors would look like?”

Again, the title of this post is a bit facetious. Vectors themselves are not the magic sauce, but rather a mental and practical bridge between worlds. In one world, we obsess over how to represent Things as numbers – deciding their most important features and scrutinizing data quality. In the other world, we’re able to place these numbers inside a black box, and use them to learn and build.


Have some data and a problem to solve?

I'm available for consulting and contract work.

Learn more