
Optimizing hard problems with confidence


by Martin Rosvall

You want optimal customer segments or product recommendation lists to turn your customer data into maximum value. However, these problems are so hard that no algorithm can guarantee the optimal solution without checking every possible one. Because you cannot check them all (it would take years), you most likely run a stochastic algorithm a few times and pick the best solution among the runs. A cloud of uncertainty hangs over you: Are the segments good enough? Are they far off? How much value is at stake? There ought to be a way to find out.

We understand your frustration because we deal with these questions every day at Infobaleen and in our academic research. We have long wanted to settle how many times we need to run our clustering algorithms but focused on algorithm development and applications instead. Finally, after repeatedly receiving questions about this problem from other researchers, we decided to give it a go.

In our recent preprint "Exploring the solution landscape enables more reliable network community detection", we introduce a procedure to determine the minimum number of searches required for a good result, given a user-specified resolution length that sets the accepted accuracy.

The procedure can operate on any distance measure between solutions together with any quality score, and therefore applies to a broad class of problems. While the specific problem at hand may suggest a particular type of distance measure, a simple and interpretable one is a good start. In the article, we recommend a weighted version of the Jaccard distance. With customers clustered in segments, the distance between two solutions is the weighted average fraction of customers that best-matching segments do not have in common. Accordingly, we can translate an accepted accuracy of at most one percent non-matching customers into a resolution length of 0.01 between similar solutions.
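To make the distance concrete, here is a minimal sketch of one way to compute such a weighted Jaccard distance in Python. The one-directional matching and the exact weighting are our illustration here; the preprint may define the measure slightly differently, for example by symmetrizing over both solutions.

```python
from collections import defaultdict

def weighted_jaccard_distance(partition_a, partition_b):
    """Weighted Jaccard distance between two segmentations.

    Each partition maps a customer to a segment label. For every
    segment in partition_a, find the best-matching segment in
    partition_b by Jaccard similarity, and average one minus that
    similarity, weighting each segment by its share of customers.
    """
    segments_a, segments_b = defaultdict(set), defaultdict(set)
    for customer, segment in partition_a.items():
        segments_a[segment].add(customer)
    for customer, segment in partition_b.items():
        segments_b[segment].add(customer)

    total = len(partition_a)
    distance = 0.0
    for members_a in segments_a.values():
        best_match = max(
            len(members_a & members_b) / len(members_a | members_b)
            for members_b in segments_b.values()
        )
        distance += len(members_a) / total * (1.0 - best_match)
    return distance
```

With this measure, two identical segmentations have distance 0, and a distance below 0.01 means that best-matching segments disagree on less than one percent of customers on average.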

The procedure holds out some solutions, clusters the remaining ones with a fast and deterministic algorithm so that similar solutions end up together, and then tests how many of the holdout solutions fit in the resulting solution clusters.

We continue searching for new solutions until, say, 95 percent of the holdout solutions fit in the solution clusters. In each solution cluster, the solution with the highest quality forms the center, and the user-specified resolution sets its boundary: only solutions within the resolution length from the center belong to the cluster. In this way, you can require more accurate solutions by using a smaller resolution length.

The fast and deterministic solution-clustering algorithm does not require a prespecified number of clusters and evades the ambiguities that come with multiple solutions. It consists of the following three steps, sketched in code after the list:

  1. Order all solutions from highest to lowest quality.
  2. Let the highest quality solution form solution-cluster center 1.
  3. Repeat for not-yet-clustered solutions: Pick the one with the highest quality and assign it to the first of the m solution-cluster centers within the resolution length. If no such center exists, let it form solution-cluster center m + 1.
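A minimal sketch of these three steps in Python. The Solution container and the pluggable distance function are our illustration; for customer segmentations, the distance can wrap the weighted Jaccard distance above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Solution:
    id: int
    quality: float
    partition: Dict[str, int]  # customer -> segment label

def cluster_solutions(
    solutions: List[Solution],
    distance: Callable[[Solution, Solution], float],
    resolution: float,
) -> Tuple[List[Solution], Dict[int, int]]:
    """Cluster solutions around high-quality centers (steps 1-3 above)."""
    # Step 1: order all solutions from highest to lowest quality.
    ordered = sorted(solutions, key=lambda s: s.quality, reverse=True)
    centers: List[Solution] = []   # one center per solution cluster
    assignment: Dict[int, int] = {}  # solution id -> cluster index
    for solution in ordered:
        # Step 3: assign to the first center within the resolution length.
        for index, center in enumerate(centers):
            if distance(solution, center) < resolution:
                assignment[solution.id] = index
                break
        else:
            # Steps 2-3: no center is close enough; form a new cluster center.
            centers.append(solution)
            assignment[solution.id] = len(centers) - 1
    return centers, assignment
```

For segmentations, `distance` can simply be `lambda a, b: weighted_jaccard_distance(a.partition, b.partition)`.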

For example, in the schematic solution landscape in the figure below, the solution-clustering algorithm first lets the highest quality solution marked with a big filled square form the center of solution cluster 1. In order of decreasing solution quality, it then assigns the remaining solutions marked with filled squares to the same cluster before it lets the one marked with a big filled circle, which does not fit in the same cluster with the given resolution length, form the center of solution cluster 2. Finally, it assigns the remaining solutions marked with filled circles to that cluster.

Because all holdout solutions marked with open symbols in the figure fit in these clusters, the solutions pass the test for the given resolution length. Therefore, we can pick the best solution among them and be confident in a statistical sense that the algorithm will not find a better solution that is significantly different from the best one identified so far.
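A minimal sketch of the holdout test, reusing cluster_solutions from above. The holdout fraction, the random split, and the default threshold are our assumptions for illustration, not necessarily the paper's exact protocol. If the test fails, we search for more solutions and repeat.

```python
import random

def passes_holdout_test(
    solutions, distance, resolution,
    holdout_fraction=0.2, required_fit=0.95, seed=0,
):
    """Hold out a fraction of the solutions, cluster the rest, and test
    whether enough holdout solutions fit in the solution clusters."""
    rng = random.Random(seed)
    shuffled = list(solutions)
    rng.shuffle(shuffled)
    n_holdout = max(1, int(holdout_fraction * len(shuffled)))
    holdout, rest = shuffled[:n_holdout], shuffled[n_holdout:]

    centers, _ = cluster_solutions(rest, distance, resolution)
    fitting = sum(
        any(distance(solution, center) < resolution for center in centers)
        for solution in holdout
    )
    return fitting / n_holdout >= required_fit
```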

With this procedure, you can adapt the number of searches to the problem at hand, so that you neither waste time on too many searches for easier problems nor miss high-quality solutions after too few searches for harder problems.

While saving time can be worth a lot, being confident that you are getting the most out of your optimization algorithm can be worth even more.

Working in a distributed team


by Martin Rosvall

Infobaleen was born in an interdisciplinary research lab at Umeå University, where our lives intersected a few years ago. One day we were researchers with a common interest in complex systems. The next day we were also co-founders with a new mission. However, the two-body problem separated us, and now we are spread across five cities in two countries.

As a result, we benefit from productive working hours without interruptions, distractions, long commutes, or limiting 9-to-5 office hours. We take breaks when we need them and go into deep-work mode when we can. Forget micromanagement. We are immune. Moreover, we can work from wherever we want as long as there is a reliable internet connection. Yes, it works, thanks to a shared vision and unconditional dedication.

Of course, working in a distributed team also comes with challenges. While the number of deep-work hours has stayed constant, we did not manage to align those hours until we started to use the management framework Objectives and Key Results (OKRs). Our quarterly OKRs give everyone a clear direction, but they do not deliver increased autonomy and productivity by themselves. Real traction came when we combined the OKRs with a solution that separated us from the competition.

So much for unlimited individual freedom. No one in the team praises the absence of commutes on days when we see exceptional results in split tests. We also think that it is OK to be interrupted by happy customers.

Other challenges persist. Those who claim that video meetings can replace face-to-face meetings have not met Andrea. While collaboration tools have come a long way, none of them can do Andrea justice when he wants to share his intuition about a problem. Give him a whiteboard, and the show begins. Unfortunately, laptop cameras and screens crop, flatten, and distort body language. For Andrea, that means washing out half of the information.

Also, the barrier to initiating conversations is higher when we work in different places. Firing off messages or setting up video calls are simple one-button actions that work well for scheduled meetings and urgent issues, but less pressing thoughts rarely trigger a video call, and casual conversations do not happen unless we force them.

We make casual conversations happen. Each quarter, everyone asks everyone else two questions in a series of pairwise video calls: "What is one thing you think I do well?" and "What is one thing I could change that would help you do a better job?" These questions not only help us become more productive, but they also stimulate casual conversations that help us align and unite.

What if the next generation of collaboration tools could successfully transmit Andrea's whiteboard shows and complement the merely binary active-or-not user status with an automatic time-for-a-casual-conversation status update? Could we then abandon our physical get-togethers when we set new OKRs? Perhaps, if all we cared about was productivity and alignment, but then we would forget why we started this journey: We have a lot of fun together.

Discovering stories in complex data requires innovative visualizations


by Martin Rosvall

Rich, relational data of who did what when is a goldmine for understanding a system with many connected things. However, it takes multiple steps of processing, refining, and massaging the data to reach the desired understanding. Even the best clustering algorithms that can identify significant structure do not take you all the way. I learned this as a postdoctoral researcher about ten years ago when I was analyzing the output of a network clustering algorithm that I was developing.

The input was a network with more than six million citations between about six thousand scientific journals, and the output was a list of 90 clusters, each representing a scientific area. Did the results make sense? I searched up and down the list for journals that I assumed should be clustered together. It took hours, and I did not learn anything new since I could only confirm what I already knew. Moreover, the cluster list lacked essential information about how the clusters related to each other. I needed a visualization that highlighted scientific areas and their relationships, like road maps depict cities connected by highways.

Because there was no tool available for creating such a map, I started with whiteboard sketches and simple scripts for testing different ideas. Step by step, and with feedback from collaborators, I made my first map of science (see figure above). It depicts scientific areas with circles and their relationships with bidirectional arrows. Their sizes indicate importance. However, when we showed the maps to colleagues, they complained: "Something is wrong with the algorithm, chemistry is too small." Yes indeed, they were chemists. In any case, I went back and checked the clustering algorithm and the map script. No change. Then I went one step further and replaced the input with citation data that was ten years older. Bingo! We showed the alternative map to the chemists, and they were happy: "Now it looks good! What was wrong with the algorithm?" The algorithm was not wrong; their perception of science was. It was outdated.

This experience taught me two things. First, complex data require powerful visualizations to comprehend and communicate the results. Second, rich data are dynamic and change over time, so we needed an efficient visualization to capture that change. With none available, we set out to create one. We call them alluvial diagrams because they look like the alluvial fans of sediment that streams build up. With the alluvial diagrams, we have discovered, for example, how neuroscience emerged as a standalone scientific area mainly from cell biology, neurology, and psychology, and how dramatic changes in lending patterns occurred after the Federal Reserve began paying interest on reserve balances during the financial crisis in 2008.

Because we, as well as other researchers, needed these visualizations over and over again to discover stories in complex data, we built interactive tools to transform days of scripting into minutes of customization. They are available on mapequation.org for anyone to use.

At Infobaleen, our customers have the same desire as researchers to discover stories in their data. However, because they all work with some type of transaction data, we can further streamline the visualization tools and eliminate the need for customization. As an example with open data, this interactive map of movies listed on Wikipedia makes it easy to explore and discover movies related to your favorite ones by zooming and dragging. In this case, the transactions come from editors editing movie pages, but with customer transactions, you can find groups of customers with similar and nonoverlapping interests and use them in targeted campaigns.

Whether you are a researcher or a sales manager, we want to empower you with the best algorithms and visualizations for discovering stories in your data.

Customer segmentation by mapping networks for understanding and applications that generate value


by Martin Rosvall

Ever since Aristotle, organization and classification have been cornerstones of science for understanding the world. In network science, where we conduct most of our research, categorization of nodes into modules with so-called community-detection algorithms has proven indispensable to comprehending the structure of large interconnected systems.

With understanding comes powerful applications. For example, geographical maps both depict what we know about the world in the clearest way and aid navigation. Life in unfamiliar cities is an entirely different thing with Google Maps. That is why our vision is to build Google Maps for networks. Our approach is to develop and combine the best clustering algorithms and visualization tools.

Applied to e-commerce transaction data, where the purchasing network consists of customers and products, a good segmentation provides valuable understanding by predicting when and what products customers are most likely to buy next, which can be exploited in automated workflows and personalized communication through email, Facebook, and so forth. Like navigating unfamiliar cities with good maps, turning data into value with powerful tools leads to radically improved efficiency.

One of the clustering approaches that we are using is based on information theory, and hence we have named it Infomap. We have designed and developed both the underlying mathematics and the algorithm that solves the clustering problem the mathematics defines for a specific set of relational data.

The underlying mathematics of Infomap identifies modules by compressing the modular description of a complete walk across the network. Applied to e-commerce transaction data, the walk corresponds to a succession of random steps between customers and products: from a customer, the walk continues to one of her purchased products selected at random, and then to a customer who purchased the same product, also selected at random, and so on to infinity, like an e-commerce analyst exploring the transaction network. If the transaction network has clusters of tightly interconnected customers and products, the walk will spend relatively long periods within those clusters. Using fundamental information theory, Infomap is designed to capitalize on these structures such that the description length is minimized when the clusters capture most of the structure in the underlying network.
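To make the walk concrete, here is a minimal sketch in Python. The function name and the plain transaction-pair input are our illustration; Infomap itself does not simulate the walk but compresses its description analytically.

```python
import random
from collections import defaultdict

def random_walk_visits(transactions, steps=100_000, seed=0):
    """Simulate the random walk described above on a bipartite
    customer-product network given as (customer, product) pairs."""
    rng = random.Random(seed)
    products_of = defaultdict(list)   # customer -> purchased products
    customers_of = defaultdict(list)  # product -> purchasing customers
    for customer, product in transactions:
        products_of[customer].append(product)
        customers_of[product].append(customer)

    customer = rng.choice(list(products_of))
    visits = defaultdict(int)
    for _ in range(steps):
        product = rng.choice(products_of[customer])   # random purchased product
        customer = rng.choice(customers_of[product])  # random co-buyer
        visits[product] += 1
        visits[customer] += 1
    # Long stretches inside tightly interconnected groups reveal clusters.
    return visits
```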

Infomap's unique algorithm for solving the clustering problem consists of three components: the core algorithm, sub-module movements, and single-node movements. The core algorithm is a proven method for quickly achieving an approximate solution: Neighboring nodes are joined into clusters, which subsequently are joined into superclusters, and so on. To improve the clustering accuracy, repeated and recursive runs of sub-module movements and single-node movements break the clusters into smaller components to enable fine-tuning at different scales.
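Infomap is available as open source on mapequation.org. As a rough sketch of how its Python package can be used (the toy links are placeholders, and exact flags and attribute names may differ between package versions):

```python
# pip install infomap
from infomap import Infomap

# Build a small toy network; node ids are integers, so customers and
# products must share one id space (here, customers 0-2, products 10-11).
im = Infomap("--two-level --silent")
im.add_link(0, 10)
im.add_link(1, 10)
im.add_link(1, 11)
im.add_link(2, 11)
im.run()

print(f"Found {im.num_top_modules} modules")
for node in im.tree:
    if node.is_leaf:
        print(node.node_id, node.module_id)  # node and its cluster
```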

Investing in a powerful framework that enables straightforward mathematical generalizations, and in a continuously refined clustering algorithm, has paid off. Our approach has been widely heralded as one of the best of the many dozens of network-clustering algorithms used in thousands of scientific studies.

Academic glory is merely a means to an end. At Infobaleen, we want to leverage the power of Infomap and other clustering algorithms to help companies turn their data into understanding and applications that generate value.

Challenges and opportunities for a university spin-off


by Martin Rosvall

Turning a novel research idea into a university spin-off is a long and challenging endeavor, but fortunately, many people want to see you succeed. For us, the journey started almost ten years ago.

The ability to simplify and highlight structures in large networks had proven useful across several scientific disciplines. Motivated by feedback and thank-you messages from researchers around the world, we worked hard to improve our algorithms, develop new features, and simplify the interface. We knew that turning an overload of relational data into insightful maps that reveal stories in the data is a universal challenge, so we couldn't resist asking: How much value can we generate for users outside academia? We had to find out.

Luckily, the business developers at Umeå University's tech-transfer office, Uminova Innovation, saw the potential and shared our curiosity. By discussing business ideas and providing resources to test and refine them, they kick-started our journey into the unknown business world.

However, business developers' support and backslapping can only take you so far. Reality is a different game. When we had talked the talk and excited someone who knew someone else whose problem we might be able to solve, it was time to walk the walk. Outcome: a research paper but no customer.

Indeed, the real challenge is to champion both research and entrepreneurship at the same time. Science requires full commitment. The creative endeavor of turning novel research ideas into published papers that get cited is not a part-time job. Simultaneously pushing the research frontier and exploring business opportunities requires a business companion.

After a fortunate turn of events, an experienced business captain came on board. He could see things from a different perspective and helped us navigate the gap between academia and business. Still, distilling years of research and technical expertise into a solution that solves customers' problems, and then successfully communicating that solution within short windows of attention, is a venture in itself. Trying new directions and experiencing setbacks hurts, but there is no other way forward.

Moreover, the process of tweaking a business idea through several pilot projects and iteratively developing a solution also requires a no-bullshit investor and an outstanding team with both a shared vision and a mindset that accepts delayed gratification. For us, a key to the process is a research lab that attracts brilliant students and young researchers who answer "Yes!" to every "Is it possible to build that?" and springboards them into the business endeavor. Optimism and hunger are our recipe for progress.

While our journey hasn't been straight, and has sometimes been rough in shallow waters, we are now sailing on the open sea. It is time to raise the mainsail.