Network Representations for Incomplete or Sequential Data
Dissertation defense
Timothy LaRock
Past Talk
Tuesday
Dec 14, 2021
Watch video
11:00 am
Virtual
177 Huntington Ave.
11th floor
Online
Register here

When analyzing complex networks, we are often faced with the decision of how to represent the data as a network. For example, we must decide whether to interpret the interactions between nodes with or without directionality. For some types of data, such as networks formed by computing similarities or correlations between objects represented by nodes, we must also decide which relationships warrant creating a link in the network, and which should be considered noise and ignored. In my dissertation, I address the problem of accurately representing networks in two related but largely different domains. In my first project, I show the limitations of network online learning in dealing with incomplete network data. Then, I investigate various aspects of network representation for modeling entities moving through networks in my three other projects.

Network data can often be incomplete for myriad reasons. For example, the sensors recording the data are likely to be noisy or contain biases, such as inaccurate GPS coordinates used to construct spatially embedded networks. In other cases, we may not have access to the entire network; this is often the case in social media data, where observations are retrieved from an API that limits the amount of data a researcher can query. In Chapter 2 of my dissertation, I explore the following problem: given a sampled network known to be incomplete and a limited number of node queries to an API or oracle, can a function be learned that accurately predicts the best node for the next query to maximize a given objective function? I develop a framework to solve this problem, called Network Online Learning * (NOL*), that learns a model to predict which node query will recruit the largest number of unobserved neighbors into the network at a given time by using simple structural features to predict unobserved degrees. I investigate two algorithms within the NOL* framework: a simple and fast online linear regression algorithm and a more computationally intensive regression algorithm that is more accurate in the presence of heavy tails in the target variable. Through extensive experiments, I find some inherent limitations to learning in this space, which I attribute to macroscopic features of networks--namely, the heterogeneity of the degree distribution and the extent of modular or community structure in the network.

In the remaining chapters, I explore network data through the lens of sequences (i.e. walks or pathways through networks, such as the movement of passengers in public transportation systems). Using traditional directed network representations for this kind of data is fundamentally limited because simple directed networks cannot account for sequential correlations--that is, not only how many times a node or an edge is observed, but also in what order. I address this problem through higher-order networks (HONs), network representations of sequential data that directly incorporate memory into the network structure. In particular, I study kth-order DeBruijn graph models of sequential data (i.e., HONs where each node represents a path of length k and each edge represents a path of length k+1). In Chapter 3, I first introduce the problem of using kth-order DeBruijn graphs to discover path anomalies, defined as kth-order pathways through a graph that are statistically over- or under-represented compared to a lower-order random walk model of the data. I then use DeBruijn graphs to study sequential motifs in observed walks, which are small, directed subgraphs corresponding to types of walks that can be taken through a network (e.g., a directed triangle). I show that we can map the edges of a DeBruijn graph to sequential motifs, then develop null models for the number of observed sequential motifs based on ensembles of complete DeBruijn graphs, and finally describe and implement efficient sampling procedures for these ensembles.

In Chapter 4, I apply the higher-order, walk-based perspective to the global maritime liner shipping network (GLSN), which represents international port-to-port container shipping. Using scheduled liner shipping service route data, I show that previous analyses used suboptimal network representations to analyze container routing on the GLSN. I develop an alternative representation of the routing process that minimizes the number of transfers containers must make between ships, which are costly to shippers and generally avoided. I show that these paths have substantially different characteristics than the shortest paths used in previous work and find that previous analyses inaccurately estimated the role of important links.

Finally, I conclude the dissertation with a discussion of future directions for this research. In particular, I discuss the generalizations of higher-order methods to data representing polyadic interactions, such as hyperedges and simplicial complexes.

Dissertation Committee:

Tina Eliassi-Rad (Chair), Khoury College of Computer Sciences & Network Science Institute, Northeastern University

Samuel V. Scarpino , Rockefeller Foundation; Network Science Institute, Northeastern University

Ingo Scholtes, Center for Artificial Intelligence and Data Science, Julius-Maximilians-Universität Würzburg, Germany; Department of Informatics, University of Zurich, Switzerland

Hongyang Zhang, Khoury College of Computer Sciences, Northeastern University

About the speaker
About the speaker