Taxonomies consist of machine-interpretable semantics and support many knowledge-rich applications such as query understanding, content browsing, and personalized recommendation. Enormous efforts have been made on constructing taxonomies either manually or semi-automatically. However, as human knowledge is constantly growing, existing taxonomies will become outdated and fail to capture emerging knowledge. Therefore, in many applications, dynamic expansions of an existing taxonomy are in great demand.
We aim to develop an automated method that expands an existing taxonomy to incorporate a set of new concepts. Take the below figure as an example. We are given a Computer Science field-of-study taxonomy and we expand it to incorporate new concepts (e.g., UDA, Meta Learning, TPU) by finding their more appropriate parents (e.g., UDA under Semi-supervised Learning, Meta Learning under Machine Learning, and TPU under Integrated Circuit).
We propose a framework named TaxoExpan for automatic taxonomy expansion. The core of TaxoExpan is a matching model which predicts whether a “query concept” should be placed under a candidate “anchor concept” (that is, whether the “anchor concept” is the parent of “query concept”). Then, we generate “self-supervision” data from the existing taxonomy to train the matching model. Finally, at the inference stage when a new concept comes, we pair this concept with each node in the existing taxonomy, apply learned matching model on all pairs, and predict the node with the largest matching score as the query concept’s parent node.
Below the architecture of our matching model. Notably, we leverage the ego network (egonet) centered around the “anchor concept” to capture its neighborhood information. This egonet is then passed though a graph propagation module and a graph readout module to generate a summary vector that serves as the anchor representation. Finally, this anchor representation is combined with the query concept representation and fed into a matching module (e.g., multi-layer perceptron or log-bilinear model) to output the final matching degree.
To train the above matching model, we generate self-supervision data from the existing taxonomy. Take the below figure as an example, we first randomly select an edge in the taxonomy (e.g., <room, gallery>) as the positive pair. Then, we construct K negative pairs by treating the child node in the selected edge (i.e., gallery) as the query and randomly selecting a set of K nodes that are neither parents nor descendants of this query. After that, we group these K+1 pairs (one positive and K negatives) into one training instance and repeat the above process to generate multiple instances. Finally, we learn the model based on InfoNCE loss using generated self-supervision data.
On MAG Dataset
|Method||Mean Rank||Hit@1||Hit@3||Mean Recriprocal Rank|
On SemEval Dataset
Expanded Computer Science Taxonomy
Expanded All Scienetific Fields Taxonomy