Abstract

Knowledge graphs represent real-world data in a directed graph format in which two entities connected by a predicate represent one fact. Link prediction models predict new relationships from existing entities and predicates, and are trained on benchmarking knowledge graphs. These benchmarking knowledge graphs contain redundancies that are believed to artificially inflate link prediction results. It is assumed in the link prediction field that a model that scores well on a highly redundant benchmarking knowledge graph may not perform equally well on other, more complex knowledge graphs. This research introduces new analysis methods and evaluation metrics for measuring redundancies in knowledge graphs. We use Horn rules to define five redundancy types: near-duplicate, near-reverse, symmetric, transitive, and Cartesian product. The support and confidence of these rules quantify the redundancy. Using these quantified results, we pursue two main goals: (1) measuring the level of redundancy in benchmarking knowledge graphs and (2) offsetting link prediction results based on predicate-specific redundancies within knowledge graphs.

Reporting redundancy levels with a single metric (1) confirms the high redundancy of FB15k, WN18, and YAGO3-10, which are known to be highly redundant. We also find high levels of redundancy for BioKG, which is a new result. To offset link prediction results (2), we use predicate-specific redundancy values as weights for several metrics borrowed from the information retrieval field: RR, R@k, and BPM@k. This method decreased the values of every metric for FB15k, WN18RR, and YAGO3-10, indicating that redundancies artificially inflate link prediction scores on these knowledge graphs. However, Hetionet and NELL-995 show increased values across every metric, indicating that redundancies do not have the same impact on those knowledge graphs. The remaining knowledge graphs show mixed results across link prediction models and metrics.

These results indicate that redundancies in benchmarking knowledge graphs may not have a uniform impact across knowledge graphs, link prediction models, and evaluation metrics. The new methods introduced to measure redundancy provide important insights for interpreting link prediction behavior. Awareness of the level of redundancy in a knowledge graph becomes all the more important, since redundancies lead to unpredictable link prediction results. Because we cannot demonstrate a consistent impact of redundancies, the lower performance of link prediction on knowledge graphs with redundancy removed cannot be explained by a lack of redundancy. We argue that removing redundancy from knowledge graphs is not a valid way of handling it and should not be used as a solution to the problems that redundancies present.
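As a concrete illustration of how support and confidence quantify one of the five redundancy types, the sketch below computes both values for the symmetric Horn rule p(x, y) ⇒ p(y, x) over a toy triple set. The function name and data layout are illustrative assumptions, not the thesis's implementation.

```python
# Minimal sketch (hypothetical helper): support and confidence of the
# symmetric Horn rule p(x, y) => p(y, x) for a single predicate.
# Support = number of (x, y) pairs for which both body and head hold;
# confidence = support / number of pairs satisfying the body.

def symmetric_rule_stats(triples, predicate):
    """Return (support, confidence) of p(x, y) => p(y, x) for `predicate`."""
    pairs = {(h, t) for h, p, t in triples if p == predicate}
    body = len(pairs)                                 # pairs matching the rule body
    support = sum((t, h) in pairs for h, t in pairs)  # pairs whose reverse also exists
    confidence = support / body if body else 0.0
    return support, confidence

# Toy graph: "married_to" is fully symmetric here, "knows" is not.
kg = [
    ("alice", "married_to", "bob"),
    ("bob", "married_to", "alice"),
    ("alice", "knows", "carol"),
]
print(symmetric_rule_stats(kg, "married_to"))  # (2, 1.0)
print(symmetric_rule_stats(kg, "knows"))       # (0, 0.0)
```

Similarly, a sketch of the offsetting idea: predicate-specific redundancy values reweight the reciprocal rank (RR) of each test triple before averaging. The weighting scheme weight = 1 − redundancy is an assumption chosen for illustration; the thesis's exact formula may differ.

```python
# Minimal sketch (assumed weighting scheme): redundancy-weighted mean
# reciprocal rank, so highly redundant predicates contribute less.

def weighted_mrr(ranks_by_predicate, redundancy):
    """ranks_by_predicate: list of (predicate, rank) for test triples;
    redundancy: mapping predicate -> redundancy value in [0, 1]."""
    total = 0.0
    for p, rank in ranks_by_predicate:
        weight = 1.0 - redundancy.get(p, 0.0)  # assumed: weight = 1 - redundancy
        total += weight * (1.0 / rank)
    return total / len(ranks_by_predicate)

ranks = [("married_to", 1), ("knows", 3)]
print(weighted_mrr(ranks, {"married_to": 0.9, "knows": 0.1}))
# (0.1 * 1.0 + 0.9 * (1/3)) / 2 = 0.2
```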

Publication Date

4-28-2026

Document Type

Thesis

Student Type

Graduate

Degree Name

Computer Science (MS)

Department, Program, or Center

Computer Science, Department of

College

Golisano College of Computing and Information Sciences

Advisor

Carlos Rivero

Advisor/Committee Member

Zachary Butler

Advisor/Committee Member

Matthew Fluet

Campus

RIT – Main Campus
