I previously described modeling figure pixels with Gaussian mixtures. A few months ago, I applied the same procedure to over 100k PDF documents from arXiv, which yielded close to 1M figures. For each figure, the process outputs a CSV file of model parameters and an image showing the location of each Gaussian. As mentioned, the pixel positions can be rescaled using a simple heuristic, essentially providing a facsimile of the data used to create the figure. Perhaps more importantly, the model parameters serve as image features from which similarity measures can be calculated. With these measures, one can search for similar data; obvious applications are detecting plagiarism and finding related articles on the basis of the data in their figure images.
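To make the idea of "parameters as features" concrete, here is a minimal sketch of one way to turn mixture parameters into a similarity matrix. It is not necessarily the measure used for the arXiv figures: the function names `gmm_feature_vector` and `cosine_similarity_matrix` are hypothetical, cosine similarity is just one reasonable choice, and the sketch assumes every figure was fitted with the same number of components.

```python
import numpy as np

def gmm_feature_vector(weights, means, covariances):
    """Flatten one figure's mixture parameters into a fixed-length vector.

    Assumption: every figure uses the same number of components.
    Components are sorted by descending weight so that the arbitrary
    ordering produced by the fitting routine does not affect the result.
    """
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    covariances = np.asarray(covariances, dtype=float)
    order = np.argsort(-weights)
    return np.concatenate([
        weights[order],
        means[order].ravel(),
        covariances[order].ravel(),
    ])

def cosine_similarity_matrix(features):
    """Pairwise cosine similarity between rows of a feature matrix."""
    X = np.asarray(features, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.where(norms == 0, 1.0, norms)  # guard against all-zero rows
    return X @ X.T
```

Two figures whose mixtures differ only in the order of their components map to the same feature vector, so their similarity is 1. In practice the parameters would probably need rescaling before comparison, since weights, means, and covariances live on different scales.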
Given the size of the dataset, network graphs of relationships between figures quickly become complex and difficult to interpret, so methods like t-SNE from Laurens van der Maaten prove very helpful. In the example below, the similarity matrix for ~1000 examples within the same year and subject area was computed and reduced for visualization using t-SNE. One can readily identify nearest-neighbor data, and hence related documents, among thousands, if not millions, of figures.
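For readers who want to try the reduction step, the following is a rough sketch using scikit-learn's `TSNE` with a precomputed distance matrix. It makes several assumptions not stated above: the similarity matrix is scaled so that 1 means identical, similarity is converted to distance as `1 - similarity`, and the output is two-dimensional. The helper names `embed_similarity` and `nearest_neighbors` are hypothetical.

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_similarity(similarity, perplexity=30.0, random_state=0):
    """Project a similarity matrix to 2-D with t-SNE.

    Assumption: similarity is scaled so that 1 means identical; it is
    converted to a distance as 1 - similarity.
    """
    S = np.asarray(similarity, dtype=float)
    D = np.clip(1.0 - S, 0.0, None)   # precomputed distances must be non-negative
    D = (D + D.T) / 2.0               # enforce symmetry against rounding error
    np.fill_diagonal(D, 0.0)
    tsne = TSNE(
        n_components=2,
        metric="precomputed",
        init="random",                # PCA init is not available for precomputed input
        perplexity=perplexity,        # must be smaller than the number of figures
        random_state=random_state,
    )
    return tsne.fit_transform(D)

def nearest_neighbors(similarity, index, k=5):
    """Indices of the k figures most similar to figure `index`."""
    row = np.asarray(similarity, dtype=float)[index].copy()
    row[index] = -np.inf              # exclude the figure itself
    return np.argsort(-row)[:k]
```

The embedding is only a visual aid: distances between well-separated clusters in a t-SNE plot are not reliable, so nearest neighbors are better read from the similarity matrix itself, as `nearest_neighbors` does.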