I’ve uploaded about 500k of the roughly 1M figures extracted from arXiv to AWS S3 for storage. As mentioned, each figure is represented as a Gaussian mixture model (in CSV format) and an image showing the locations (indices) of the model components. Below is a quick summary of the dataset. The three largest subject areas are astro-ph, cond-mat and hep-ph, which together account for more than 50% of the examples.
The numbers for each subject area are further broken down by year in the bar chart, ranging from 2000 (column 1) through 2003 (column 4). The data for 2003 is incomplete, since more documents are still available for extraction.
Each page within each document can be analysed in under a second, producing both a model and an image showing the fit. With the massive parallelism of event-driven Lambda functions, this type of analysis is limited only by the rate at which documents can be uploaded to AWS.
As mentioned previously, one can (with a few tweaks) use the models as features to find similar images or, in other words, related datasets; more on that soon. It’s also relatively easy to get an idea of the data used to generate a plot, e.g. here the model centroids are rescaled and heuristics applied to extract the data series alone (input on the left, re-plotted using Octave on the right).
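To make the rescaling step a little more concrete, here is a minimal Python sketch of mapping centroid positions from image pixels to data coordinates. It is not the code used for the dataset: the function name `rescale_centroids`, the `px_box` and `data_box` parameters, and the assumption of linear axes whose limits are already known are all mine, and the heuristics for picking out the data series are not reproduced.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]


def rescale_centroids(centroids_px: List[Point], px_box: Box, data_box: Box) -> List[Point]:
    """Map centroid pixel positions to data coordinates (hypothetical helper).

    px_box   -- (left, top, right, bottom) of the plot area in pixels (assumed known)
    data_box -- (x_min, y_min, x_max, y_max) axis limits (assumed known, linear axes)
    """
    left, top, right, bottom = px_box
    x_min, y_min, x_max, y_max = data_box
    if right == left or bottom == top:
        raise ValueError("plot area must have non-zero width and height")

    rescaled = []
    for px, py in centroids_px:
        # Fraction of the way across the plot area, then stretch to the axis range.
        x = x_min + (px - left) / (right - left) * (x_max - x_min)
        # Image rows grow downwards, so measure the vertical fraction from the bottom.
        y = y_min + (bottom - py) / (bottom - top) * (y_max - y_min)
        rescaled.append((x, y))
    return rescaled


# Example: a 400x400 pixel plot area whose axes span x in [0, 10] and y in [0, 1].
points = rescale_centroids(
    [(100.0, 400.0), (300.0, 200.0), (500.0, 0.0)],
    px_box=(100.0, 0.0, 500.0, 400.0),
    data_box=(0.0, 0.0, 10.0, 1.0),
)
```

Logarithmic axes would need the same mapping applied to the logarithm of the axis limits, which this sketch leaves out.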