Figure Mining at Scale

I’ve spent some time over the last month or so mining figures from a large preprint server. The value of the information stored there is hard to overestimate: it covers a huge cross section of physics and mathematics, plus several other sciences. The goal of this work is to find figures and build mixture models of their pixels. After some normalization steps, these image features can be used to train one or more ML methods that pick the appropriate data extraction workflow for a given figure. The data has great value independent of its use as a training set, though. Because the figure images are annotated with the Gaussian indices, it’s possible to examine them and quickly build a scaled data set from the mixture model that is representative of the original data used to produce the plot. So far, roughly 5-10% of the complete document set has yielded approximately 200k CSV files of mixture models, accompanied by annotated figure images. At some point soon I’ll release the CSV and associated PNG files; it should be an interesting scaling test of Angular/MongoDB, not to mention a good pricing experiment for my AWS account 🙂
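To make the pixel mixture model idea concrete, here is a minimal sketch in plain Python. It is not the pipeline used for the mining itself, and several things in it are assumptions: that the model is fit to the (x, y) coordinates of dark pixels, that each Gaussian has independent x and y variances, that plain EM is the fitting method, and that the CSV has one row per Gaussian. The function names (`dark_pixels`, `fit_mixture`, `mixture_to_csv`), the threshold, and the column names are all hypothetical.

```python
import csv
import io
import math


def dark_pixels(gray, threshold=128):
    """Return (x, y) coordinates of pixels darker than the threshold."""
    return [(x, y) for y, row in enumerate(gray)
            for x, value in enumerate(row) if value < threshold]


def fit_mixture(points, k, iterations=50, min_var=1e-3):
    """Fit k Gaussians with independent x/y variances to 2D points using EM."""
    pts = sorted(points)
    n = len(pts)
    # Deterministic start: means spread evenly through the x-sorted points.
    means = [pts[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    centre = [sum(p[d] for p in pts) / n for d in range(2)]
    spread = [max(min_var, sum((p[d] - centre[d]) ** 2 for p in pts) / n)
              for d in range(2)]
    variances = [list(spread) for _ in range(k)]
    weights = [1.0 / k] * k

    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for p in pts:
            logs = [math.log(weights[j]) + sum(
                        -0.5 * (math.log(2 * math.pi * v) + (c - m) ** 2 / v)
                        for c, m, v in zip(p, means[j], variances[j]))
                    for j in range(k)]
            top = max(logs)  # subtract the max for numerical stability
            exps = [math.exp(value - top) for value in logs]
            total = sum(exps)
            resp.append([e / total for e in exps])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nk = sum(r[j] for r in resp) + 1e-12  # guard against empty components
            weights[j] = nk / n
            means[j] = tuple(sum(r[j] * p[d] for r, p in zip(resp, pts)) / nk
                             for d in range(2))
            variances[j] = [max(min_var, sum(r[j] * (p[d] - means[j][d]) ** 2
                                             for r, p in zip(resp, pts)) / nk)
                            for d in range(2)]
    return [{"index": j, "weight": weights[j],
             "mean_x": means[j][0], "mean_y": means[j][1],
             "var_x": variances[j][0], "var_y": variances[j][1]}
            for j in range(k)]


def mixture_to_csv(components):
    """Serialise the fitted components as CSV text, one row per Gaussian."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(components[0]))
    writer.writeheader()
    writer.writerows(components)
    return buffer.getvalue()


# Synthetic 60x50 "figure": white background with two dark 5x5 blobs.
gray = [[255] * 60 for _ in range(50)]
for cx, cy in [(10, 10), (50, 40)]:
    for dy in range(-2, 3):
        for dx in range(-2, 3):
            gray[cy + dy][cx + dx] = 0

components = fit_mixture(dark_pixels(gray), k=2)
print(mixture_to_csv(components))
```

The `index` column plays the role of the Gaussian index drawn on an annotated image: once you know which components sit on a curve or marker, their means give pixel-space positions that can be rescaled to the plot's axes. A real run would more likely lean on a library mixture model with full covariances; the hand-rolled EM here just keeps the sketch dependency-free.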