Digging Out Data With Adobe PDF Extract API

There is an untold amount of scientific data in the millions of reports and scientific studies over the past few centuries. While much is digitized into easy-to-use PDF formats, the information inside may still be locked away so that it isn’t as easy as it could be to work with it in an aggregate manner. Our recently released PDF Extract API provides a powerful way to get the raw text of a document and intelligently understand the content within. This article will demonstrate how t use this API to gather and aggregate an extensive data set into a unified whole.

Our Data

For this hypothetical example, I imagine an astronomy organization (aptly named Department of Star Light) that examines the luminosity of stars. Before I say anything more, please note I am not an astronomer nor a scientist. Please remember this is all a hypothetical example. Anyway, the DSL (a government agency, so of course, it goes by an acronym) studies the brightness of a set of stairs. Every year, it creates a multipage report on these stars. The report contains a cover page and then twelve pages of tables representing each month of the year.