SPv2 is Science-parse, version 2. It uses a multi-layer, bidirectional LSTM to parse structured data out of PDFs. At the moment, we parse out only a handful of fields; other fields are work in progress.
There is an old version of the science-parse project at https://github.com/allenai/science-parse. That version works differently and generally does worse, but it's faster and has more features than this one.
The easiest way to get started is to pull down the Docker image. The name of the image is `allenai/spv2:2.10`. You should be able to get going with it by running:

```bash
docker run -p 8081:8081 allenai/spv2:2.10
```
To get some parsed papers out, you can use curl:

```bash
curl -v --data-binary @paper.pdf "http://localhost:8081/v1/json/pdf"
```
In the default configuration, this server tries to balance performance and footprint. If you process a lot of documents, it will eventually use about 14G of memory, and you will get acceptable performance out of it. To get really good performance on many thousands of documents, the documents have to be submitted in batches of 50 or 100. The server does not support that scenario right now. Feel free to file an issue, or send a pull request, if you need that functionality.
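If you want to drive the endpoint from code instead of curl, a minimal Python sketch might look like this (the URL matches the curl example above; the directory layout and error handling are just illustrative):

```python
# A sketch of calling the SPv2 endpoint from Python instead of curl. The URL
# matches the curl example above; everything else here is illustrative.
import json
import pathlib

import requests

SPV2_URL = "http://localhost:8081/v1/json/pdf"

def parse_pdfs(pdf_dir):
    """POST each PDF in pdf_dir to the SPv2 server and yield its parsed JSON."""
    for pdf_path in sorted(pathlib.Path(pdf_dir).glob("*.pdf")):
        with open(pdf_path, "rb") as pdf_file:
            # Raw PDF bytes in the request body, like `curl --data-binary @paper.pdf`.
            response = requests.post(SPV2_URL, data=pdf_file, timeout=300)
        response.raise_for_status()
        yield pdf_path.name, response.json()

if __name__ == "__main__":
    for name, parsed in parse_pdfs("papers"):
        print(name, json.dumps(parsed)[:200])
```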
SPv2 maintains a "Task DB". This is a Postgres database that lives in RDS. It has one row for every paper. Every paper is in one of four states: "Scheduled", "Processing", "Failed", or "Done". If a paper is marked as "Done", we store the results in the row as well.
The Task DB also stores which version of SPv2 produced the results. That way, we can run multiple versions side-by-side, and make sure we re-run all the papers when we get a new version.
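As a rough sketch of what claiming work from the Task DB might look like (the table and column names here are assumptions for illustration, not the real schema):

```python
# Hypothetical sketch of claiming work from the Task DB. The table and column
# names ("tasks", "paper_id", "status", ...) are assumptions, not the real schema.
import psycopg2

CLAIM_SQL = """
UPDATE tasks
   SET status = 'Processing', started_at = now()
 WHERE paper_id IN (
        SELECT paper_id
          FROM tasks
         WHERE status = 'Scheduled'
         ORDER BY paper_id
         LIMIT %s
           FOR UPDATE SKIP LOCKED)
RETURNING paper_id;
"""

def claim_papers(conn, batch_size=100):
    """Atomically mark up to batch_size 'Scheduled' papers as 'Processing' and return their ids."""
    with conn, conn.cursor() as cur:
        cur.execute(CLAIM_SQL, (batch_size,))
        return [paper_id for (paper_id,) in cur.fetchall()]

# Usage sketch:
# conn = psycopg2.connect("dbname=spv2 host=... user=...")
# paper_ids = claim_papers(conn)
```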
SPv2 does most of its processing in Kubernetes, under the `spv2-prod` namespace. Kubernetes schedules a cron job that runs daily. Roughly speaking, the cron job processes all the papers that are marked as "Scheduled" in the Task DB until they are either "Done" or "Failed". Details about this job are below.
Several jobs run every day. They don't have to run in a particular order; any order works, and they can even run concurrently.
The cron job is a little bit complicated. Instead of doing any work itself, it spins up two other jobs that do the actual work.
The first of these jobs is a "service", in Kubernetes parlance. It's called the "dataprep service", and it is a normal web service. It takes a paper id, downloads the paper from the S2 public or private bucket, parses and tokenizes it, and returns the tokenized paper. This job is written in Scala and runs on the JVM. That's why it runs separately from the rest of the system.
The second job runs the model. It gets paper ids to process from the Task DB, 100 at a time. It requests tokenized papers from the dataprep service for those papers, churns the data into the right format for the model, and then runs the model. The output is written back to the Task DB.
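A simplified sketch of that worker loop, with an assumed dataprep endpoint and made-up helper names, might look like this:

```python
# A simplified sketch of the worker loop. The dataprep URL, JSON shape, and
# helper names (featurize, write_result) are assumptions for illustration.
import requests

DATAPREP_URL = "http://dataprep:8080/v1/json/paperid/{paper_id}"  # assumed endpoint

def fetch_tokenized(paper_id):
    """Ask the dataprep service to download, parse, and tokenize one paper."""
    response = requests.get(DATAPREP_URL.format(paper_id=paper_id), timeout=120)
    response.raise_for_status()
    return response.json()

def featurize(tokens):
    """Placeholder: turn the tokenized pages into the model's input format."""
    raise NotImplementedError

def process_batch(paper_ids, model, write_result):
    """Tokenize a batch of papers, run the model on them, and write results back."""
    for paper_id in paper_ids:
        try:
            tokens = fetch_tokenized(paper_id)
            features = featurize(tokens)            # pack tokenized pages into model input
            result = model.predict(features)
            write_result(paper_id, "Done", result)  # e.g. an UPDATE against the Task DB
        except Exception as error:
            write_result(paper_id, "Failed", {"error": str(error)})
```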
There is a quirk to the dataprep service that's worth mentioning: processing a paper can take a long time, often more than a minute. If we used pure round-robin load balancing between the dataprep servers, some would fall behind, because lots of papers would queue up behind the slow ones. So we use a Kubernetes feature called a "readiness probe": once a second, Kubernetes pings the server to see if it is ready to receive more papers. If the server is currently busy with a paper, it answers "no", which causes Kubernetes to send the workload to other servers. It's a bit of an abuse of the mechanism, since Kubernetes treats an unready server as an anomaly, while we rely on unreadiness during normal processing. In practice, all it means is that the Kubernetes dashboard sometimes shows lots of servers as distressed, even though everything is working as expected.
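For illustration, here is a minimal sketch of the trick in Python (the real dataprep service is written in Scala, and the route names here are assumptions):

```python
# Minimal sketch of the "busy means not ready" trick, in Python for illustration
# (the real dataprep service is written in Scala); route names are assumptions.
import threading

from flask import Flask, Response

app = Flask(__name__)
busy = threading.Event()  # set while a paper is being processed

@app.route("/readiness")
def readiness():
    # Kubernetes polls this once a second; any non-200 answer means
    # "don't route more work to this pod".
    if busy.is_set():
        return Response("busy", status=503)
    return Response("ready", status=200)

@app.route("/paper/<paper_id>")
def process_paper(paper_id):
    busy.set()  # report ourselves as unready while we work on this paper
    try:
        return tokenize(paper_id)
    finally:
        busy.clear()

def tokenize(paper_id):
    """Placeholder for downloading, parsing, and tokenizing the paper."""
    return {"paper_id": paper_id, "tokens": []}

if __name__ == "__main__":
    app.run(port=8080)
```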
There are lots of retries sprinkled through the system. Every paper is retried three times before we give up on it. When the worker process requests papers from the dataprep service, it retries a few times as well. When a worker process dies while processing papers, those papers are automatically marked as "Scheduled" again after five minutes.
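The last of those mechanisms could look roughly like this, reusing the assumed table layout from the Task DB sketch above:

```python
# Sketch of the "worker died" cleanup, using the same assumed table layout as
# the Task DB sketch above: rows stuck in 'Processing' for more than five
# minutes go back to 'Scheduled'.
REQUEUE_SQL = """
UPDATE tasks
   SET status = 'Scheduled'
 WHERE status = 'Processing'
   AND started_at < now() - interval '5 minutes';
"""

def requeue_stuck_papers(conn):
    """Re-schedule papers whose worker appears to have died."""
    with conn, conn.cursor() as cur:
        cur.execute(REQUEUE_SQL)
        return cur.rowcount  # number of papers put back into the queue
```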
The `with_labels.py` script trains a new model. It takes lots of parameters. The defaults are the ones used for the production model. The most important parameter is the location of the labeled data. SPv2 expects the labeled data in a certain format. You can download the data in that format from the `ai2-s2-spv2` bucket in S3.
Here is a brief description of how I assembled the data:
The PDFs are grouped into buckets named after the first two characters of their SHA: for example, PDFs whose SHA starts with `00` go into the `$pmcdir/00/docs/$sha` directory. Alongside them we store the NXML file that contains the gold metadata for those papers.

```bash
java -Xmx24g -jar spv2-dataprep-cli-assembly-2.4.jar PreprocessPdf - $pmcdir/$bucket/docs |  # output to stdout
  sort -S 10% -T ~/temp --parallel=4 --compress-program=gzip |  # use 10% of main memory for sorting, and place temporaries into ~/temp
  bzip2 -c > $pmcdir/$bucket/tokens3.json.bz2  # pbzip2 is faster, but not always available
```
Then gather token statistics for each bucket, and combine them into one file:

```bash
python ./token_statistics.py gather $pmcdir/$bucket/tokens3.json.bz2 $pmcdir/$bucket/tokenstats.pickle.gz
python ./token_statistics.py combine $pmcdir/??/tokenstats.pickle.gz $pmcdir/all.tokenstats3.gz
```
Turning the `tokens3.json.bz2` file into the right format for the GPU is very CPU-heavy. If you preprocess and train at the same time, the GPU will be 90% idle. So instead, you can pre-process the data beforehand, many buckets in parallel, maybe on several CPU-heavy servers. That's what `dataprep2.py` does. You do it like this:
```bash
python ./dataprep2.py warm --pmc-dir $pmcdir <list of buckets>
```
You can make this more efficient, at the expense of parallelism, by warming multiple buckets with one execution. Once the buckets are warm, you can start training:

```bash
python ./with_labels.py --pmc-dir $pmcdir
```
A lot of the steps in this list can be done in parallel on multiple buckets. In my setup, I have `$pmcdir` available on an NFS file share, so I can have many servers work on them at the same time. At AI2, we have enough computing power that IO is the bottleneck, not CPU.
Also, GNU Parallel is your friend. Whenever I have to do anything to all the buckets, I use GNU Parallel to schedule it.
"Warming the buckets" is going from the tokens6.json.bz2
files to the .h5
files that contain the actual input to the model. This proceeds in three steps:
The model consumes whole pages at a time. Not all pages are the same length, so to process multiple pages in a single batch, we have to mask out some tokens on the shorter pages. To minimize this waste, pages are sorted and processed in groups of approximately the same page length. Because we can't fit all the pages into memory at the same time, pages instead go into a "page pool". The page pool keeps lots of pages around, and returns groups of pages of approximately the same length to the GPU.
The page pool always tries to return batches with the same number of tokens. Sometimes this can mean many short pages, and sometimes it means very few large pages. The important thing to realize here is that different batches contain a different number of pages.
Check the `PagePool` class for further details.
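For illustration, here is a rough sketch of the idea (not the actual `PagePool` implementation; the token budget and pool size are made-up numbers):

```python
# A minimal sketch of the page-pool idea: keep lots of pages around, and hand
# back groups of similar-length pages under a fixed token budget. This is an
# illustration, not the actual PagePool implementation.
import random

class SketchPagePool:
    def __init__(self, tokens_per_batch=32 * 1024, min_pool_size=1000):
        self.pages = []                        # list of (page_length, page) tuples
        self.tokens_per_batch = tokens_per_batch
        self.min_pool_size = min_pool_size

    def add(self, page, page_length):
        self.pages.append((page_length, page))

    def get_batch(self):
        """Return a batch of similar-length pages totalling roughly tokens_per_batch tokens."""
        if len(self.pages) < self.min_pool_size:
            return None                        # let the pool fill up first
        self.pages.sort(key=lambda item: item[0])
        # Pick a random starting point so we don't always drain the shortest pages.
        start = random.randrange(len(self.pages))
        batch, longest = [], 0
        for length, page in self.pages[start:]:
            longest = max(longest, length)
            # The cost of a batch is (number of pages) x (longest page), because
            # shorter pages are padded/masked up to the longest one.
            if batch and (len(batch) + 1) * longest > self.tokens_per_batch:
                break
            batch.append(page)
        del self.pages[start:start + len(batch)]
        return batch
```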