DeepMath - Deep Sequence Models for Premise Selection
Alex A. Alemi, Francois Chollet, Geoffrey Irving, Christian Szegedy, Josef Urban
arXiv e-Print archive, 2016
Keywords:
cs.AI, cs.LG, cs.LO
First published: 2016/06/14. Abstract: We study the effectiveness of neural sequence models for premise selection in
automated theorem proving, one of the main bottlenecks in the formalization of
mathematics. We propose a two stage approach for this task that yields good
results for the premise selection task on the Mizar corpus while avoiding the
hand-engineered features of existing state-of-the-art models. To our knowledge,
this is the first time deep learning has been applied to theorem proving.
#### Introduction
* Automated Theorem Proving (ATP) - Attempting to prove mathematical theorems automatically.
* Bottlenecks in ATP:
* **Autoformalization** - Semantic or formal parsing of informal proofs.
* **Automated Reasoning** - Reasoning about already formalised proofs.
* The paper evaluates the effectiveness of neural sequence models for premise selection (related to automated reasoning) without using hand-engineered features.
* [Link to the paper](https://arxiv.org/abs/1606.04442)
#### Premise Selection
* Given a large set of premises P, an ATP system A with given resource limits, and a new conjecture C, predict those premises from P that will most likely lead to an automatically constructed proof of C by A.
#### Dataset
* Mizar Mathematical Library (MML) used as the formal corpus.
* Premise lengths vary from 5 to 84,299 characters, and over 60% of the words occur fewer than 10 times (rare word problem).
#### Approach
* The model predicts the probability that a given axiom is useful for proving a given conjecture.
* Conjecture and axiom sequences are separately embedded into fixed-length real vectors, then concatenated and passed to a third network with a few fully connected layers and a logistic loss.
* The two embedding networks and the joint predictor path are trained jointly (a rough sketch of this setup follows this list).
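A minimal sketch of this two-tower setup, assuming tf.keras: separate embedding networks for the conjecture and the axiom produce fixed-length vectors, which are concatenated and passed through fully connected layers with a logistic (sigmoid + binary cross-entropy) output. Layer sizes and the LSTM embedder here are illustrative placeholders, not the paper's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN, VOCAB = 2048, 80  # character-level one-hot alphabet of size 80

def make_embedder(name):
    """Embeds a one-hot character sequence into a fixed-length vector."""
    inp = layers.Input(shape=(MAX_LEN, VOCAB), name=f"{name}_chars")
    vec = layers.LSTM(256)(inp)
    return Model(inp, vec, name=f"{name}_embedder")

conjecture_net = make_embedder("conjecture")
axiom_net = make_embedder("axiom")

c_in = layers.Input(shape=(MAX_LEN, VOCAB), name="conjecture")
a_in = layers.Input(shape=(MAX_LEN, VOCAB), name="axiom")
joint = layers.Concatenate()([conjecture_net(c_in), axiom_net(a_in)])
joint = layers.Dense(256, activation="relu")(joint)
prob = layers.Dense(1, activation="sigmoid")(joint)  # P(axiom is useful for conjecture)

model = Model([c_in, a_in], prob)
model.compile(optimizer="adam", loss="binary_crossentropy")  # trained end to end
```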
##### Stage 1: Character-level Models
* Treat premises at the character level, using 80-dimensional one-hot encoding vectors.
* Architectures for embedding (sketched after this list):
* Pure recurrent LSTM and GRU networks
* CNN (with max pooling)
* Recurrent-convolutional network that shortens the sequence with convolutional layers before feeding it to an LSTM.
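Sketches of the convolutional embedders listed above (the pure recurrent variant appears in the earlier sketch), again as tf.keras placeholders; filter counts, kernel sizes, and depths are assumptions rather than the paper's settings.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN, VOCAB = 2048, 80

def cnn_embedder():
    # CNN with max pooling, ending in a global max pool for a fixed-length vector.
    inp = layers.Input(shape=(MAX_LEN, VOCAB))
    x = layers.Conv1D(256, 5, activation="relu")(inp)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Conv1D(256, 5, activation="relu")(x)
    x = layers.GlobalMaxPooling1D()(x)
    return Model(inp, x)

def conv_lstm_embedder():
    # Recurrent-convolutional variant: convolution + pooling shorten the
    # sequence before it is fed to an LSTM.
    inp = layers.Input(shape=(MAX_LEN, VOCAB))
    x = layers.Conv1D(256, 5, strides=2, activation="relu")(inp)
    x = layers.MaxPooling1D(2)(x)
    x = layers.LSTM(256)(x)
    return Model(inp, x)
```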
##### Stage 2: Word-level Models
* MML dataset contains both implicit and explicit definitions.
* To avoid manually encoding the implicit definitions, the entire statement defining an identifier is embedded, and these definition embeddings are used as word-level embeddings.
* This works better than recursively expanding and embedding the word definitions, as definition chains can be very deep.
* Once word-level embeddings are obtained, the architectures from the character-level models can be reused (see the sketch after this list).
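A rough sketch of the definition-based word embeddings: each identifier's defining statement is run through a character-level embedder, and the resulting vector is reused as that identifier's word embedding. `char_embedder` (one of the embedders above) and `encode_chars` (a one-hot encoder) are assumed helpers, not code from the paper.

```python
import numpy as np

def build_word_embeddings(definitions, char_embedder, encode_chars):
    """definitions: dict mapping identifier -> its defining statement (a string)."""
    word_vectors = {}
    for identifier, statement in definitions.items():
        one_hot = encode_chars(statement)                  # shape (1, MAX_LEN, VOCAB)
        word_vectors[identifier] = char_embedder.predict(one_hot)[0]
    return word_vectors

def embed_tokens(tokens, word_vectors, unk_vector):
    """Maps a tokenized conjecture/premise to a sequence of definition embeddings."""
    return np.stack([word_vectors.get(tok, unk_vector) for tok in tokens])
```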
#### Experiments
##### Metrics
* For each conjecture, the model ranks the possible premises.
* Primary metric is the number of conjectures proved from top-k premises.
* Average Max Relative Rank (AMRR) is a more sophisticated measure, based on the observation that conjectures are easier to prove if all their dependencies occur early in the ranking.
* Since it is very costly to rank all axioms for every conjecture, an approximation is used: a fixed number of random false dependencies is sampled for evaluating AMRR (sketched after this list).
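A hedged sketch of the approximate maximum relative rank for a single conjecture: the true dependencies are ranked against a fixed-size random sample of false dependencies, and the relative position of the worst-ranked true dependency is reported; averaging this over conjectures gives the AMRR-style metric. The sample size below is a placeholder.

```python
import random

def max_relative_rank(scores, true_deps, false_pool, n_false=128, seed=0):
    """scores: dict premise -> model score (higher means more relevant);
    false_pool: candidate premises that are not true dependencies."""
    rng = random.Random(seed)
    sampled = rng.sample(false_pool, min(n_false, len(false_pool)))
    candidates = list(true_deps) + sampled
    ranked = sorted(candidates, key=lambda p: scores[p], reverse=True)
    worst = max(ranked.index(dep) + 1 for dep in true_deps)
    return worst / len(ranked)  # in (0, 1]; smaller is better
```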
##### Network Training
* Asynchronous distributed stochastic gradient descent using the Adam optimizer.
* Experiments compare clipped and non-clipped gradients (a minimal optimizer sketch follows this list).
* Maximum sequence length of 2048 for character-level sequences and 500 for word-level sequences.
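A minimal optimizer sketch, assuming tf.keras: Adam with and without gradient clipping, as contrasted in the training experiments. The learning rate and clipping threshold are placeholders, not the authors' values.

```python
import tensorflow as tf

clipped_adam = tf.keras.optimizers.Adam(learning_rate=1e-4, clipnorm=1.0)
unclipped_adam = tf.keras.optimizers.Adam(learning_rate=1e-4)

# e.g. model.compile(optimizer=clipped_adam, loss="binary_crossentropy")
```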
##### Results
* Best selection pipeline: a stage-1 character-level CNN that produces word-level embeddings for the second stage.
* The best models use simple CNNs followed by max pooling, and the two-stage definition-based def-CNN outperforms the naive word-CNN (where word embeddings are learnt in a single pass).