First published: 2016/06/14

Abstract: We study the effectiveness of neural sequence models for premise selection in automated theorem proving, one of the main bottlenecks in the formalization of mathematics. We propose a two stage approach for this task that yields good results for the premise selection task on the Mizar corpus while avoiding the hand-engineered features of existing state-of-the-art models. To our knowledge, this is the first time deep learning has been applied to theorem proving.
#### Introduction
* Automated Theorem Proving (ATP) - Attempting to prove mathematical theorems automatically.
* Bottlenecks in ATP:
* **Autoformalization** - Semantic or formal parsing of informal proofs.
* **Automated Reasoning** - Reasoning about already formalised proofs.
* Paper evaluates the effectiveness of neural sequence models for premise selection (related to automated reasoning) without using hand-engineered features.
* [Link to the paper](https://arxiv.org/abs/1606.04442)
#### Premise Selection
* Given a large set of premises P, an ATP system A with given resource limits, and a new conjecture C, predict those premises from P that will most likely lead to an automatically constructed proof of C by A.
#### Dataset
* Mizar Mathematical Library (MML) used as the formal corpus.
* The premise length varies from 5 to 84299 characters, and over 60% of the words occur fewer than 10 times (rare word problem).
#### Approach
* The model predicts the probability that a given axiom is useful for proving a given conjecture.
* Conjecture and axiom sequences are separately embedded into fixed-length real vectors, then concatenated and passed to a third network with a few fully connected layers and a logistic loss.
* The two embedding networks and the joint predictor path are trained jointly.
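The two-tower architecture described above can be sketched as a forward pass in plain numpy. This is a minimal illustration, not the paper's implementation: the embedder here is just mean-pooling plus a tanh layer, and all dimensions and weights are made-up stand-ins for the trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 80        # character vocabulary size, as in the paper's one-hot encoding
EMBED_DIM = 16    # illustrative sizes, not from the paper
HIDDEN_DIM = 8

# Hypothetical weights for the two embedding towers and the joint head.
W_conj = rng.normal(scale=0.1, size=(VOCAB, EMBED_DIM))
W_axiom = rng.normal(scale=0.1, size=(VOCAB, EMBED_DIM))
W_joint = rng.normal(scale=0.1, size=(2 * EMBED_DIM, HIDDEN_DIM))
w_out = rng.normal(scale=0.1, size=HIDDEN_DIM)

def embed(seq_onehot, W):
    """Toy sequence embedder: mean-pool the one-hot rows, then a tanh layer.
    Stands in for the LSTM/GRU/CNN embedders of the paper."""
    return np.tanh(seq_onehot.mean(axis=0) @ W)

def usefulness_prob(conj_onehot, axiom_onehot):
    """P(axiom is useful for proving conjecture): embed both sequences
    separately, concatenate, and pass through a fully connected head."""
    joint = np.concatenate([embed(conj_onehot, W_conj),
                            embed(axiom_onehot, W_axiom)])
    h = np.tanh(joint @ W_joint)
    logit = h @ w_out
    return 1.0 / (1.0 + np.exp(-logit))

def logistic_loss(p, label):
    """Binary cross-entropy on the predicted usefulness probability."""
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

# Toy one-hot sequences standing in for a conjecture and an axiom.
conj = np.eye(VOCAB)[rng.integers(0, VOCAB, size=12)]
axiom = np.eye(VOCAB)[rng.integers(0, VOCAB, size=7)]
p = usefulness_prob(conj, axiom)
loss = logistic_loss(p, label=1)
```

Because both towers feed one head, their parameters and the joint predictor can all receive gradients from the same logistic loss, which is what "trained jointly" amounts to here.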
##### Stage 1: Character-level Models
* Treats premises at the character level, encoding each character as an 80-dimensional one-hot vector.
* Architectures for embedding:
* pure recurrent LSTM and GRU networks
* CNN (with max pooling)
* recurrent-convolutional network that shortens the sequence using a convolutional layer before feeding it to an LSTM
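The character-level input representation is simple to sketch: each premise becomes a matrix with one 80-dimensional one-hot row per character. The specific character-to-index mapping below (printable-ASCII offset) is an assumption for illustration; the paper only fixes the vocabulary size.

```python
import numpy as np

VOCAB_SIZE = 80  # per the paper; the exact character set is an assumption here

def one_hot_encode(premise: str) -> np.ndarray:
    """Encode a premise as a (length, 80) matrix of one-hot rows.
    The index mapping (ASCII code minus 32, wrapped) is illustrative only."""
    out = np.zeros((len(premise), VOCAB_SIZE))
    for i, ch in enumerate(premise):
        out[i, (ord(ch) - 32) % VOCAB_SIZE] = 1.0
    return out

# A made-up Mizar-style statement, just to show the shape.
premise = "for x being Nat holds x + 0 = x"
x = one_hot_encode(premise)
```

A matrix of this shape is what the LSTM/GRU, CNN, and recurrent-convolutional embedders all consume.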
##### Stage 2: Word-level Models
* MML dataset contains both implicit and explicit definitions.
* To avoid manually encoding the implicit definitions, the entire statement defining an identifier is embedded, and these definition embeddings are used as the word-level embeddings.
* This works better than recursively expanding and embedding each word's definition, as the definition chains can be very deep.
* Once word level embeddings are obtained, the architecture from Character-level models can be reused.
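The definition-based embedding idea can be sketched as follows: each identifier's word embedding is the character-level embedding of its full defining statement, with no recursive expansion of the identifiers that statement mentions. The `char_embed` function below is a deterministic stand-in for the trained character-level encoder, and the identifiers and definitions are toy examples, not MML content.

```python
import numpy as np

def char_embed(statement: str, dim: int = 8) -> np.ndarray:
    """Stand-in for the trained character-level encoder: a fixed pseudo-random
    vector seeded from the statement's bytes (illustrative only)."""
    seed = sum(statement.encode())
    rng = np.random.default_rng(seed)
    return rng.normal(size=dim)

def build_word_embeddings(definitions: dict) -> dict:
    """Map each identifier to the embedding of its entire defining statement,
    rather than recursively expanding the (possibly very deep) definition chain."""
    return {ident: char_embed(stmt) for ident, stmt in definitions.items()}

defs = {  # toy identifiers and definitions
    "Nat": "definition of natural numbers ...",
    "succ": "definition of successor ...",
}
word_emb = build_word_embeddings(defs)
```

The resulting identifier-to-vector table can then be used in place of a learned embedding lookup, so the Stage 1 architectures apply unchanged at the word level.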
#### Experiments
##### Metrics
* For each conjecture, the model ranks the possible premises.
* Primary metric is the number of conjectures proved from top-k premises.
* Average Max Relative Rank (AMRR) is a more sophisticated measure, based on the motivation that conjectures are easier to prove if all their dependencies occur early in the ranking.
* Since ranking all axioms for every conjecture is very costly, AMRR is approximated by evaluating against a fixed number of randomly sampled false dependencies.
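The metric can be sketched directly from its description: for each conjecture, take the worst (largest) relative rank among its true dependencies in the model's ranking, then average over conjectures, so lower is better. The function names and the toy scores below are illustrative, not the paper's code.

```python
import numpy as np

def max_relative_rank(scores, true_deps):
    """Worst relative rank of a conjecture's true dependencies.
    scores: dict premise -> model score (higher score = ranked earlier).
    true_deps: set of premises actually used in the proof."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    return max((ranked.index(d) + 1) / n for d in true_deps)

def average_max_relative_rank(per_conjecture):
    """AMRR over conjectures; in practice each conjecture is scored against
    its true dependencies plus a fixed sample of random false premises,
    rather than the whole library."""
    return float(np.mean([max_relative_rank(s, d) for s, d in per_conjecture]))

per_conjecture = [
    # (premise scores, true dependencies) -- toy data
    ({"a": 0.9, "b": 0.5, "c": 0.1}, {"a", "b"}),  # worst dep rank 2/3
    ({"a": 0.9, "b": 0.5, "c": 0.1}, {"c"}),       # worst dep rank 3/3
]
amrr = average_max_relative_rank(per_conjecture)
```

Taking the max over dependencies captures the intuition that a proof attempt from the top-k premises only succeeds once *all* of a conjecture's dependencies are inside the cutoff.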
##### Network Training
* Asynchronous distributed stochastic gradient descent with the Adam optimizer.
* Both clipped and non-clipped gradients were evaluated.
* Max Sequence length of 2048 for character-level sequences and 500 for word-level sequences.
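The clipped variant mentioned above can be illustrated with standard global-norm gradient clipping, where the gradient vectors are rescaled whenever their combined L2 norm exceeds a threshold. This is a common scheme shown for illustration; the paper's exact clipping details are not specified here.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their global L2 norm is at most
    max_norm; gradients already within the threshold pass through unchanged."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total <= max_norm:
        return grads
    scale = max_norm / total
    return [g * scale for g in grads]

grads = [np.array([3.0, 4.0]), np.array([0.0])]  # global norm 5
clipped = clip_by_global_norm(grads, max_norm=1.0)
```

Clipping like this keeps a single pathological batch (easy to hit with premises up to 2048 characters) from blowing up an asynchronous update.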
##### Results
* Best selection pipeline: a Stage 1 character-level CNN that produces word-level embeddings for the next stage.
* The best models use simple CNNs followed by max pooling, and the two-stage, definition-based def-CNN outperforms the naive word-CNN (where word embeddings are learned in a single pass).