Welcome to ShortScience.org!
## Introduction to elastic net
* Regularization and variable selection method.
* Sparse representation.
* Exhibits a grouping effect.
* Particularly useful when the number of predictors (*p*) >> the number of observations (*n*).
* LARS-EN algorithm to compute the elastic net regularization path.

## Lasso
* Least squares method with an L1-penalty on the regression coefficients.
* Does continuous shrinkage and automatic variable selection.

### Limitations
* If *p >> n*, lasso can select at most *n* variables.
* In the case of a group of variables exhibiting high pairwise correlation, lasso does not care about which variable is selected.
* If *n > p* and there is high correlation between predictors, ridge regression outperforms lasso.

## Naive elastic net
* Least squares method.
* Penalty on the regression coefficients is a convex combination of the lasso and ridge penalties.
* *penalty = (1−α)\*|β| + α\*|β|<sup>2</sup>* where *β* refers to the coefficient vector.
* *α = 0* => lasso penalty.
* *α = 1* => ridge penalty.
* Naive elastic net can be solved by transforming it to a lasso problem on augmented data.
* Can be viewed as ridge-type shrinkage followed by lasso-type thresholding.

### Limitations
* The two-stage procedure incurs a double amount of shrinkage and introduces extra bias without reducing variance.

## Bridge Regression
* Generalization of lasso and ridge regression.
* Cannot produce sparse solutions.

## Elastic net
* Rescales the naive elastic net coefficients to undo the extra shrinkage.
* Retains the good properties of the naive elastic net.

## Justification for scaling
* Elastic net becomes minimax optimal.
* Scaling reverses the shrinkage control introduced by ridge regression.

## LARS-EN
* Based on LARS (used to solve lasso).
* Elastic net can be transformed to lasso on augmented data, so pieces of the LARS algorithm can be reused.
* Uses sparseness to save on computation.

## Conclusion
Elastic net performs better than lasso.
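As a quick numeric sketch of the naive elastic net penalty defined above (plain NumPy; the function name is my own):

```python
import numpy as np

def naive_elastic_net_penalty(beta, alpha):
    """Convex combination of the lasso and ridge penalties.

    Following the summary's convention: alpha = 0 gives the lasso
    penalty, alpha = 1 gives the ridge penalty.
    """
    beta = np.asarray(beta, dtype=float)
    l1 = np.abs(beta).sum()       # lasso term: |beta|
    l2 = np.square(beta).sum()    # ridge term: |beta|^2
    return (1 - alpha) * l1 + alpha * l2
```

With *β = (1, −2)*, the lasso extreme gives 3, the ridge extreme gives 5, and intermediate values of *α* interpolate between the two.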
# TAO
* Geographically distributed, read-optimized, graph data store.
* Favors availability and efficiency over consistency.
* Developed by and used within Facebook (social graph).
* Link to [paper](https://cs.uwaterloo.ca/~brecht/courses/854-Emerging-2014/readings/data-store/tao-facebook-distributed-datastore-atc-2013.pdf).
## Before TAO
* Facebook's servers directly accessed MySQL to read/write the social graph.
* Memcache used as a look-aside cache.
* Had several issues:
* **Inefficient edge list** - A key-value store is not a good design for storing a list of edges.
* **Distributed Control Logic** - In a look-aside cache architecture, the control logic runs on the clients, which increases the number of failure modes.
* **Expensive Read-After-Write Consistency** - Facebook used asynchronous master-slave replication for MySQL, which introduced a lag before the latest data was reflected in the local replicas.
## TAO Data Model
* **Objects**
* Typed nodes (type is denoted by `otype`)
* Identified by 64-bit integers.
* Contains data in the form of key-value pairs.
* Models users and repeatable actions (eg `comments`).
* **Associations**
* Typed directed edges between objects (type is denoted by `atype`)
* Identified by source object `id1`, `atype` and destination object `id2`.
* Contains data in the form of key-value pairs.
* Contains a 32-bit time field.
* Models actions that happen at most once or record state transitions (eg `like`)
* Often `inverse association` is also meaningful (eg `like` and `liked by`).
## Query API
* Support to create, retrieve, update and delete objects and associations.
* Support to get all associations (`assoc_get`) or their count (`assoc_count`) based on starting node, time, index and limit parameters.
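To make the API concrete, here is a toy in-memory sketch; the class and method signatures are my own illustration, not TAO's actual interface:

```python
import time
from collections import defaultdict

class ToyTAO:
    """Toy in-memory sketch of TAO's object/association API."""

    def __init__(self):
        self.objects = {}                # id -> (otype, data dict)
        self.assocs = defaultdict(list)  # (id1, atype) -> [(id2, time, data)]

    def obj_add(self, oid, otype, data):
        self.objects[oid] = (otype, dict(data))

    def assoc_add(self, id1, atype, id2, t=None, data=None):
        self.assocs[(id1, atype)].append((id2, t or time.time(), data or {}))

    def assoc_get(self, id1, atype, limit=10):
        # Newest-first, since association lists are time-ordered in TAO.
        edges = sorted(self.assocs[(id1, atype)], key=lambda e: e[1], reverse=True)
        return edges[:limit]

    def assoc_count(self, id1, atype):
        return len(self.assocs[(id1, atype)])
```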
## TAO Architecture
### Storage Layer
* Objects and associations stored in MySQL.
* TAO API mapped to SQL queries.
* Data divided into logical shards.
* Objects are bound to their shard for their lifetime (`shard_id` is embedded in `id`).
* Associations are stored on the shard of their `id1` (for faster association queries).
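Embedding `shard_id` in the object id can be sketched as bit packing; the 16-bit split below is an assumption for illustration, as the paper does not specify the exact layout:

```python
SHARD_BITS = 16  # assumed width; not specified in the paper

def make_id(shard_id, local_id):
    # Pack the shard id into the high bits of a 64-bit object id.
    return (shard_id << (64 - SHARD_BITS)) | local_id

def shard_of(oid):
    # Recover the shard without any lookup: it lives in the id itself.
    return oid >> (64 - SHARD_BITS)
```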
### Caching Layer
* Consists of multiple cache servers (together form a `tier`).
* An in-memory LRU cache stores objects, association lists, and association counts.
* Write operation on association list with inverse involves writing 2 shards (for `id1` and `id2`).
* The client sends the query to the cache layer, which issues the inverse write to the shard of `id2`; once that completes, the forward write is made to the shard of `id1`.
* Write failure leads to hanging associations which are repaired by an asynchronous job.
### Leaders and Followers
* A single, large tier is prone to hot spots, and its all-to-all connections grow quadratically.
* Cache split into 2 levels - one **leader tier** and multiple **follower tiers**.
* Clients communicate only with the followers.
* In the case of read miss/write, followers forward the request to the leader which connects to the storage layer.
* Eventual consistency maintained by serving cache maintenance messages from leaders to followers.
* An object update in the leader results in an `invalidation message` to followers.
* Leader sends `refill message` to notify about association write.
* Leaders also serialize concurrent writes and mediate thundering herds.
## Scaling Geographically
* Since the workload is read-intensive, read misses are serviced locally at the expense of data freshness.
* In the multi-region configuration, there are master-slave regions for each shard and each region has its own followers, leader, and database.
* Database in the local region is a replica of the database in the master region.
* In the case of read miss, the leader always queries the region database (irrespective of it being the master database or slave database).
* In the case of a write, the leader in the local region forwards the request to the database in the master region.
## Optimisations
### Caching Server
* RAM is partitioned into `arenas` to extend the lifetime of important data types.
* For small, fixed-size items (eg association count), a direct 8-way associative cache is maintained to avoid the use of pointers.
* Each `atype` is mapped to 16-bit value to reduce memory footprint.
### Cache Sharding and Hot Spots
* Load is balanced among followers through `shard cloning` (reads to a shard are served by multiple followers in a tier).
* Responses to queries include the object's access rate and version number. If the access rate is too high, the client caches the object itself; on subsequent queries, the data is omitted from the reply if it has not changed since the client's version.
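A minimal sketch of this version-based protocol (the API is hypothetical, in the spirit of HTTP ETags):

```python
class VersionedCache:
    """Sketch of the hot-object protocol: if the client already holds the
    current version, the server omits the data from its reply."""

    def __init__(self):
        self.store = {}  # key -> (version, data)

    def put(self, key, data):
        version = self.store.get(key, (0, None))[0] + 1
        self.store[key] = (version, data)

    def get(self, key, client_version=None):
        version, data = self.store[key]
        if client_version == version:
            return version, None  # unchanged: omit the data
        return version, data
```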
### High Degree Objects
* In the case of `assoc_count`, the edge direction is chosen on the basis of which node (source or destination) has a lower degree (to optimize reading the association list).
* For `assoc_get` query, only those associations are searched where time > object's creation time.
## Failure Detection and Handling
* Aggressive network timeouts to detect (potential) failed nodes.
### Database Failure
* In the case of master failure, one of the slaves takes over automatically.
* In the case of slave failure, cache misses are redirected to the TAO leader in the region hosting the database master.
### Leader Failure
* When a leader cache server fails, followers route read misses directly to the database and send writes to a replacement leader (chosen randomly from the leader tier).
### Refill and Invalidation Failures
* Refill and invalidation are sent asynchronously.
* If a follower is not available, the message is stored on the leader's disk.
* These messages will be lost in case of leader failure.
* To maintain consistency, all the shards mapping to a failed leader are invalidated.
### Follower Failure
* Each TAO client is configured with a primary and backup follower tier.
* In normal mode, requests go to the primary tier; if it fails, they go to the backup tier.
* Read-after-write consistency may be violated when failing over between tiers (a read can reach the failover target before the writer's `refill` or `invalidate` does).
The [paper](http://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf) presents Twitter's logging infrastructure: how it evolved from application-specific logging to a unified logging infrastructure, and how session sequences are used as a common-case optimization for a large class of queries.
## Messaging Infrastructure
Twitter uses **Scribe** as its messaging infrastructure. A Scribe daemon runs on every production server and sends log data to a cluster of dedicated aggregators in the same data center. Scribe itself uses **Zookeeper** to discover the hostname of an aggregator: each aggregator registers itself with Zookeeper, and the Scribe daemon consults Zookeeper to find a live aggregator to which it can send the data. Colocated with the aggregators is the staging Hadoop cluster, which merges the per-category streams from all the server daemons and writes the compressed results to HDFS. These logs are then moved into the main Hadoop data warehouse and deposited in per-category, per-hour directories (eg /logs/category/YYYY/MM/DD/HH). Within each directory, the messages are bundled in a small number of large files and are partially ordered by time.
Twitter uses **Thrift** as its data serialization framework, as it supports nested structures, and was already being used elsewhere within Twitter. A system called **Elephant Bird** is used to generate Hadoop record readers and writers for arbitrary thrift messages. Production jobs are written in **Pig(Latin)** and scheduled using **Oink**.
## Application Specific Logging
Initially, all applications defined their own custom formats for logging messages. While this made application logging easy to develop, it had many downsides:
* Inconsistent naming conventions: eg uid vs userId vs user_Id
* Inconsistent semantics associated with each category name causing resource discovery problem.
* Inconsistent format of log messages.
All these issues made it difficult to reconstruct user session activity.
## Client Events
This is an effort within Twitter to develop a unified logging framework to get rid of all the issues discussed previously. A hierarchical, 6-level schema is imposed on all the events (as described in the table below).
| Component | Description | Example |
|-----------|------------------------------------|----------------------------------------------|
| client | client application | web, iPhone, android |
| page | page or functional grouping | home, profile, who_to_follow |
| section | tab or stream on a page | home, mentions, retweets, searches, suggestions |
| component | component object or objects | search_box, tweet |
| element | UI element within the component | button, avatar |
| action | actual user or application action | impression, click, hover |
**Table 1: Hierarchical decomposition of client event names.**
For example, the following event, `web:home:mentions:stream:avatar:profile_click` is logged whenever there is an image profile click on the avatar of a tweet in the mentions timeline for a user on twitter.com (read from right to left).
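Splitting such an event name back into the six schema components of Table 1 is straightforward; a sketch:

```python
COMPONENTS = ["client", "page", "section", "component", "element", "action"]

def parse_event_name(name):
    # Map a colon-delimited, six-level client event name onto the schema.
    return dict(zip(COMPONENTS, name.split(":")))
```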
The alternate design was a tree-based model for logging client events. That model allowed an arbitrarily deep event namespace with logging as fine-grained as required, but the flat client events model was chosen to make top-level aggregate queries easier.
A client event is a Thrift structure that contains the components given in the table below.
| Field | Description |
|-----------------|---------------------------------|
| event initiator | {client, server} × {user, app} |
| event_name | event name |
| user_id | user id |
| session_id | session id |
| ip | user’s IP address |
| timestamp | timestamp |
| event_details | event details |
**Table 2: Definition of a client event.**
The logging infrastructure is unified in two senses:
* All log messages share a common format with clear semantics.
* All log messages are stored in a single place.
## Session Sequences
A session sequence is a sequence of symbols *S = {s<sub>0</sub>, s<sub>1</sub>, s<sub>2</sub>...s<sub>n</sub>}* such that each symbol is drawn from a finite alphabet *Σ*. A bijective mapping is defined between *Σ* and the universe of event names. Each symbol in *Σ* is represented by a valid Unicode code point (frequent events are assigned code points with shorter encodings), so each session sequence becomes a valid Unicode string. Once all logs have been imported into the main database, a histogram of event counts is created and used to map event names to Unicode code points. The counts and samples of each event type are stored in a known location in HDFS. Session sequences are reconstructed from the raw client event logs via a *group-by* on *user_id* and *session_id*. Session sequences are materialized because it is difficult to work with raw client event logs for the following reasons:
* A lot of brute force scans.
* Large group-by operations needed to reconstruct user session.
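A toy reconstruction of the encoding step (the log-record shape and the Private Use Area offset are my assumptions for illustration):

```python
from collections import Counter

def build_codebook(event_log):
    # Frequent event names get earlier (more compactly encoded) code points.
    counts = Counter(event for _, _, event in event_log)
    start = 0xE000  # Unicode Private Use Area, chosen here for illustration
    return {name: chr(start + i)
            for i, (name, _) in enumerate(counts.most_common())}

def session_sequences(event_log, codebook):
    # Group events by (user_id, session_id), preserving log order,
    # and encode each session as a Unicode string.
    sessions = {}
    for user_id, session_id, event in event_log:
        sessions.setdefault((user_id, session_id), []).append(codebook[event])
    return {key: "".join(symbols) for key, symbols in sessions.items()}
```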
#### Alternate Designs Considered
* Reorganize complete Thrift messages by reconstructing user sessions - This solves the second problem but not the first.
* Use a columnar storage format - This addresses the first issue but it just reduces the time taken by mappers and not the number of mappers itself.
The materialized session sequences are much smaller than raw client event logs (around 50 times smaller) and address both the issues.
## Client Event Catalog
To enhance the accessibility of the client event logs, an automatically generated event catalog is used along with a browsing interface that allows users to browse, search, and access sample entries for the various client events (these sample entries are the same ones mentioned in the previous section). The catalog is rebuilt every day and is always up to date.
## Applications
Client Event Logs and Session Sequences are used in the following applications:
* Summary Statistics - Session sequences are used to compute various statistics about sessions.
* Event Counting - Used to understand what fraction of users take advantage of a particular feature.
* Funnel Analytics - Used to focus on user attention in a multi-step process like the signup process.
* User Modeling - Used to identify "interesting" user behavior. N-gram models (from NLP domain) can be extended to measure how important temporal signals are by modeling user behavior on the basis of last n actions. The paper also mentions the possibility of extracting "activity collocations" based on the notion of collocations.
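Counting n-grams over an encoded session string takes only a few lines (illustrative, not Twitter's implementation):

```python
from collections import Counter

def ngram_counts(sequence, n):
    # Count every length-n window of consecutive actions.
    return Counter(tuple(sequence[i:i + n])
                   for i in range(len(sequence) - n + 1))
```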
## Possible Extensions
Session sequences are limited in the sense that they capture only the event name and exclude other details. The solution adopted by Twitter is to use a generic indexing infrastructure that integrates with Hadoop at the level of InputFormats. The indexes reside with the data, making it easier to reindex the data. An alternative would have been to use **Trojan layouts**, which embed indexes in the HDFS block header, but this means that indexing would require the data to be rewritten. Another possible extension would be to leverage more analogies from the field of Natural Language Processing, including the use of automatic grammar induction techniques to learn hierarchical decompositions of user activity. Another area of exploration is leveraging advanced visualization techniques for exploring sessions and mapping interesting behavioral patterns into distinct visual patterns that can be easily recognized.
The [paper](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf) presents some key lessons and "folk wisdom" that machine learning researchers and practitioners have learnt from experience and which are hard to find in textbooks.
### 1. Learning = Representation + Evaluation + Optimization
All machine learning algorithms have three components:
* **Representation** for a learner is the set of classifiers/functions that can possibly be learnt. This set is called the *hypothesis space*. If a function is not in the hypothesis space, it cannot be learnt.
* **Evaluation** function tells how good the machine learning model is.
* **Optimization** is the method used to search the hypothesis space for the best-scoring model.
### 2. It's Generalization That Counts
The fundamental goal of machine learning is to generalize beyond the training set. The data used to evaluate the model must be kept separate from the data used to learn the model. When we use generalization as a goal, we do not have access to a function that we can optimize. So we have to use training error as a proxy for test error.
### 3. Data Alone Is Not Enough
Since our ultimate goal is generalization (see point 2), there is no such thing as **"enough"** data. Some knowledge beyond the data is needed to generalize beyond the data. Another way to put it is: "No learner can beat random guessing over all possible functions." But instead of hard-coding assumptions, learners should allow assumptions to be explicitly stated, varied and incorporated automatically into the model.
### 4. Overfitting Has Many Faces
One way to interpret overfitting is to break down generalization error into two components: bias and variance. **Bias** is the tendency of the learner to constantly learn the same wrong thing (in the image, a high bias would mean more distance from the centre). **Variance** is the tendency to learn random things irrespective of the signal (in the image, a high variance would mean more scattered points).

A more powerful learner (one that can learn many models) need not be better than a less powerful one as they can have a high variance. While noise is not the only reason for overfitting, it can indeed aggravate the problem. Some tools against overfitting are - **cross-validation**, **regularization**, **statistical significance testing**, etc.
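As a sketch of the cross-validation tool mentioned above, a minimal k-fold index splitter (plain Python, illustrative):

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists: each point is held out exactly once,
    so evaluation data stays separate from training data."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size
```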
### 5. Intuition Fails In High Dimensions
Generalizing correctly becomes exponentially harder as dimensionality (the number of features) becomes large. Machine learning algorithms depend on similarity-based reasoning, which breaks down in high dimensions as a fixed-size training set covers only a small fraction of the large input space. Moreover, our intuitions from three-dimensional space often do not apply to higher-dimensional spaces. So the **curse of dimensionality** may outweigh the benefits of having more features. Though, in most cases, learners benefit from the **blessing of non-uniformity** as data points are concentrated in lower-dimensional manifolds. Learners can implicitly take advantage of this lower effective dimension or use dimensionality reduction techniques.
### 6. Theoretical Guarantees Are Not What They Seem
A common type of bound in machine learning relates to the number of samples needed to ensure good generalization. But these bounds are very loose in nature. Moreover, such a bound only says that given a large enough training dataset, the learner would, with high probability, either return a good hypothesis or fail to find a consistent one. It does not tell us anything about how to select a good hypothesis space.
Another common type of bound is the asymptotic bound which says "given infinite data, the learner is guaranteed to output correct classifier". But in practice we never have infinite data and data alone is not enough (see point 3). So theoretical guarantees should be used to understand and drive the algorithm design and not as the only criteria to select algorithm.
### 7. Feature Engineering Is The Key
Machine Learning is an iterative process where we train the learner, analyze the results, modify the learner/data and repeat. Feature engineering is a crucial step in this pipeline. Having the right kind of features (independent features that correlate well with the class) makes learning easier. But feature engineering is also difficult because it requires domain specific knowledge which extends beyond just the data at hand (see point 3).
### 8. More Data Beats A Clever Algorithm
As a rule of thumb, a dumb algorithm with lots of data beats a clever algorithm with a modest amount of data. But more data means more scalability issues. Fixed size learners (parametric learners) can take advantage of data only to an extent beyond which adding more data does not improve the results. Variable size learners (non-parametric learners) can, in theory, learn any function given sufficient amount of data. Of course, even non-parametric learners are bound by limitations of memory and computational power.
### 9. Learn Many Models, Not Just One
In early days of machine learning, the model/learner to be trained was pre-determined and the focus was on tuning it for optimal performance. Then the focus shifted to trying many variants of different learners. Now the focus is on combining the various variants of different algorithms to generate the most optimal results. Such model ensembling techniques include *bagging*, *boosting* and *stacking*.
### 10. Simplicity Does Not Imply Accuracy
Though Occam's razor suggests that machine learning models should be kept simple, there is no necessary connection between the number of parameters of a model and its tendency to overfit. The complexity of a model can be related to the size of the hypothesis space, as smaller spaces allow hypotheses to be represented by smaller, simpler codes. But there is another side to this picture: a learner with a larger hypothesis space that tries fewer hypotheses is less likely to overfit than one that tries more hypotheses from a smaller space. So hypothesis space size is just a rough guide towards accuracy. Domingos concludes in his [other paper](http://homes.cs.washington.edu/~pedrod/papers/dmkd99.pdf) that "simpler hypotheses should be preferred because simplicity is a virtue in its own right, not because of a hypothetical connection with accuracy."
### 11. Representation Does Not Imply Learnable
Just because a function can be represented does not mean that the function can actually be learnt. Restrictions imposed by data, time and memory limit the functions that can be learnt in a feasible manner. For example, decision tree learners cannot learn trees with more leaves than the number of training data points. The right question to ask is "whether a function can be learnt" and not "whether a function can be represented".
### 12. Correlation Does Not Imply Causation
Correlation may hint towards a possible cause and effect relationship but that needs to be investigated and validated. On the face of it, correlation can not be taken as proof of causation.
[Hive](http://infolab.stanford.edu/~ragho/hive-icde2010.pdf) is an open-source data warehousing solution built on top of Hadoop. It supports an SQL-like query language called HiveQL. These queries are compiled into MapReduce jobs that are executed on Hadoop. While Hive uses Hadoop for execution of queries, it reduces the effort that goes into writing and maintaining MapReduce jobs.
# Data Model
Hive supports database concepts like tables, columns, rows and partitions. Both primitive (integer, float, string) and complex data types (map, list, struct) are supported. Moreover, these types can be composed to support structures of arbitrary complexity. Tables are serialized/deserialized using default serializers/deserializers. Any new data format and type can be supported by implementing the `SerDe` and `ObjectInspector` Java interfaces.
# Query Language
Hive query language (HiveQL) consists of a subset of SQL along with some extensions. The language is very SQL-like and supports features like subqueries, joins, cartesian product, group by, aggregation, describe and more. MapReduce programs can also be used in Hive queries. A sample query using MapReduce would look like this:
```
FROM (
  MAP inputdata USING 'python mapper.py' AS (word, count)
  FROM inputtable
  CLUSTER BY word
) map_output
REDUCE word, count USING 'python reduce.py';
```
This query uses `mapper.py` for transforming `inputdata` into `(word, count)` pairs, distributes data to reducers by hashing on the `word` column (given by `CLUSTER BY`), and applies `reduce.py` to each group.
Notice that Hive allows the order of `FROM` and `SELECT/MAP/REDUCE` to be changed within a sub-query. This allows insertion of different transformation results into different tables/partitions/hdfs/local directory as part of the same query and reduces the number of scans on the input data.
### Limitations
* Only equi-joins are supported.
* Data cannot be inserted into an existing table/partition; all inserts overwrite the data.
`INSERT INTO`, `UPDATE`, and `DELETE` are not supported, which makes it easier to handle reader and writer concurrency.
# Data Storage
While a table is the logical data unit in Hive, the data is actually stored in hdfs directories. A **table** is stored as a directory in hdfs, a **partition** of a table as a subdirectory within that directory, and a **bucket** as a file within the table/partition directory. Partitions can be created either when creating tables or by using `INSERT/ALTER` statements. The partitioning information is used to prune data when running queries. For example, suppose we create a partition for `day=monday` using the query
```
ALTER TABLE dummytable ADD PARTITION (day='monday')
```
Next, we run the query -
```
SELECT * FROM dummytable WHERE day='monday'
```
Suppose the data in dummytable is stored in `/user/hive/data/dummytable` directory. This query will only scan the subdirectory `/user/hive/data/dummytable/day=monday` within the `dummytable` directory.
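Partition pruning thus boils down to mapping predicates on partition columns to directory paths; a toy sketch (the helper is hypothetical, mirroring the `<key>=<value>` layout above):

```python
def pruned_paths(table_dir, partitions, predicate):
    # Keep only the partition subdirectories whose (key, value) satisfy
    # the query predicate; everything else is never scanned.
    return [f"{table_dir}/{key}={value}"
            for key, value in partitions if predicate(key, value)]
```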
A **bucket** is a file within the leaf directory of a table or a partition. It can be used to prune data when the user runs a `SAMPLE` query.
Any data stored in hdfs can be queried using the `EXTERNAL TABLE` clause by specifying its location with the `LOCATION` clause. When dropping an external table, only its metadata is deleted and not the data itself.
# Serialization/Deserialization
Hive implements `LazySerDe` as the default SerDe. It deserializes rows into internal objects lazily, so the cost of deserializing a column is incurred only when it is needed. Hive also provides a `RegexSerDe`, which allows the use of regular expressions to parse columns out of a row. Hive also supports various formats like `TextInputFormat`, `SequenceFileInputFormat` and `RCFileInputFormat`. Other formats can be implemented and specified in the query itself. For example,
```
CREATE TABLE dummytable(key INT, value STRING)
STORED AS
INPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileOutputFormat'
```
# System Architecture and Components
### Components
* **Metastore** - Stores system catalog and metadata about tables, columns and partitions.
* **Driver** - Manages life cycle of a HiveQL statement as it moves through Hive.
* **Query Compiler** - Compiles query into a directed acyclic graph of MapReduce tasks.
* **Execution Engine** - Executes tasks produced by the compiler in proper dependency order.
* **Hive Server** - Provides a thrift interface and a JDBC/ODBC server.
* **Client components** - CLI, web UI, and JDBC/ODBC driver.
* **Extensibility Interfaces** - Interfaces for SerDe, ObjectInspector, UDF (User Defined Function) and UDAF (User-Defined Aggregate Function).
### Life Cycle of a query
The query is submitted via CLI/web UI/any other interface. This query goes to the compiler and undergoes parse, type-check and semantic analysis phases using the metadata from Metastore. The compiler generates a logical plan which is optimized by the rule-based optimizer and an optimized plan in the form of DAG of MapReduce and hdfs tasks is generated. The execution engine executes these tasks in the correct order using Hadoop.
### Metastore
It stores all information about the tables, their partitions, schemas, columns and their types, etc. The Metastore runs on a traditional RDBMS (so that latency for metadata queries is very small) and uses an open-source ORM layer called DataNucleus. The Metastore is backed up regularly. To make sure that the system scales with the number of queries, no metadata queries are made from the mappers/reducers of a job. Any metadata needed by a mapper or reducer is passed through XML plan files generated by the compiler.
### Query Compiler
Hive Query Compiler works similar to traditional database compilers.
* Antlr is used to generate the Abstract Syntax Tree (AST) of the query.
* A logical plan is created using information from the Metastore. An intermediate representation called the query block (QB) tree is used when transforming the AST to an operator DAG. Nested queries define the parent-child relationship in the QB tree.
* Optimization logic consists of a chain of transformations such that the output of one transformation is the input to the next. Each transformation comprises a **walk** on the operator DAG. Each visited **node** in the DAG is tested against different **rules**; if a rule is satisfied, its corresponding **processor** is invoked. A **Dispatcher** maintains a mapping of rules to their processors and does rule matching. A **GraphWalker** manages the overall traversal process.
* Logical plan generated in the previous step is split into multiple MapReduce and hdfs tasks. Nodes in the plan correspond to physical operators and edges represent the flow of data between operators.
### Optimisations
* **Column Pruning** - Only the columns needed in the query processing are projected.
* **Predicate Pushdown** - Predicates are pushed down to the scan so that rows are filtered as early as possible.
* **Partition Pruning** - Predicates on partitioned columns are used to prune out files of partitions that do not satisfy the predicate.
* **Map Side Joins** - When some tables in a join are very small, they are replicated in all the mappers and joined with the large table there, avoiding the reduce phase.
* **Join Reordering** - Large tables are streamed and not materialized in-memory in the reducer to reduce memory requirements.
Some optimizations are not enabled by default but can be activated by setting certain flags. These include:
* Repartitioning data to handle skew in `GROUP BY` processing. This is achieved by performing the `GROUP BY` in two MapReduce stages: in the first, data is distributed randomly to the reducers and partial aggregates are computed; in the second, these partial aggregates are distributed on the `GROUP BY` columns to different reducers.
* Hash-based partial aggregations in the mappers to reduce the data sent from mappers to reducers, which cuts the time spent sorting and merging the resulting data.
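The two-stage skew-handling `GROUP BY` can be simulated in miniature (illustrative Python, not Hive's implementation):

```python
import random
from collections import Counter

def two_stage_group_by_count(rows, num_reducers=4, seed=0):
    # Stage 1: rows are scattered to random reducers, which compute
    # partial counts, so no single reducer absorbs a skewed key alone.
    rng = random.Random(seed)
    partials = [Counter() for _ in range(num_reducers)]
    for key in rows:
        partials[rng.randrange(num_reducers)][key] += 1
    # Stage 2: partial counts are re-partitioned by key and merged
    # into the final aggregate.
    final = Counter()
    for partial in partials:
        final.update(partial)
    return final
```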
### Execution Engine
The Execution Engine executes the tasks in order of their dependencies. A MapReduce task first serializes its part of the plan into a plan.xml file. This file is then added to the job cache, and mappers and reducers are spawned to execute the relevant sections of the operator DAG. The final results are stored in a temporary location and then moved to the final destination (in the case of, say, an `INSERT INTO` query).
# Future Work
The paper mentions the following areas for improvements:
* HiveQL should subsume SQL syntax.
* Implementing a cost-based optimizer and using adaptive optimization techniques.
* Columnar storage to improve scan performance.
* Enhancing the JDBC/ODBC drivers for Hive to integrate with other BI tools.
* Multi-query optimization and generic n-way join in a single MapReduce job.