Translating SQL to MapReduce

Translating SQL to MapReduce

This section provides information on how to translate SQL to a MapReduce environment.

If you have existing SQL queries and need materialized reductions, you can follow the guidelines on this page for translating SQL queries to MapReduce views.

Important: If you don't need materialized reductions, we recommend that you reformulate your SQL queries to use N1QL rather than MapReduce views.

Here's an example of a SQL statement that you might want to translate:

SELECT fieldlist FROM table
    WHERE condition
    GROUP BY groupfield
    ORDER BY orderfield
    LIMIT limitcount OFFSET offsetcount

The different elements within the SQL statement affect how a view is written. The following table describes how each clause is handled in a MapReduce view:

SQL clause MapReduce
SELECT fieldlist The field list within the SQL statement affects either the corresponding key or value within the map() function, depending on whether you are also selecting or reducing your data.
FROM table There are no table compartments within Couchbase Server and you cannot perform views across more than one bucket boundary. However, if you are using a type field within your documents to identify different record types, then you may want to use the map() function to make a selection.
WHERE condition The map() function and the data generated into the view key directly affect how you can query, and therefore how selection of records takes place.
GROUP BY groupfield Grouping within SQL is handled within views through the use of the reduce() function.
ORDER BY orderfield The order of record output within a view is directly controlled by the key specified during the map() function phase of the view generation.
LIMIT limitcount OFFSET offsetcount There are a number of different paging strategies available within the MapReduce and views mechanism.

View map() function, reduce() function, selection parameters and other miscellaneous parameters have the following interactions:

SQL clause View key View value map() function reduce() function Selection parameters Other parameters
SELECT fields Yes Yes Yes No: with GROUP BY and SUM() or COUNT() functions only No No
FROM table No No Yes No No No
WHERE clause Yes No Yes No Yes No
ORDER BY field Yes No Yes No No descending
LIMIT x OFFSET y No No No No No limit, skip
GROUP BY field Yes Yes Yes Yes No No

Within SQL, the basic query structure can be used for a multitude of different queries. For example, the structure SELECT fieldlist FROM table WHERE xxxx can be used with a number of different clauses.

Within MapReduce and Couchbase Server, you might need to create multiple views to handle different types of queries. For example, performing a query on all the blog posts on a specific date needs a very different view definition than one needed to support selection by the author.

Translating SQL SELECT to MapReduce

The field selection within an SQL query can be translated into a corresponding view definition by:

  • Adding the fields to the emitted key if the value is also used for selection in a WHERE clause.
  • Into the emitted value if the data is separate from the required query parameters.

For example, to get the sales data by country from each stored document using the following map() function:

function(doc, meta) {
  emit([doc.city, doc.sales], null);
}

If you want to output information that can be used within a reduce function, this should be specified in the value generated by each emit() call. For example, to reduce the sales figures the above map() function could be rewritten as:

function(doc, meta) {
  emit(doc.city, doc.sales);
}

In essence this does not produce significantly different output (albeit with a simplified key), but the information can now be reduced using the numerical value.

If you want to output data or field values completely separate to the query values, then these fields can be explicitly output within the value portion of the view. For example:

function(doc, meta) {
  emit(doc.city, [doc.name, doc.sales]);
}

If the entire document for each item is required, load the document data after the view has been requested through the client library. For more information on this parameter and the performance impact.

Within a SELECT statement it is common practice to include the primary key for a given record in the output. Within a view this is not normally required, since the document ID that generated each row is always included within the view output.

Translating SQL WHERE to MapReduce

The WHERE clause within an SQL statement forms the selection criteria for choosing individual records. Within a view, the ability to query the data is controlled by the content and structure of the key generated by the map() function.

In general, for each WHERE clause you need to include the corresponding field in the key of the generated view, and then use the key, keys or startkey / endkey combinations to indicate the data you want to select. The complexity occurs when you need to perform queries on multiple fields. There are a number of different strategies that you can use for this.

The simplest way is to decide whether you want to be able to select a specific combination, or whether you want to perform range or multiple selections. For example, using our recipe database, if you want to select recipes that use the ingredient ‘carrot’ and have a cooking time of exactly 20 minutes, then you can specify these two fields in the map() function:

function(doc, meta)
{
  if (doc.ingredients)
  {
    for(i=0; i < doc.ingredients.length; i++)
    {
      emit([doc.ingredients[i].ingredient, doc.totaltime], null);
    }
  }
}

Then the query is an array of the two selection values:

?key=["carrot",20]

This is equivalent to the SQL query:

SELECT recipeid FROM recipe JOIN ingredients on ingredients.recipeid = recipe.recipeid
    WHERE ingredient = 'carrot' AND totaltime = 20

If, however, you want to perform a query that selects recipes containing carrots that can be prepared in less than 20 minutes, a range query is possible with the same map() function:

?startkey=["carrot",0]&endkey=["carrot",20]

This works because of the sorting mechanism in a view, which outputs the information sequentially with carrots first and a sequential number.

More complex queries though are more difficult. What if you want to select recipes with carrots and rice, still preparable in under 20 minutes?

A standard map() function like that above wont work. A range query on both ingredients will list all the ingredients between the two. There are a number of solutions available to you. First, the easiest way to handle the timing selection is to create a view that explicitly selects recipes prepared within the specified time. Here's an example:

function(doc, meta)
{
  if (doc.totaltime <= 20)
  {
    ...
  }
}

Although this approach seems to severely limit your queries, remember you can create multiple views, so you could create one for 10 minutes, one for 20 minutes, one for 30 minutes, or whatever intervals you select. It’s unlikely that anyone will really want to select recipes that can be prepared in 17 minutes, so such granular selection is overkill.

The multiple ingredients is more difficult to solve. One way is to use the client to perform two queries and merge the data. For example, the map() function:

function(doc, meta)
{
  if (doc.totaltime &amp;&amp; doc.totaltime <= 20)
  {
    if (doc.ingredients)
    {
      for(i=0; i < doc.ingredients.length; i++)
      {
        emit(doc.ingredients[i].ingredient, null);
      }
    }
  }
}

Two queries, one for each ingredient can easily be merged by performing a comparison and count on the document ID output by each view.

The alternative is to output the ingredients twice within a nested loop, like this:

function(doc, meta)
{
  if (doc.totaltime &amp;&amp; doc.totaltime <= 20)
  {
    if (doc.ingredients)
    {
      for (i=0; i < doc.ingredients.length; i++)
      {
        for (j=0; j < doc.ingredients.length; j++)
        {
          emit([doc.ingredients[i].ingredient, doc.ingredients[j].ingredient], null);
        }
      }
    }
  }
}

Now you can perform an explicit query on both ingredients:

?key=["carrot","rice"]

If you really want to support flexible cooking times, then you can also add the cooking time:

function(doc, meta)
{
  if (doc.ingredients)
  {
    for (i=0; i < doc.ingredients.length; i++)
    {
      for (j=0; j < doc.ingredients.length; j++)
      {
        emit([doc.ingredients[i].ingredient, doc.ingredients[j].ingredient, recipe.totaltime], null);
      }
    }
  }
}

And now you can support a ranged query on the cooking time with the two ingredient selection:

?startkey=["carrot","rice",0]&key=["carrot","rice",20]

This would be equivalent to:

SELECT recipeid FROM recipe 
JOIN ingredients i1 ON ingredients.recipeid = recipe.recipeid
JOIN ingredients i2 ON ingredients.recipeid = recipe.recipeid
WHERE (i1.ingredient IN ('carrot',rice')) AND 
(i2.ingredient IN ('carrot',rice')) AND 
(totaltime < 20 AND totaltime > 0)

Translating SQL ORDER BY to MapReduce

The ORDER BY clause within SQL controls the order of the records that are output. Ordering within a view is controlled by the value of the key. However, the key also controls and supports the querying mechanism.

In SELECT statements where there is no explicit WHERE clause, the emitted key can entirely support the sorting you want. For example, to sort by the city and salesman name, the following map() will achieve the required sorting:

function(doc, meta)
{
   emit([doc.city, doc.name], null)
}

If you need to query on a value, and that query specification is part of the order sequence then you can use the format above. For example, if the query basis is city, then you can extract all the records for ‘London’ using the above view and a suitable range query:

?endkey=["London\u0fff"]&startkey=["London"]

However, if you want to query the view by the salesman name, you need to reverse the field order in the emit() statement:

function(doc, meta)
{
   emit([doc.name,doc.city],null)
}

Now you can search for a name while still getting the information in city order.

The order the output can be reversed (equivalent to ORDER BY field DESC ) by using the descending query parameter.

Translating SQL GROUP BY to MapReduce

The GROUP BY parameter within SQL provides summary information for a group of matching records according to the specified fields, often for use with a numeric field for a sum or total value, or count operation.

For example:

SELECT name,city,SUM(sales) FROM sales GROUP BY name,city

This query groups the information by the two fields ‘name’ and ‘city’ and produces a sum total of these values. To translate this into a MapReduce function within Couchbase Server:

  • From the list of selected fields, identify the field used for the calculation. These will need to be exposed within the value emitted by the map() function.

  • Identify the list of fields in the GROUP BY clause. These will need to be output within the key of the map() function.

  • Identify the grouping function, for example SUM() or COUNT(). You will need to use the equivalent built-in function, or a custom function, within the reduce() function of the view.

For example, in the above case, the corresponding map function can be written as map() :

function(doc, meta)
{
   emit([doc.name,doc.city],doc.sales);
}

This outputs the name and city as the key, and the sales as the value. Because the SUM() function is used, the built-in reduce() function _sum can be used.

An example of this MapReduce combination can be seen _sum.

More complex grouping operations may require a custom reduce function.

Translating SQL LIMIT and OFFSET

Within SQL, the LIMIT and OFFSET clauses to a given query are used as a paging mechanism. For example, you might use:

SELECT recipeid,title FROM recipes LIMIT 100

To get the first 100 rows from the database, and then use the OFFSET to get the subsequent groups of records:

SELECT recipeid,title FROM recipes LIMIT 100 OFFSET 100

With Couchbase Server, the limit and skip parameters when supplied to the query provide the same basic functionality:

?limit=100&skip=100

Performance for high values of skip can be affected.