Translating SQL to map/reduce

Translating SQL to map/reduce

This section provides information on how to translate SQL to a map/reduce environment.

SELECT fieldlist FROM table \ WHERE condition \ GROUP BY groupfield \ ORDER BY orderfield \ LIMIT limitcount OFFSET offsetcount 

The different elements within the source statement affect how a view is written in the following ways:

  • SELECT fieldlist

The field list within the SQL statement affects either the corresponding key or value within the map() function, depending on whether you are also selecting or reducing your data.

  • FROM table

There are no table compartments within Couchbase Server, and you cannot perform views across more than one bucket boundary. However, if you are using a type field within your documents to identify different record types, then you may want to use the map() function to make a selection.

  • WHERE condition

The map() function and the data generated into the view key directly affect how you can query, and therefore how the selection of records takes place.

  • ORDER BY orderfield

The key specified during the map() function phase of the view generation directly controls the order of record output within a view.

  • LIMIT limitcount OFFSET offsetcount

There are some different paging strategies available within the map/reduce and views mechanism.

  • GROUP BY groupfield

Grouping within SQL is handled within views through the use of the reduce() function.

The interaction between the view map() function, reduce() function, selection parameters and other miscellaneous parameters according to the table below:

SQL Statement Fragment View Key View Value map() Function reduce() Function Selection Parameters Other Parameters
SELECT fields Yes Yes Yes No: with GROUP BY and SUM() or COUNT() functions only No No
FROM table No No Yes No No No
WHERE clause Yes No Yes No Yes No
ORDER BY field Yes No Yes No No descending
LIMIT x OFFSET y No No No No No limit , skip
GROUP BY field Yes Yes Yes Yes No No

Within SQL, the basic query structure can be used for a multitude of different queries. For example, the same SELECT fieldlist FROM table WHERE xxxx can be used with different clauses.

Within map/reduce and Couchbase Server, multiple views may be needed to be created to handled different query types. For example, performing a query on all the blog posts on a specific date will need a very different view definition than one needed to support selection by the author.

Translating SQL SELECT to map/reduce

The field selection within an SQL query can be translated into a corresponding view definition. Add the fields to the emitted key if the value is also used for selection in a WHERE clause, or into the emitted value if the data is separate from the required query parameters.

For example, to get the sales data by country from each stored document using the following map() function:

function(doc, meta) { emit([doc.city, doc.sales], null); } 

If you want to output information and use it within a reduce function, specify it in the value generated by each emit() call. For example, to reduce the sales figures, re-write the above map() function:

function(doc, meta) { emit(doc.city, doc.sales); } 

The output is not significantly different albeit with a simplified key, but the information can now be reduced using the numerical value.

If you want to output data or field values separate to the query values, then these fields can be explicitly output within the value portion of the view. For example:

function(doc, meta) { emit(doc.city, [doc.name, doc.sales]); } 

If the entire document for each item is required, load the document data after the view has been requested through the client library.

Within a SELECT statement, it is common practice to include the primary key for a given record in the output. Within a view, this is not normally required since the document ID that generated each row is always included within the view output.

Translating SQL WHERE to map/reduce

The WHERE clause within an SQL statement forms the selection criteria for choosing individual records. Within a view, the ability to query the data is controlled by the content and structure of the key generated by the map() function.

In general, for each WHERE clause you need to include the corresponding field in the key of the generated view. Use the key , keys or startkey / endkey combinations to indicate the data you want to select. The complexity occurs when you need to perform queries on multiple fields. There are different strategies that you can use for this.

The simplest way is to decide whether you want to be able to select a specific combination, or whether you want to perform range or multiple selections. For example, if you want to select recipes that use the ingredient ‘carrot’ and have a cooking time of exactly 20 minutes, you can specify these two fields in the map() function:

function(doc, meta) { if (doc.ingredients) { for(i=0; i < doc.ingredients.length; i++) { emit([doc.ingredients[i].ingredient, doc.totaltime], null); } } } 

Then the query is an array of the two selection values:

?key=["carrot",20] 

The example is equivalent to the following SQL query:

SELECT recipeid FROM recipe JOIN ingredients on ingredients.recipeid = recipe.recipeid WHERE ingredient = 'carrot' AND totaltime = 20 

If, however, you want to perform a query that selects recipes containing carrots that you can prepare in less than 20 minutes, a range query is possible with the same map() function:

?startkey=["carrot",0]&endkey=["carrot",20] 
The above example works due to the sorting mechanism in a view that outputs in the information sequentially, fortunately nicely sorted with carrots first and a sequential number.

More complex queries though are more difficult. What if you want to select recipes with carrots and rice, still preparable in under 20 minutes?

A standard map() function like that above won't work. A range query on both ingredients will list all the ingredients between the two. There are a number of solutions available to you. First, the easiest way to handle the timing selection is to create a view that explicitly selects recipes prepared within the specified time. I.E:

function(doc, meta) { if (doc.totaltime <= 20) { ... } } 

Although this approach seems to limit your queries severely, remember you can create multiple views, so you could create one for 10 mins, one for 20, one for 30, or whatever intervals you select. It’s unlikely that anyone will want to select recipes that can be prepared in 17 minutes, so such granular selection is overkill.

The multiple ingredients are more difficult to solve. One way is to use the client to perform two queries and merge the data. For example, the map() function:

function(doc, meta)
{
  if (doc.totaltime && doc.totaltime <= 20)
  {
    if (doc.ingredients)
    {
      for(i=0; i < doc.ingredients.length; i++)
      {
        emit(doc.ingredients[i].ingredient, null);
      }
    }
  }
}

Two queries, one for each ingredient can easily be merged by performing a comparison and count on the document ID output by each view.

The alternative is to output the ingredients twice within a nested loop, like this:

function(doc, meta)
{
  if (doc.totaltime && doc.totaltime <= 20)
  {
    if (doc.ingredients)
    {
      for (i=0; i < doc.ingredients.length; i++)
      {
        for (j=0; j < doc.ingredients.length; j++)
        {
          emit([doc.ingredients[i].ingredient, doc.ingredients[j].ingredient], null);
        }
      }
    }
  }
}

Now you can perform an explicit query on both ingredients:

?key=["carrot","rice"]

If you really want to support flexible cooking times, then you can also add the cooking time:

function(doc, meta) { if (doc.ingredients) { for (i=0; i < doc.ingredients.length; i++) { for (j=0; j < doc.ingredients.length; j++) { emit([doc.ingredients[i].ingredient, doc.ingredients[j].ingredient, recipe.totaltime], null); } } } } 

And now you can support a ranged query on the cooking time with the two ingredient selection:

?startkey=["carrot","rice",0]&key=["carrot","rice",20] 

This would be equivalent to:

SELECT recipeid FROM recipe JOIN ingredients on ingredients.recipeid = recipe.recipeid WHERE (ingredient = 'carrot' OR ingredient = 'rice') AND totaltime = 20 

Translating SQL ORDER BY to map/reduce

The ORDER BY clause within SQL controls the order of the records that are output. Ordering within a view is controlled by the value of the key. However, the key also controls and supports the querying mechanism.

In SELECT statements where there is no explicit WHERE clause, the emitted key can entirely support the sorting you want. For example, to sort by the city and salesman name, the following map() will achieve the required sorting:

function(doc, meta) { emit([doc.city, doc.name], null) } 

If you need to query on a value, and that query specification is part of the order sequence then you can use the format above. For example, if the query basis is city, then you can extract all the records for ‘London’ using the above view and a suitable range query:

?endkey=["London\u0fff"]&startkey=["London"] 

However, if you want to query the view by the salesman name, you need to reverse the field order in the emit() statement:

function(doc, meta) { emit([doc.name,doc.city],null) } 

Now you can search for a name while still getting the information in city order.

The order the output can be reversed (equivalent to ORDER BY field DESC ) by using the descending query parameter.

Translating SQL GROUP BY to map/reduce

The GROUP BY parameter within SQL provides summary information for a group of matching records according to the specified fields, often for use with a numeric field for a sum or total value, or count operation.

For example:

SELECT name,city,SUM(sales) FROM sales GROUP BY name,city 

This query groups the information by the two fields ‘name’ and ‘city’ and produces a sum total of these values. To translate this into a map/reduce function within Couchbase Server:

  • From the list of selected fields, identify the field used for the calculation. These will need to be exposed within the value emitted by the map() function.

  • Identify the list of fields in the GROUP BY clause. These will need to be output within the key of the map() function.

  • Identify the grouping function, for example SUM() or COUNT() . You will need to use the equivalent built-in function, or a custom function, within the reduce() function of the view.

For example, in the above case, the corresponding map function can be written as map() :

function(doc, meta) { emit([doc.name,doc.city],doc.sales); } 

This outputs the name and city as the key, and the sales as the value. Because the SUM() function is used, the built-in reduce() function _sum can be used.

An example of this map/reduce combination can be seen _sum .

More complex grouping operations may require a custom reduce function.

Translating SQL LIMIT and OFFSET

Within SQL, the LIMIT and OFFSET clauses to a given query are used as a paging mechanism. For example, you might use:

SELECT recipeid,title FROM recipes LIMIT 100 

To get the first 100 rows from the database, and then use the OFFSET to get the subsequent groups of records:

SELECT recipeid,title FROM recipes LIMIT 100 OFFSET 100 

With Couchbase Server, the limit and skip parameters when supplied to the query provide the same basic functionality:

?limit=100&skip=100 

Performance for high values of skip can be affected.