Introduction to MongoDB $sampleRate Operator

The $sampleRate operator is a new operator introduced in MongoDB version 4.2. It is used to randomly sample documents from a collection and return a certain percentage of documents. Unlike the $sample operator, the number of documents returned by the $sampleRate operator is dynamically calculated, rather than fixed.

Syntax

The syntax of the $sampleRate operator is as follows:

{ $sampleRate: { size: <number> } }

Here, size is the proportion of documents to return, with a range of values from 0 to 1.

Use Cases

The $sampleRate operator is commonly used in the following scenarios:

  • Random sampling of large data sets to quickly check document quality or understand data distribution.
  • Selecting a small subset of documents from the original data set for testing or validation.

Example

Assume we have a collection called users with the following documents:

{ "_id" : ObjectId("61f3b69c3f6d7f1c8d139e01"), "name" : "Alice", "age" : 20 }
{ "_id" : ObjectId("61f3b6a03f6d7f1c8d139e02"), "name" : "Bob", "age" : 30 }
{ "_id" : ObjectId("61f3b6a43f6d7f1c8d139e03"), "name" : "Charlie", "age" : 25 }
{ "_id" : ObjectId("61f3b6aa3f6d7f1c8d139e04"), "name" : "David", "age" : 35 }
{ "_id" : ObjectId("61f3b6af3f6d7f1c8d139e05"), "name" : "Eva", "age" : 40 }

We can use the $sampleRate operator to randomly select a certain proportion of documents from this collection. For example, the following query will randomly return 50% of the documents:

db.users.aggregate([{ $sampleRate: { size: 0.5 } }])

Running the above query may return the following result:

{ "_id" : ObjectId("61f3b6af3f6d7f1c8d139e05"), "name" : "Eva", "age" : 40 }
{ "_id" : ObjectId("61f3b6a43f6d7f1c8d139e03"), "name" : "Charlie", "age" : 25 }

Note that since the number of documents returned by the $sampleRate operator is dynamically calculated, running this query may return different results each time.

Conclusion

In summary, the MongoDB $sampleRate operator can control the sample proportion of the result set when performing queries to efficiently handle large data sets. The operator can be used in many scenarios such as data sampling, testing, and performance optimization. When using the $sampleRate operator, it is important to consider the reasonableness of the sampling proportion to meet the needs of data analysis while avoiding unnecessary performance impacts. In practice, the sampling proportion can be continuously adjusted according to the actual situation to achieve the best results.