Introduction to MongoDB $strLenCP Operator

The $strLenCP operator is an aggregation operator in MongoDB that returns the number of UTF-8 code points in a string. Unlike $strLenBytes, which counts bytes, $strLenCP counts each Unicode code point once, no matter how many bytes it occupies in the UTF-8 encoding.
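To make the difference between code points and bytes concrete, here is a minimal sketch in plain JavaScript (runnable in Node.js, no database needed). The helper names codePointLength and utf8ByteLength are made up for this illustration; they only mimic what $strLenCP and $strLenBytes count and are not part of MongoDB.

```javascript
// Count Unicode code points, which is the quantity $strLenCP reports.
function codePointLength(s) {
  // Array.from uses the string iterator, which walks code points
  // rather than UTF-16 code units.
  return Array.from(s).length;
}

// Count UTF-8 encoded bytes, which is the quantity $strLenBytes reports.
function utf8ByteLength(s) {
  return new TextEncoder().encode(s).length;
}

const samples = [
  "cafe",          // ASCII only
  "caf\u00E9",     // "café" with a precomposed é
  "\u5BFF\u53F8",  // two CJK characters
  "\u{1F44D}"      // thumbs-up emoji
];

for (const s of samples) {
  console.log(s, codePointLength(s), utf8ByteLength(s));
}
```

For ASCII-only strings the two counts agree; for the other samples the byte count is larger than the code point count.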

Syntax

The syntax for the $strLenCP operator is as follows:

{ $strLenCP: <expression> }

Here, <expression> can be any expression that resolves to a string, such as a field path, a string literal, or a variable. If the expression resolves to null, refers to a missing field, or resolves to a non-string value, $strLenCP returns an error.
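Since $strLenCP expects a string, documents where the field is null or absent can make a pipeline fail. One possible guard, sketched below, wraps the field in $ifNull so that an empty string is used as a fallback. The collection name comments and the field name content are assumptions for illustration, and this guard does not cover values of other types such as numbers.

```javascript
// Pipeline definition only; pass it to aggregate() in mongosh.
// "content" is an assumed field name used for illustration.
const guardedPipeline = [
  {
    $project: {
      charCount: {
        // $ifNull substitutes "" when content is null or missing,
        // so $strLenCP receives a string in those cases.
        $strLenCP: { $ifNull: ["$content", ""] }
      }
    }
  }
];

// In mongosh (assumed collection name):
// db.comments.aggregate(guardedPipeline)
console.log(JSON.stringify(guardedPipeline, null, 2));
```

With this guard, a document without a content field would get a charCount of 0 instead of stopping the aggregation.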

Use cases

Strings are a common data type in MongoDB, and applications often need to compute, filter, or sort by string length. For UTF-8 encoded strings, however, $strLenBytes does not give the number of characters, because some characters occupy multiple bytes. In these cases, $strLenCP is the right operator to use.

For example, suppose we have a collection that stores comments containing emoji. We want to count the number of characters in each comment so that the comments can be filtered and sorted by length.
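One caveat for emoji is worth illustrating: $strLenCP counts code points, not the symbols a reader sees. An emoji built from several code points joined by zero-width joiners therefore counts as more than one. The sketch below uses plain JavaScript to show the code point counts; countCodePoints is a hypothetical helper that mirrors the counting rule and is not a MongoDB function.

```javascript
// Code point count, following the same counting rule as $strLenCP.
const countCodePoints = (s) => Array.from(s).length;

// "Nice " followed by a thumbs-up emoji (a single code point).
const thumbsUp = "Nice \u{1F44D}";

// Family emoji: man + ZWJ + woman + ZWJ + girl, shown as one symbol.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";

console.log(countCodePoints(thumbsUp)); // 6
console.log(countCodePoints(family));   // 5
```

If comments need to be limited by what users perceive as characters, the code point count should be treated as an approximation.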

Examples

Example 1

Suppose we have a collection called comments that stores some comments. Each comment has two fields: _id represents the unique identifier of the comment, and content represents the content of the comment.

Now, we want to calculate the number of characters in each comment and sort them in descending order by the number of characters. The following aggregation pipeline can be used:

db.comments.aggregate([
  {
    $project: {
      _id: 1,
      content: 1,
      charCount: { $strLenCP: "$content" }
    }
  },
  {
    $sort: { charCount: -1 }
  }
])

In the above aggregation pipeline, the $project stage calculates the number of characters in each comment and stores the result in a new field called charCount. The $sort stage then sorts the comments in descending order by that count.

Next, let’s look at another example. Suppose we have a collection called users that stores some user information. Each user has two fields: _id represents the unique identifier of the user, and name represents the name of the user.

Now, we want to query all users whose name is at least 4 characters long. The following aggregation pipeline can be used:

db.users.aggregate([
  {
    $match: {
      $expr: { $gte: [{ $strLenCP: "$name" }, 4] }
    }
  }
])

In the above aggregation pipeline, the $match stage keeps only the users whose name is at least 4 characters long. The $expr operator allows aggregation expressions inside $match: $strLenCP computes the number of characters in the name field, and $gte checks that the count is greater than or equal to 4. Documents that pass the check are returned unchanged, since this pipeline has no $project stage.

Example 2

Suppose we have the following documents:

{ "_id": 1, "name": "John" }
{ "_id": 2, "name": "Jane" }
{ "_id": 3, "name": "Mike" }
{ "_id": 4, "name": "Lily" }

We can use the following aggregation pipeline:

db.users.aggregate([
  {
    $match: {
      $expr: {
        $gte: [{ $strLenCP: "$name" }, 4]
      }
    }
  },
  {
    $project: {
      name: 1,
      name_length: { $strLenCP: "$name" }
    }
  }
])

This aggregation pipeline will return the following result:

{ "_id": 1, "name": "John", "name_length": 4 }
{ "_id": 2, "name": "Jane", "name_length": 4 }
{ "_id": 3, "name": "Mike", "name_length": 4 }
{ "_id": 4, "name": "Lily", "name_length": 4 }

In this example, the $match stage uses $strLenCP to get the number of characters in the name field of each document and compares it with 4. Because every name in the sample data is exactly 4 characters long, all four documents pass the filter. The $project stage then returns the name field together with a computed name_length field.

Example 3

This example combines $strLenCP with $split and $arrayElemAt to measure the first and last names separately.

Suppose we have the following documents:

{ "_id": 1, "name": "John Doe" }
{ "_id": 2, "name": "Jane Smith" }
{ "_id": 3, "name": "Mike Johnson" }
{ "_id": 4, "name": "Lily Wang" }

We can use the following aggregation pipeline:

db.users.aggregate([
  {
    $project: {
      name: 1,
      first_name_length: {
        $strLenCP: { $arrayElemAt: [{ $split: ["$name", " "] }, 0] }
      },
      last_name_length: {
        $strLenCP: { $arrayElemAt: [{ $split: ["$name", " "] }, 1] }
      }
    }
  }
])

This aggregation pipeline will return the following result:

{ "_id": 1, "name": "John Doe", "first_name_length": 4, "last_name_length": 3 }
{ "_id": 2, "name": "Jane Smith", "first_name_length": 4, "last_name_length": 5 }
{ "_id": 3, "name": "Mike Johnson", "first_name_length": 4, "last_name_length": 7 }
{ "_id": 4, "name": "Lily Wang", "first_name_length": 4, "last_name_length": 4 }
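The expected values above can be cross-checked outside the database. The following is a rough JavaScript mirror of the pipeline logic, assuming names are separated by a single space; nameLengths is a hypothetical helper written for this check, not a MongoDB API. It also highlights an edge case: for a one-word name there is no element at index 1, so the $strLenCP call for the last name should be expected to fail in the real pipeline.

```javascript
// Plain JavaScript mirror of the pipeline, for checking expected values only.
function nameLengths(name) {
  const parts = name.split(" ");               // like $split: ["$name", " "]
  const cp = (s) => Array.from(s).length;      // like $strLenCP
  return {
    first_name_length: cp(parts[0]),           // like $arrayElemAt index 0
    // A one-word name has no second part; null marks that case here.
    last_name_length: parts.length > 1 ? cp(parts[1]) : null
  };
}

for (const name of ["John Doe", "Jane Smith", "Mike Johnson", "Lily Wang"]) {
  console.log(name, nameLengths(name));
}
```

If the collection may contain single-word names, the last name expression would need a guard before $strLenCP is applied.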

Conclusion

The $strLenCP operator is a string length aggregation operator in MongoDB that returns the number of UTF-8 code points in a string. Unlike the $strLenBytes operator, $strLenCP counts Unicode characters rather than bytes, so for a string that contains non-ASCII characters, such as Chinese characters, the length it returns may be smaller than the number of bytes in the string. In practice, $strLenCP can be used in $project, $match (through $expr), and other stages to compute, filter, and sort by string length.