Introduction to MongoDB $indexOfCP Operator

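$indexOfCP is a string aggregation operator in MongoDB that searches a string for a substring and returns the zero-based index of the first occurrence, or -1 if the substring is not found. “CP” stands for Unicode code point: the index counts UTF-8 code points rather than bytes, so the result stays correct for strings that contain multi-byte characters.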

Syntax

The syntax for the $indexOfCP operator is as follows:

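{ $indexOfCP: [ <string-expression>, <substring-expression>, <start-index>, <end-index> ] }

The operands are passed as an array. Here, <string-expression> is a string expression that specifies the string to search, and <substring-expression> is a string expression that specifies the substring to find. <start-index> and <end-index> are optional non-negative integers, measured in code points, that bound the search. If <start-index> is not specified, the search starts from the first code point of the string (index 0). If you specify <end-index>, you should also specify <start-index>. The operator returns the zero-based index of the first occurrence, -1 if the substring is not found, and null if <string-expression> resolves to null or refers to a missing field.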
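To make the role of the optional start index concrete, here is a minimal sketch in plain JavaScript. The helper name indexOfCodePoint is hypothetical and this is only an emulation of the documented behavior for string inputs, not MongoDB's implementation; it leaves out the end index and null handling.

```javascript
// Illustrative emulation only: NOT MongoDB's implementation.
// indexOfCodePoint is a hypothetical helper that mimics how $indexOfCP
// counts positions in code points and honors an optional start index.
function indexOfCodePoint(str, sub, start = 0) {
  if (!Number.isInteger(start) || start < 0) {
    throw new RangeError("start must be a non-negative integer");
  }
  const haystack = Array.from(str); // one array entry per Unicode code point
  const needle = Array.from(sub);
  for (let i = start; i + needle.length <= haystack.length; i++) {
    if (needle.every((cp, j) => haystack[i + j] === cp)) {
      return i; // zero-based code point index of the first match
    }
  }
  return -1; // no match at or after the start index
}

console.log(indexOfCodePoint("foo.bar.fi", "."));     // 3
console.log(indexOfCodePoint("foo.bar.fi", ".", 5));  // 7
console.log(indexOfCodePoint("vanilla", "ll", 12));   // -1
```

With a start index of 5, the first dot at position 3 is skipped and the second one at position 7 is reported; a start index past the end of the string yields -1.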

Use Cases

The $indexOfCP operator is useful when the position of a substring must be counted in code points rather than bytes. MongoDB stores strings as UTF-8, in which a non-ASCII character occupies two to four bytes. The related $indexOfBytes operator returns a byte offset, so for strings that contain non-ASCII characters its result differs from the character position. If you need positions in strings that may contain such characters, you should use the $indexOfCP operator.
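The gap between byte offsets and code point offsets can be seen without a database. The following is a rough sketch in plain JavaScript, not MongoDB code; the helper name offsets is hypothetical. It reports both ways of counting the position of the same match, where the code point count corresponds to what $indexOfCP measures and the byte count to what the byte-based $indexOfBytes operator measures.

```javascript
// Plain JavaScript illustration (not MongoDB code) of why byte offsets and
// code point offsets diverge once a string contains multi-byte characters.
// offsets is a hypothetical helper introduced only for this example.
function offsets(text, target) {
  const unitIndex = text.indexOf(target); // UTF-16 code unit index
  if (unitIndex === -1) return null;      // target does not occur
  const prefix = text.slice(0, unitIndex);
  return {
    codePoints: Array.from(prefix).length,          // code points before the match
    bytes: new TextEncoder().encode(prefix).length  // UTF-8 bytes before the match
  };
}

// "cafétéria", written with escapes so that é is the single code point U+00E9.
const result = offsets("caf\u00e9t\u00e9ria", "t");
console.log(result); // { codePoints: 4, bytes: 5 }
```

The letter é takes two bytes in UTF-8, so every match after it sits one position further along when counted in bytes.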

Examples

Suppose you have a collection of documents with a string field named text, which may contain Unicode characters, and you want to find all the documents whose text field contains the substring “text” and return the starting position of the substring within each one. You can use the following aggregation pipeline:

db.collection.aggregate([
  {
    $match: {
      $expr: {
        $gt: [{ $indexOfCP: ["$text", "text"] }, -1]
      }
    }
  },
  {
    $project: {
      text: 1,
      index: { $indexOfCP: ["$text", "text"] }
    }
  }
])

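In the example above, the $match stage uses $expr to keep only the documents whose text field contains the substring “text”: $indexOfCP returns -1 when there is no match, so the comparison with $gt passes only for documents where the substring was found. The $project stage then uses the $indexOfCP operator again to return the starting position of the substring within each document.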

Suppose you have the following documents in the collection:

{ "_id": 1, "text": "This is a paragraph of text" }
{ "_id": 2, "text": "This is another paragraph of text" }
{ "_id": 3, "text": "This text contains some text" }
{ "_id": 4, "text": "This text does not contain the target text" }

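After running the aggregation pipeline above, you get the following results. All four documents are returned, because each of them contains the substring "text":

{ "_id": 1, "text": "This is a paragraph of text", "index": 23 }
{ "_id": 2, "text": "This is another paragraph of text", "index": 29 }
{ "_id": 3, "text": "This text contains some text", "index": 5 }
{ "_id": 4, "text": "This text does not contain the target text", "index": 5 }

Here, "_id" represents the ID of the document, "text" represents the string in the document, and "index" represents the zero-based starting position of the first occurrence of the substring "text" in the string. Documents 3 and 4 contain the substring more than once, but only the first occurrence is reported.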

Conclusion

In conclusion, the $indexOfCP operator in MongoDB finds a substring within a string and returns the zero-based starting position of its first occurrence, or -1 if there is none. Because it counts code points rather than bytes, the position remains correct for strings that contain non-ASCII Unicode characters. Note that the comparison is case-sensitive: if the case of the substring differs from the case used in the string, the operator reports no match and returns -1.
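If the case-sensitive matching noted above is a problem, one possible workaround is to lowercase the field with $toLower before searching. The pipeline below is a sketch under that assumption; the variable name caseInsensitivePipeline and the collection name are placeholders, and MongoDB documents $toLower as having well-defined behavior only for ASCII characters.

```javascript
// Sketch: lowercase the field before searching so that "Text" and "TEXT"
// are also found. The search string must itself be given in lowercase.
// Caveat: $toLower is documented as well-defined for ASCII characters only.
const caseInsensitivePipeline = [
  {
    $project: {
      text: 1,
      index: { $indexOfCP: [{ $toLower: "$text" }, "text"] }
    }
  }
];

// In mongosh, against a placeholder collection named "collection":
// db.collection.aggregate(caseInsensitivePipeline)
```

Lowercasing does not change the number of code points for ASCII text, so the returned index still refers to the position in the original string.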