hackernews client

hackerzr

a year ago

An engineer at Coralogix, a full-stack observability platform, recently shared an intriguing solution to translating SQL expressions with null semantics into OpenSearch DSL. This challenge arose while building the DataPrime query language and engine, which needed to maintain backwards compatibility with OpenSearch.

Key points:

The engineer confronted the disparity between SQL's three-valued logic (TRUE, FALSE, NULL) and OpenSearch DSL's binary filter system.

They devised a method to reduce three-valued logic to two-valued logic by considering the context of expressions, such as WHERE clauses.

The solution introduces is_false_or_null() and is_true_or_null() functions to bridge SQL and OpenSearch DSL. Boolean operators are handled by analyzing truth tables and deriving corresponding OpenSearch DSL translations. This approach enables the translation of complex SQL expressions to OpenSearch DSL while maintaining correct null semantics. The team implemented additional optimizations on the intermediate representation to enhance query efficiency.

This solution allows Coralogix users to seamlessly query both Parquet files and OpenSearch using a unified query language. The article offers valuable insights for developers working on query engines or database compatibility layers, demonstrating a creative approach to a common challenge in data querying and observability platforms.

atombender

a year ago

Using OpenSearch/Elasticsearch "exists" is on the face of it a sensible choice. But it has a couple of big pitfalls.

One potential pitfall here is if your schema supports nested objects. For example:

    {
      "user": {
        "name": "bob",
        "age": 20
      }
    }

A condition such as this:

    user != null

would get translated to:

    {"exists": "user"}

But this is a "macro" that OS/ES internally expands at query time:

  {"boolean": {"should": [
    {"exists": "user.name"},
    {"exists": "user.age"}
  ]}

As you can imagine, this can cause an explosion of field names if you have many nested fields. OS/ES has a global limit on the number of boolean clauses a single query can have, and it can cause a query to suddenly fail. Such queries are also very much not performant.

Another pitfall is that empty values do not get indexed. This includes empty objects and empty arrays. Empty values aren't null, yet they will match here because "exists" evaluates to false for them. So "exists" don't work for this edge case.

You have at least two decent choices here. One is to index a special adjacent field of type "boolean" for every field. For any data you index, you also index this metadata field:

    {
      "user": {},
      "user$notNull": true
    }

However, you end up with many fields, and OS/ES performance degrades quite a bit the more fields you have.

Another option is to have a top-level field listing the names of all fields that are empty. So something like:

    {
      "user": {},
      "$notNullFields": ["user"]
    }

Now you can use a boolean match on this special field rather than "exists", and it will be performant.

Of course, the list has to include all parent key paths as well. So for a non-empty object:

    {
      "user": {
        "name": "bob", 
        "age": 20
      }, 
      "$notNullFields": [
        "user", "user.name", "user.age"
      ]
    }

But you have the option of only including empty objects/arrays in this list, and using "exists" for all value types, assuming you know which fields can be which.

jillesvangurp

a year ago

Worth pointing out that what you outline here is not an actual nested object (in the sense of using the nested mapping) but simply field flattening. Nested is a specific thing in opensearch intended to work with lists of sub documents to model one to many relation ships. It's intended to work around the limitation that opensearch does not implement any joins. One big limitation is that child documents need to have the same mapping as their parent. Another is that there's an upper limit on the number of nested documents.

Nested of course further complicates things because you need to know to use the nested query construct as well in the dsl.

So you get something like

  {
    "query": {
        "nested": {
            "path": "users",
            "query": {
                "bool": {
                    "must": [
                        {
                            "exists": {
                                "field": "users"
                            }
                        }
                    ]
                }
            }
        }
    }
  }

(if you map users as nested instead of object)

I'm not sure how they handle this exactly, but Elasticsearch recently added sql support in their until recently proprietary and now somewhat open source again if you can live with AGPL and contributor licenses. Their documentation is a bit light on the exists semantics but I suspect it might have similar issues. There are also some limitations on the use of nested: https://www.elastic.co/guide/en/elasticsearch/reference/curr... along with a few other limitations.

Adding SQL support has been a topic for a long time. It's hard because Lucene simply is not designed to be a relational database. It's optimized for text search. So there are going to be all sorts of issues where you end up shooting yourself into the foot in terms of performance or run into things that it simply does not support at all.

Another product worth mentioning here is cratedb, which started out as a product that built ontop of Elasticsearch to add SQL support. Like Opensearch, they forked when Elastic changed the license. And since then they have changed what it does architecturally. I have no experience with their product though. So, I can't comment on the semantics.

atombender

a year ago

No, I'm not referring to the "nested" mapping type, but nested in the sense of JSON object structures.

The "nested" mapping type, in addition to its limitations, comes with a huge performance hit, and is therefore not usable as a general-purpose way of representing arbitrary nested structures.

wodenokoto

a year ago

Someone meant to write single backticks in their markdown, but somehow got triple backticks. The article is quite difficult to read with so many highlighted words getting their own line.

whalesalad

a year ago

This is also a terrible pattern in general that permeates every corner of the tech blogging world. Stop `doing` this with your `posts` so that every other word is going to be tokenized to stand out from the background text.

If you are going to do this, ensure that it is a very subtle treatment to the text.

damieng

a year ago

I wrote a LINQ provider (.Net) for Elasticsearch. Was quite the challenge dealing with the sorts of quirks and subtle differences between the two. Case sensitivity and the tokenization are also hurdles to overcome but mapping Group by to Elastics aggregates was facets back then) was the toughest part.

Hats off to Coralogix for taking up the challenge in converting SQL.

whalesalad

a year ago

Please update your blog CSS to use a monospace font/typeface for the code blocks.

user

a year ago

[deleted]

9 Comments

user