Skip to content

Missing DefaultDimensionSpec in Druid query when querying with multiple Lookup dimensions. #716

@upendrareddy

Description

@upendrareddy

When we query Maha with multiple lookup dimensions in the select fields we are seeing some of the fields returned as nulls in the results. The underlying druid query issued did not have these lookups listed in the default dimension specs.

To elaborate further
The issued maha query is of the following format.
Maha Query

{
  "cube": "test_cube",
  "rowsPerPage": 1000,
  "selectFields": [
    {
      "field": "Adunit ID"
    },
   {
      "field": "Adunit Name"  //Lookup based on Adunit ID
    },
    {
      "field": "Adgroup ID"
    },
   {
      "field": "Adgroup Name" //Lookup based on Adgroup ID
    },
    {
      "field": "Adserver Requests"
    }
  ],
  "filterExpressions": [
    {
      "field": "Publisher ID",
      "operator": "=",
      "value": "xxxxxxxxxAAAAAABBBBBBBBBB"
    },
    {
      "field": "Day",
      "operator": "Between",
      "from": "2020-09-17",
      "to": "2020-09-17"
    }
  ]
}

Output

{
  "header": {
    "cube": "test_cube",
    "fields": [
      {
        "fieldName": "Adunit ID",
        "fieldType": "DIM"
      },
      {
        "fieldName": "Adunit Name",
        "fieldType": "DIM"
      },
      {
        "fieldName": "Adgroup ID",
        "fieldType": "DIM"
      },
      {
        "fieldName": "Adgroup Name",
        "fieldType": "DIM"
      },
      {
        "fieldName": "Adserver Requests",
        "fieldType": "FACT"
      }
    ],
    "maxRows": 1000,
    "debug": {}
  },
  "rows": [
    [
      "dddddddddddAdunitI1dddddddddddddd",
      "dddddddddddAdunitI1 Name dddddddddddddd",
      null,   // _Missing adgroup_id_
      "dddddddddddAdgroup1 Name dddddddddddddd",
      0
    ],
    [
      "dddddddddddAdunitId2ddddddddddddd",
      "dddddddddddAdunitId2 Name ddddddddddddd",
      null,   // _Missing adgroup_id_
      "dddddddddddAdgroup2 Name dddddddddddddd",
      1
    ]
  ],
  "curators": {}
}

Druid Query Created by Maha

{
  "queryType": "groupBy",
  "dataSource": {
    "type": "table",
    "name": "test_cube"
  },
  "intervals": {
    "type": "intervals",
    "intervals": [
      "2020-09-17T00:00:00.000Z/2020-09-18T00:00:00.000Z"
    ]
  },
  "virtualColumns": [],
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "or",
        "fields": [
          {
            "type": "selector",
            "dimension": "__time",
            "value": "2020-09-17",
            "extractionFn": {
              "type": "timeFormat",
              "format": "YYYY-MM-dd",
              "timeZone": "UTC",
              "granularity": {
                "type": "none"
              },
              "asMillis": false
            }
          }
        ]
      },
      {
        "type": "selector",
        "dimension": "pubId",
        "value": "xxxxxxxxxAAAAAABBBBBBBBBB"
      }
    ]
  },
  "granularity": {
    "type": "all"
  },
  "dimensions": [     // No adgroup_id added here. 
    {
      "type": "default",
      "dimension": "adunitId",
      "outputName": "Adunit ID",
      "outputType": "STRING"
    },
    {
      "type": "extraction",
      "dimension": "adgroupId",
      "outputName": "Adgroup Name",
      "outputType": "STRING",
      "extractionFn": {
        "type": "registeredLookup",
        "lookup": "adgroup_names",
        "retainMissingValue": false,
        "replaceMissingValueWith": "UNKNOWN"
      }
    },
    {
      "type": "extraction",
      "dimension": "adunitId",
      "outputName": "Adunit Name",
      "outputType": "STRING",
      "extractionFn": {
        "type": "registeredLookup",
        "lookup": "adunit_names",
        "retainMissingValue": false,
        "replaceMissingValueWith": "UNKNOWN"
      }
    }
  ],
  "aggregations": [
    {
      "type": "longSum",
      "name": "Adserver Requests",
      "fieldName": "adserverRequests"
    }
  ],
  "postAggregations": [],
  "limitSpec": {
    "type": "default",
    "columns": [],
    "limit": 10000000
  },
  "context": {
    "groupByStrategy": "v2",
    "applyLimitPushDown": "false",
    "implyUser": "internal_user",
    "priority": 10,
    "userId": "internal_user",
    "uncoveredIntervalsLimit": 1,
    "groupByIsSingleThreaded": true,
    "timeout": 900000,
    "queryId": "9292389f-2a7f-4e12-a39a-6f727097ab92"
  },
  "descending": false
}

Upon debugging further I stumbled upon the variable factRequestCols(Set of Strings) at https://github.com/yahoo/maha/blob/master/core/src/main/scala/com/yahoo/maha/core/query/druid/DruidQueryGenerator.scala#L353 which is being passed on to method at https://github.com/yahoo/maha/blob/master/core/src/main/scala/com/yahoo/maha/core/query/druid/DruidQueryGenerator.scala#L381 where druid queries dimension specs are being created in getDimensions method based on factRequestCols passed.

I am not quite sure I understand the logic of factRequestCols set creation here but adgroup_id dimension is not getting included in the resulting set because of which it is not getting added to Druid query dimension spec as well.

I locally overrode the code by passing queryContext.factBestCandidate.requestCols to getDimensions method at https://github.com/yahoo/maha/blob/master/core/src/main/scala/com/yahoo/maha/core/query/druid/DruidQueryGenerator.scala#L381 and it fixed the issue and started populating the dimension spec in druid query as well as had adgroup_id values in the resulting out.

Output after the change

{
  "header": {
    "cube": "platform_performance_cube",
    "fields": [
      {
        "fieldName": "Adunit ID",
        "fieldType": "DIM"
      },
      {
        "fieldName": "Adunit Name",
        "fieldType": "DIM"
      },
      {
        "fieldName": "Adgroup ID",
        "fieldType": "DIM"
      },
      {
        "fieldName": "Adgroup Name",
        "fieldType": "DIM"
      },
      {
        "fieldName": "Adserver Requests",
        "fieldType": "FACT"
      }
    ],
    "maxRows": 1000,
    "debug": {}
  },
  "rows": [
    [
      "dddddddddddAdunitI1dddddddddddddd",
      "dddddddddddAdunitI1 Name dddddddddddddd",
      "dddddddddddAdgroup1dddddddddddddd",
      "dddddddddddAdgroup1 Name dddddddddddddd",
      0
    ],
    [
      "dddddddddddAdunitId2ddddddddddddd",
      "dddddddddddAdunitId2 Name ddddddddddddd",
      "dddddddddddAdgroup2dddddddddddddd",
      "dddddddddddAdgroup2 Name dddddddddddddd",
      1
    ]
  ],
  "curators": {}
}

Could you please help me with the context of factRequestCols set creation and also let me know if the logic of the factRequestCols set creation or anything else needs to be changed to include the missing dimensions.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions