Full join query not utilizing index

kshastry · June 16, 2021, 6:18pm

Trying to perform Full join on 2 tables.
Table 1 has 1.1 Billion Records
Table 2 has 180 Million Records

Table Structure is,

CREATE TABLE TABLE1(
COLUMN1 VARCHAR(20),
COLUMN2 VARCHAR(20),
COLUMN3 VARCHAR(20),
COLUMN4 VARCHAR(20),
COLUMN5 VARCHAR(20),
COLUMN6 VARCHAR(20),
CREATE_DATE DATE,
KEY (COLUMN1,COLUMN2,COLUMN3) USING CLUSTERED COLUMNSTORE,
SHARD KEY (COLUMN1,COLUMN2));

CREATE TABLE TABLE2(
COLUMN1 VARCHAR(20),
COLUMN2 VARCHAR(20),
COLUMN3 VARCHAR(20),
STAT1 INT,
STAT2 INT,
STAT3 INT,
STAT4 INT,
CREATE_DATE DATE,
KEY (COLUMN1,COLUMN2,COLUMN3) USING CLUSTERED COLUMNSTORE,
SHARD KEY (COLUMN1,COLUMN2));

When this query is run, filter is not done before doing the join,

SELECT
(CASE WHEN t1.COLUMN3 is null then t2.COLUMN3 else t1.COLUMN3) AS PIVOT_1,
SUM(STAT1),
SUM(STAT2)
FROM
TABLE1 t1 FULL JOIN TABLE t2 ON
(t1.COLUMN1 = t2.COLUMN1 and t1.COLUMN2 = t2.COLUMN2)
WHERE
(t1.COLUMN1 = “STRING” or t2.COLUMN1 is null) or (t2.COLUMN1 = “STRING” and t1.COLUMN1 is null) or (t1.COLUMN1 = “STRING” and t2.COLUMN1 = “STRING”)
GROUP BY PIVOT1;

Sample profile screenshot:

Whereas if i manually filter the tables and then do a join, it is very fast. Should this filtering not be done implicitly by the singlestore.

SELECT
(CASE WHEN t1.COLUMN3 is null then t2.COLUMN3 else t1.COLUMN3) AS PIVOT_1,
SUM(STAT1),
SUM(STAT2)
FROM
(SELECT * FROM TABLE1 where t1.COLUMN1 = “STRING”) t1
FULL JOIN (SELECT * FROM TABLE1 where t1.COLUMN1 = “STRING”) t2
ON (t1.COLUMN1 = t2.COLUMN1 and t1.COLUMN2 = t2.COLUMN2)
GROUP BY PIVOT_1;

Sample profile screenshot:

jack · June 17, 2021, 9:08pm

It is not in general correct to move the filters (e.g. t1.COLUMN1 = “STRING”) before the outer join, because that can change the meaning/results of the query. Consider an example of a row in t1 with COLUMN1 = “xyz” that doesn’t match anything in t2. In the original query, this will output a join row with t1.COLUMN1 = “xyz”, t2.COLUMN1 null. But in your second version of the query, this row will be filtered out prior to the join, and the query doesn’t return this row - so the two versions can return different results.

So the database does have to read all the rows of table1, even the ones that don’t have COLUMN1 = “STRING”.

Looking at your profile, it seems like your two versions most likely do return different results - the first version has many more rows than the second. This makes sense because the second version basically is leaving out a bunch of the OUTER results of the join.