I got caught out the other day with a sensible question of representing a sql join in a set-theoretic fashion. I fumbled it, mixing two joins. I needed to go back and review! But, as I went away and thought about it, I also remembered that there are some serious assumptions in running this sort of conceptual argument that need to be explicitly stated or chaos ensues and cats and dogs will be living together. There are also some things that are not true about the sql representation in a set-theoretic manner, but rather fall into relational algebra. That said, there are advantages to be gained by *understanding* the assumptions and then using it as a simple conceptual tool to understand what you are trying to achieve.

# Set theory

So, lets start with the set theory first, and then follow it up with the SQL, and then show the relationship. Let us identify two non-empty sets, *A* and *B*. For this part, we will define *A* = {1,2,3} and *B* = {2,3,4}.

There are a number of binary operations (like arithmetical +, -, etc.) that we can perform on the sets.

Firstly, we can get *subset* out of the way. If all the members of *A* are also found in *B*, then *A* is a *subset* of *B*. denoted *A ⊆ B*. Trivially, *A ⊆ A*, so a useful term is *proper subset* which is *A ⊂ B*. So, the subsets from *A* and *B* include {2,3}, {2}, {3}, but not {1,4}, etc.

Next we have the following, along with venn diagrams to illustrate the operation:

- Union,
*A ∪ B*, is the set of all elements that are in*A*,*B*or both.*A ∪ B*= {1,2,3,4}.

- Intersection,
*A ∩ B*, is the set of all elements that are in both*A*AND*B*.*A ∩ B*= {2,3}

- Complement, sometimes also absolute complement,
*A*, is the set of all elements that are not in^{C}*A*but do exist in*U*, which is the universal set which is a superset of all the sets defined. So,*A*= {4}. Conversely,^{C}*B*= {1}. This is also called^{C}*set difference*and notated as*B\A*= {4}. This happens to be the same as the*relative complement*between two sets*A*and*B*because of the trivial two set construction that we are using here. If*U*were the positive integers, and*A*is as defined above, then*A*is a very large set indeed, in fact an infinite set.^{C}

- Relative complement,
*A*, is the set of elements that are not in^{C}∩ B*A*nor in*A ∩ B*. So,*A*= {4}.^{C}∩ B

- Symmetric difference,
*(A ∪ B) \ (A ∩ B)*or also*A ∆ B*, is the set of elements that is in either one of the sets, but not both. So,*(A ∪ B) \ (A ∩ B)*= {1,4}.

Cartesian product,

*A × B*, is the set whose elements are all the ordered pairs*(a,b)*.*A × B*= {(1,2), (1,3), (1,4), (2,2), (2,3), (2,4), (3,2), (3,3), (3,4)}. This will result in a set of*n * m*elements,*n*and*m*being the number of elements in*A*and*B*respectively.Power set of a set

*A*is the set whose members are all possible subsets of*A*(proper subsets and the set itself. v. supra). This will also include {}.

Ok, so there is the set theory, and we remember everything that we need to about membership and the relationship between the two sets. SQL is an expressive language for relational algebra, but can be coupled with set-theoretic models to aid conception of the operations being conducted. However, there are important caveats. One is that this set-theoretic model only works if there is a 1-1 correspondence between the sets *A* and *B*. For our purposes, think of *A* and *B* as two tables. If there are duplicate rows in either table, then this use of set-theory will break down horribly, because the cardinality of the sets changes.

# SQL operations

Let us now just think of the sets above as tables in a schema, in which there are records. We can think of the operations that we perform as returning a set of new records, and these will correspond to the same results as some of our binary set-theoretic operations. But to aid this understanding, I will now define two new tables (Sets), *A* and *B*.

## A

A.id A.name

-- ----

1 Tom

2 Sarah

3 Jamie

## B

B.id B.job

-- ----

2 Analyst

3 Developer

4 Manager

The goal here is to join the records in the two tables in meaningful ways.

- Full Outer Join

`SELECT * FROM A FULL OUTER JOIN B ON A.id = B.id`

id name id job

-- ---- -- ----

1 Tom *null* *null*

2 Sarah 2 Analyst

3 Jamie 3 Developer

*null* *null* 4 Manager

This is like the union. Where there is no match, the specification defines that the missing side should contain a *null*.

- Inner Join

`SELECT * FROM A INNER JOIN B ON A.id = B.id`

id name id job

-- ---- -- ----

2 Sarah 2 Analyst

3 Jamie 3 Developer

This is like the intersection.

- Records unique to both
*A*and*B*

`SELECT * FROM A FULL OUTER JOIN B ON A.id = B.id WHERE A.id IS null OR B.id IS null`

id name id job

-- ---- -- ----

1 Tom *null* *null*

*null* *null* 4 Manager

This is the symmetric difference.

- Records unique to
*A*

`SELECT * FROM A LEFT OUTER JOIN B ON A.id = B.id WHERE B.id IS null`

id name id job

-- ---- -- ----

1 Tom *null* *null*

This is the relative complement, but the opposite of what we defined above, in this case, *B ^{C} ∩ A*.

- Records unique to
*A*

`SELECT * FROM A LEFT OUTER JOIN B ON A.id = B.id`

id name id job

-- ---- -- ----

1 Tom *null* *null*

2 Sarah 2 Analyst

3 Jamie 3 Developer

This is the set from *A*, but the specification will show null values for any non match from *B*.

- Cross join or Theta Join

`SELECT * FROM A CROSS JOIN B`

This will join everything, resulting in a *n X m* result set. This is the cartesian product.

Note that you don’t have to result to CROSS JOIN to get the same (hopefully, unintended) result set.

- Power sets

`SELECT * FROM A WHERE A.id = 1`

This will produce subsets of *A* with appropriate WHERE clauses.

Pingback: Self-Taught Developers: Are You Missing Your Foundation? | Ameya Karve's Weblog