This is a series of notes about choosing the right data type from a data quality perspective, following the previous notes titled "Numeric vs. INT" and "DATE vs. VARCHAR(8).
The intention of this series is to present you with the challenge that you must take off the lens of Oracle. In my last project as a data modeler, I designed an entity type with the following structrue:
There were many true or false attributes in one entity type. The data type for the true or false attribute was VARCHAR2(1) in the domain dictionary and the possible values for the attribute were 'Y', 'N', and NULL. Since the DBMS was Oracle, VARCHAR2(1) was the optimal data type for these true or false attributes.
Unlike Oracle, PostgreSQL has a fully-fledged BOOLEAN data type. If you are working with PostgreSQL, you should register the BOOLEAN type in a domain dictionary and use it. There are two reasons why you should use BOOLEAN for true/false attributes.
Firstly, VARCHAR(1) cannot restrict the strings inserted into the column, which can lead to data integrity issues. The database does not produce an error message when you insert 'A' or 'B' other than 'Y' or 'N'. Secondly, VARCHAR(1) takes up more space than BOOLEAN, which can result in slower queries and higher storage costs.
To demonstrate my point, here is a quick and dirty script.
DROP TABLE T1;
CREATE TABLE T1 (
INSERT INTO T1
SELECT i, 'y','y','y'
FROM generate_series(1,10000) a(i);I have created a table with four boolean data types and inserted 10,000 rows.
DROP TABLE T2;
CREATE TABLE T2 (
INSERT INTO T2
SELECT i, 'y','y','y'
FROM generate_series(1,10000) a(i);I have also created a table with four varchar(1) data types and inserted 10,000 rows.Then I have checked the size of these two tables.
SELECT PG_RELATION_SIZE('t1') as boolean_table, PG_RELATION_SIZE('t2') as varchar_table;
| boolean_table | varchar_table |
| 368640 | 450560 |Note that the table with the BOOLEAN data type is smaller. Also, it is worth noting that there is no issue when using 'Y', 'y', 'N', 'n', or NULL in the WHERE clause with BOOLEAN columns, as they accept all of these values.
select * from t1 where c2 = 'Y' and c3 = 'y' and c4 = true;If you try to insert the value 'A', PostgreSQL rejects it.
analdb=# INSERT INTO T1 (c1, c2) VALUES (3,'A')
invalid input syntax for type boolean: "A"
LINE 1: INSERT INTO T1 (c1, c2) VALUES (3,'A')
For true or false attributes use the Boolean data type, which can lead to a reduction in disk space usage. Reduced disk space usage can lead to lower memory consumption. Lower memory consumption can result in reduced buffer lock contention and then some.
If you want to change the data type of a column, you can use the following DDL:ALTER TABLE
C2 TYPE BOOLEAN
USING CASE C2 WHEN 'Y' THEN TRUE
WHEN 'N' THEN FALSE
Keep in mind that changing the type via ALTER TABLE will exclusively lock the whole table while it is rewritten.