Do-it-yourself data masking

This article describes how to create a VIEW for viewing depersonalized data. The solution described here builds on the approach from this article (fetching a random row from a table).

The main purpose of data masking is to obfuscate the real data and make it unrecoverable. But it is not enough just to hide the real data. Very often it is also necessary to make it look as realistic as possible.

Such requirements arise because data masking is mostly used for application testing, where the data should look as close to the real thing as possible. A good way to achieve this is to use values from the actual table, but taken from random rows.
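The core trick, taking a value from a random row, can be written naively in a single query (a sketch; note that ORDER BY random() re-sorts the whole table on every call, which is exactly the cost the lookup table below avoids):

	-- naive version: take client_port from a random row of the same table;
	-- slow on large tables, since the whole table is scanned and sorted each time
	SELECT client_port
	FROM connections
	ORDER BY random()
	LIMIT 1;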

Let’s begin.

I will use a table named “connections” as an example. This table includes “ID” and “client_port” columns which should be masked, and the “ID” column is the table’s primary key.

Since some rows may have been deleted, the ID column is not strictly sequential, so let’s create a table that links each ID to a row number. This is essentially the quickest way for PostgreSQL to select data by row number. If you are using an Oracle database, you can skip this step.

	CREATE TABLE client_port_ids
	(
	    rowid serial PRIMARY KEY,
	    id integer
	);
	-- fill the table with the existing id values; it must be populated before masking
	INSERT INTO client_port_ids (id) SELECT id FROM connections ORDER BY id;
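With this helper table in place, picking a random existing row becomes a cheap primary-key lookup, for example:

	-- pick a random rowid between 1 and max(rowid), then follow it to connections
	SELECT c.client_port
	FROM client_port_ids i
	JOIN connections c ON c.id = i.id
	WHERE i.rowid = (SELECT floor(random() * max(rowid))::integer + 1
	                 FROM client_port_ids);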

Since the database should return the same substitute for a given value on every SELECT query, we need a table that stores the mapping between the real data and the substitutes.

	CREATE TABLE client_port_map
	(
	    src integer PRIMARY KEY,
	    dst integer
	);

Now let’s create a masking function. It first checks whether a substitute has already been generated for the given value; if not, it takes a value from a random row and stores the new mapping.

	CREATE OR REPLACE FUNCTION public.hide_client_port(val integer)
	RETURNS integer AS
	$BODY$
	DECLARE
	    res integer;
	    row_count integer;
	    rand_row integer;
	BEGIN
	    -- check whether a substitute already exists for this value
	    SELECT dst INTO res FROM client_port_map WHERE src = val;
	    IF NOT FOUND THEN
	        -- take the value from a random row
	        SELECT max(rowid) INTO row_count FROM client_port_ids;
	        LOOP
	            -- rowid values start at 1, hence the "+ 1"
	            SELECT floor(random() * row_count)::integer + 1 INTO rand_row;
	            SELECT client_port INTO res FROM connections
	            WHERE id = (SELECT id FROM client_port_ids WHERE rowid = rand_row);
	            EXIT WHEN FOUND;
	        END LOOP;
	        -- save the new value to the mapping
	        INSERT INTO client_port_map VALUES (val, res);
	    END IF;
	    RETURN res;
	END;
	$BODY$
	LANGUAGE plpgsql VOLATILE;

Let’s see how the entries are shuffled.
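For example, a query like this puts the real values next to their substitutes (assuming the tables above are already populated):

	-- compare real ports with their masked substitutes
	SELECT id,
	       client_port,
	       hide_client_port(client_port) AS masked_port
	FROM connections
	ORDER BY id
	LIMIT 10;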

Since this example uses a small table, some substitutes happen to match the real values. That’s because you can’t cheat probability theory.
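You can even count how many values happened to be “masked” with themselves:

	-- number of values that were mapped to themselves
	SELECT count(*) FROM client_port_map WHERE src = dst;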

How can this be used? Let’s create a new schema with a VIEW over the table with real data. For the “connections” table we create the following VIEW:

	CREATE OR REPLACE VIEW public.connection AS
	SELECT connections.partition_id,
	       connections.id,
	       connections.interface_id,
	       connections.client_host,
	       hide_client_port(connections.client_port) AS hide_client_port,
	       connections.begin_time,
	       connections.end_time,
	       connections.client_host_name,
	       connections.instance_id,
	       connections.proxy_id,
	       connections.sniffer_id
	FROM connections;
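A test application can then query the view almost like the real table, and the client port is substituted on the fly; repeated queries return the same substitutes because of the mapping table:

	-- the masked column is computed on the fly and is stable across queries
	SELECT id, client_host, hide_client_port
	FROM public.connection
	LIMIT 10;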

As you can see, it’s pretty easy. Of course, this function can be improved. For example, you could add a mechanism that assigns different substitutes to rows containing identical values. But that’s another story.