Elixir PostgreSQL
Ecto Migrations: Simple to Complex
Over the lifetime of an application, the application's database undergoes many transformations. New tables are added, new relationships are formed, columns are added and subtracted, data is massaged, and so on.
We know this well. This constant change is what underlies our need for database migrations.
These migrations span a wide range. On the one end we have the textbook modifications to our schema. We can pull up the Getting Started guide for Ecto to see how we might add a table or column. On the other end are more sophisticated migrations. These may involve multiple steps or even migrating and massaging data.
In this post, we'll explore this range.
Adding A Table
Let's add a posts table for our blog app.
def change do
create table(:posts) do
add :title, :varchar
add :content, :text
timestamps()
end
end
This gives us a primary key column id by default. This is a unique
integer column starting at 1 and counting upwards using a sequence managed
by the database. Every post ought to have a title and some content, so we've
added columns for those. Ecto gives us the flexibility to use data types
native to our database, so we've done just that with some of PostgreSQL's
native data types (varchar and
text).
Lastly, the
timestamps()
function will produce inserted_at and updated_at timestamp columns.
Running mix ecto.migrate will add the posts table and posts_id_seq
sequence to our database schema.
Altering A Table
We said that each post ought to have a title and some content, but we
aren't actually enforcing that. Our database can enforce the presence of
those columns when inserting and updating records. We just have to tell it
to do so with the not null modifier.
Let's generate a new migration.
def up do
alter table(:posts) do
modify :title, :varchar, null: false
modify :content, :text, null: false
end
end
We use the alter function to target our posts table. The modify
function declares how the named column is to be modified. We have to specify
the data type of the column. If it is the same, nothing changes. This is,
however, an opportunity to change a column's type. For our purposes though,
we are just declaring that these columns cannot be null, hence the null:
false.
Every up migration requires a down migration. Because we are modifying
existing columns, the down migration cannot be inferred, so we have to
spell it out explicitly.
def down do
alter table(:posts) do
modify :title, :varchar, null: true
modify :content, :text, null: true
end
end
The down migration explicitly says that these columns can be null. This
takes us back to where we were if need to rollback.
Adding A Column With A Default
Posts take time to write and sometimes go through a number of iterations.
We'd like to know whether a particular post is in a draft or published
state. We can achieve this by adding a published flag in a new migration.
def change do
alter table(:posts) do
add :published, :boolean, null: false, default: false
end
end
A post is either published or it isn't, so a boolean published column will
do the trick. This is a binary status, so null doesn't mean anything to
us, so we make the column not nullable. Lastly, we want to be explicit about
publishing a post, so we default the published status of a post to
false.
Referencing Another Table
Our posts don't appear out of nowhere. People write them and those people
would probably like attribution. We should add a table of people and then
reference those people in the posts[1].
def change do
create table(:people) do
add :name, :varchar
end
alter table(:posts) do
add :people_id, references(:people), null: false
end
end
This first creates the people table, again with an implicit primary key
id column. The second portion of the migration alters the posts table to
add a people_id column.
The references function adds a foreign key constraint to ensure the
integrity of the relationship between people and their posts. In other
words, the database makes sure that post records never get orphaned. With a
foreign key constraint in place, we can be sure that a post pointing to a
person with an id of 6 will definitely be there. If we try to delete the
person with an id of 6, our database will stop us reminding us that we
need to deal with the posts that depend on person 6 first. Perhaps the
right thing to do is delete those posts as well.
It should be noted that by default the
references
function assumes the key it is referencing has a name of id. So, in our
case references(:people) means there will be a foreign key from
people_id of posts to id of people. If you want to reference a
differently-named column, you can include the :column option as part of
the second argument to references specifying the intended name of the
referenced column.
Transitioning A Column
Let's imagine some time has passed and our app, now in production, has some people who are creating and publishing posts.
We'd like to be able to display in the UI the time at which a post was
published. We consider utilizing the updated_at column when a post in the
published state, but a post can be updated after it is published, so this
is not an accurate indicator.
Our next thought is to add a published_at timestamp column. If the post is
transitioned to the published state, then we set published_at to the
current time and then display that time as long as the post remains in the
published state. If published is changed from true to false for a
post, then we can null out the published_at value. That'll work, but
we've run into a data design smell.
With this second approach, we've introduced a denormalization of our data.
The published and published_at fields have partially overlapping
concerns and it is possible for them to get out of sync. We can achieve the
same ends with a third, simplified option that only involves one column.
We can replace the published flag with the published_at timestamp. We
maintain the binary indicator because published_at being null tells us it
is in a draft state whereas published_at being some timestamp tells us
both that it is published and the time at which it was published.
Let's add this migration.
def change do
alter table(:posts) do
add :published_at, :timestamp
remove :published
end
end
That will do the trick. But hold up a second. If we run this migration against our production database, we will be doing something very sad. We'll be throwing away a bunch of data and messing up the publish status of all our existing posts.
We need to transition some data before we can remove the published column.
We need to set all existing unpublished posts to have null for
published_at. All the existing published posts, however, need to be
assigned a timestamp. We already know that our existing data model was not
sufficient for representing when a post was published. We are going to have
to do some approximation, and that's okay.
Let's say that, in terms of this data migration, setting the published_at
value for all published posts to the same time that they were all last
updated is sufficient. So, for each post with published set to true, we
need to duplicate that post's updated_at value into the newly-created
published_at column. That will require an update_all statement[2].
Let's modify our migration from above.
import Ecto.Query
def up do
alter table(:posts) do
add :published_at, :timestamp
end
from(p in "posts",
update: [set: [published_at: p.updated_at]],
where: p.published)
|> MyApp.Repo.update_all([])
alter table(:posts) do
remove :published
end
end
This migration looks like exactly what we want, but when we run mix
ecto.migrate, we are going to see an odd error.
** (Postgrex.Error) ERROR (undefined_column): column p0.published_at does not
exist
How can it not exist for our update_all function when we clearly added
that column just a few lines above?
The issue is that the Ecto migrator is queueing up everything in our
migration to be executed against the database. When it reaches the
update_all statement it tries to create a prepared statement that involves
a column that has not yet been created.
To ensure that the first portion of our migration is executed before the
update_all is reached, we can utilize the
flush() function.
Placing flush() after the first portion of the migration will force
everything queued up so far to be executed against the database. This clears
the way for the rest of our migration to work.
def up do
alter table(:posts) do
add :published_at, :timestamp
end
flush()
from(p in "posts",
update: [set: [published_at: p.updated_at]],
where: p.published)
|> MyApp.Repo.update_all([])
alter table(:posts) do
remove :published
end
end
And like before, with any up migration, we need a down.
def down do
alter table(:posts) do
add :published, :boolean, null: false, default: false
end
flush()
from(p in "posts",
update: [set: [published: true]],
where: not is_nil(p.published_at))
|> MyApp.Repo.update_all([])
alter table(:posts) do
remove :published_at
end
end
This requires some of the same tricks as the up migration because we want
to be able to put the data back in place as best as we can in the event of
a rollback.
Conclusion
In this post, we've explored a range of simple to complex Ecto migrations involving both schema changes and data transformation. We looked at how you can create new tables as well as alter existing ones. We even explored in detail how you would go about massaging and migrating data in the event that you need to replace one column with another. Ecto is an abstraction on top of SQL and we may need to drop down to raw SQL migrations in some cases, but from what we've seen Ecto can get us pretty far.
[1]: The following migration assumes an empty database, as if we are in the
midst of creating an initial MVP of our application. If the database
already contained posts, then adding a non-nullable column would cause an
error due to a violation of the not null constraint.
[2]: When using the
update_all
function, there are two ways of specifying the terms of the update. As we
did above, the :update keyword can be used within our Query struct to
specify how to update the relevant records. Because we went this route, we
leave the required updates argument to update_all as an empty list. If
we don't specify the update as part of the query, then we need to include
those details as part of the updates argument to update_all.
Photo credit for cover image: Paul Morris, unsplash.com