Common Errors¶
In this section, we walk through the most common error messages that arise when working with DBgen and how to handle them.
Data in the database don't match your expectations¶
Only one row¶
Symptom¶
You see only one row where you expected to see many. The one value in the database is the last value that was processed. dbgen run status
shows that the ETLstep in question processed the expected number of inputs but only made one insertion.
Possible Cause¶
If a ETLstep is only inserting one row where you expected it to insert many, you may have forgotten to set the __identifying__
attribute on the table in question. An example that illustrates the problem is shown below.
from typing import List
from dbgen import Entity, ETLStep, Extract, Model
class Number(Entity, table=True):
# __identifying__ is not set! This is the problem.
i: int
class IntExtract(Extract):
n: int
outputs: List[str] = ["i"]
def extract(self):
yield from range(self.n)
def make_model() -> Model:
model = Model(name="only_one_row")
with model:
with ETLStep(name="ints"):
i = IntExtract(n=10).results()
Number.load(insert=True, i=i)
return model
In this example, dbgen run status
will show that only one row was inserted even though 10 inputs were processed. Connecting to the database and querying the Number
table will show only one row with the last number processed (9).
Connection Errors¶
When connecting to a new database, it is generally useful to run dbgen connect --test
first to determine whether or not dbgen is able to connect to the database. Below, we walk through the most common errors that occur when connecting to a new database and how to resolve them.
Role postgres does not exist¶
Symptom¶
If you see FATAL: role "postgres" does not exist
in your error message, this means that dbgen is trying to connect to the database using the username postgres
, which is not your postgres username.
(.venv) ➜ alice_bob_lab git:(documentation) ✗ dbgen connect --test
---------------------------------------------
Checking Main DB...
Trouble connecting to database at postgresql://postgres:******@localhost:5432/dbgen.
(psycopg2.OperationalError) connection to server at "localhost" (::1), port 5432 failed: FATAL: role "postgres" does not exist
(Background on this error at: https://sqlalche.me/e/14/e3q8)
Cannot connect to Main DB at postgresql://postgres:******@localhost:5432/dbgen?options=--search_path%3dpublic!
---------------------------------------------
Checking Meta DB...
Trouble connecting to database at postgresql://postgres:******@localhost:5432/dbgen.
(psycopg2.OperationalError) connection to server at "localhost" (::1), port 5432 failed: FATAL: role "postgres" does not exist
(Background on this error at: https://sqlalche.me/e/14/e3q8)
Cannot connect to Meta DB at postgresql://postgres:******@localhost:5432/dbgen?options=--search_path%3ddbgen_log!
Possible Causes¶
1) postgres
is the default username for some postgres installations but not all. Another common default username is the user's bash username. To see what your bash username is, enter the command whoami
into the command line. If this is the problem, updating your DBgen .env
file with that username
will solve it.
In .env
, make sure this line has the correct username:
dbgen_main_dsn = postgresql://[your_username]@[host]:[port]/[database_name]
2) Your .env
could not be found
If you have set the username to a value other than postgres
in a DBgen .env
file, and you are still getting this error message, then DBgen is not finding your .env file and is falling back on global defaults. DBgen looks for .env files in the following places (in order):
- The location passed after the
-c
flag - The current directory
- The location set in the environmental variable
$DBGEN_CONFIG
If DBgen was recently able to connect to the database but stopped being able to connect after you changed directories, it is likely that it had found a .env
file in the working directory before you changed directories but not after.
Errors when running models¶
Missing imports in the transform¶
Symptom¶
You are running a ETLstep that includes a transform that requires an import statement to run, and when you call dbgen run ...
, you see an error that includes something that looks like:
NameError: name 're' is not defined
Error when running ETLstep ints
Error encountered while applying function named 'parse_string'
Below is an example that produces this error:
import re
from typing import List, Tuple
from dbgen import Entity, Environment, ETLStep, Extract, Import, Model, transform
class StringAndInt(Entity, table=True):
__tablename__ = "number"
__identifying__ = {"i", "s"}
i: int
s: str
class StringExtract(Extract):
n: int
outputs: List[str] = ["a_string"]
def extract(self):
for i in range(self.n):
a_string = f"string:{i}"
yield {"a_string": a_string}
# Env does not include the "re" package! This is the source of the error!
env = Environment([Import("typing", "Tuple")])
# The correct Env line is shown below
# env = Env([Import("typing", "Tuple"), Import("re")])
@transform(outputs=["s", "i"], env=env)
def parse_string(inp: str) -> Tuple[str, int]:
regex = r"([a-z]+):(\d+)"
match = re.match(regex, inp)
s, i_str = match.groups()
return s, int(i_str)
def make_model() -> Model:
model = Model(name="only_one_row")
with model:
with ETLStep(name="ints"):
a_string = StringExtract(n=10).results()
s, i = parse_string(a_string).results()
StringAndInt.load(insert=True, i=i, s=s)
return model
Possible Cause¶
Although the package that your transform depends on may well already be imported at the top of the file where the transform is defined, you must also make sure that the environment passed to the decorator @transform(env=my_env)
includes the package or function that your transform depends on. To do that, add an Env
to the existing one and include the missing package.
new_env = my_env + Env([Import("new_package")])
@transform(env=new_env, outputs=...)
def some_transform(...):
...
Mismatch between the number of output names and the number of outputs returned by a transform¶
Symptom:¶
You are running a model and one of the ETLSteps produces an error that looks this:
TypeError: You are attempting to iterate/unpack the arg object Arg(c58d...,wrong). This can commonly occur when a Extract or Transform outputs a single output when you expected two. Did you remember to set the outputs on your Extract or Transform?
A model that produces this type of error is shown below.
from typing import List
from dbgen import Entity, ETLStep, Extract, Model, transform
class Number(Entity, table=True):
__identifying__ = {"integer"}
integer: int
bigger_integer: int
smaller_integer: int
class IntExtract(Extract):
n: int
outputs: List[str] = ["i"]
def extract(self):
yield from range(self.n)
# The "outputs" list has only one output name, but the function returns
# two values. This is the problem!
# The correct line would be: @transform(outputs=["bigger_integer", "smaller_integer"])
@transform(outputs=['wrong'])
def alter_ints(integer: int):
bigger_integer = integer + 1
smaller_integer = integer - 1
return bigger_integer, smaller_integer
def make_model() -> Model:
model = Model(name="transform_output_count_mismatch")
with model:
with ETLStep(name="ints"):
integer = IntExtract(n=10).results()
bigger_integer, smaller_integer = alter_ints(integer).results()
Number.load(
insert=True, integer=integer, bigger_integer=bigger_integer, smaller_integer=smaller_integer
)
return model
Possible Cause¶
You will get an error like this if the number of the number of variables returned by a transform is not the same as the length of the list of output names (defined in @transform(outputs=[...])
).
You will also see this error if you do not set a value for outputs
at all in your @transform
definition, and your function returns multiple outputs. If you do not specify any output names, then DBgen assumes that the transform has only one output and gives it the name "out"
.
In either case, the solution to this problem is to define a list of output names in @transform(outputs=[...])
that is the same length as the number of outputs that your function returns.
Missing Identifying Information¶
Symptom¶
You are running a model and one of the ETLSteps produces an error that looks this:
DBgenMissingInfo: Cannot refer to a row in number without a PK or essential data. Missing essential data: {'integer'}
Did you provide the primary_key to the wrong column name? the correct column name is the table name (i.e. number=...)
A model that produces this type of error is shown below.
from typing import List
from dbgen import Entity, ETLStep, Extract, Model
class Number(Entity, table=True):
__identifying__ = {"integer"}
integer: int
bigger_integer: int
class IntExtract(Extract):
n: int
outputs: List[str] = ["integer", "bigger_integer"]
def extract(self):
for i in range(self.n):
yield i, i + 1
def make_model() -> Model:
model = Model(name="missing_id_info")
with model:
with ETLStep(name="ints"):
integer, bigger_integer = IntExtract(n=10).results()
# The load statement does not supply the identifying information,
# which is set in "__identifying__={...}"
# A valid load statement would be:
# Number.load(insert=True, integer=integer, bigger_integer=bigger_integer)
Number.load(insert=True, bigger_integer=bigger_integer)
return model
Possible Cause¶
Whenever you see this error, it means that you are trying to insert new rows or update existing rows, but you have not supplied the identifying information for the row you are trying to update or insert. Go to the .load()
statement that is producing the error, and look at the Entity
(meaning the Entity referred to in EntityName.load(...)
). Then, go to the definition of that Entity
and look at the set of column names that appear in __identifying__
. This is the information that you need to supply in your load statement.
In the case of inserting new rows, in your .load(insert=True, ...)
statement, you must supply values for each of the pieces of identifying information as keyword arguments:
EntityName.load(
insert=True,
column_name_from_Entity_definition = variable_name_from_transform_or_extract,
...
)
In the case of updating existing rows, you have the option of either supplying the identifying information for the row (as shown above) or the ID string for the row, which will always come from a query (as shown below).
id_string, raw_value = Query(select(EntityName.id, EntityName.raw_value)).results()
refined_value = a_transform(raw_value).results()
EntityName.load(
id=id_string,
refined_value=refined_value
)
Installation Errors¶
dbgen executable can not be found¶
Symptom¶
$ dbgen version
command not found: dbgen
Possible Cause¶
This error occurs when the dbgen executable cannot be found in the $PATH variable of your shell. This is commonly caused by two main issues:
- You have not correctly sourced the virtual environment in which dbgen was installed.
Solution: activate the virtual environment
- DBgen has not been installed through pip or poetry
Solution: Install dbgen through the pip or poetry methods shown in the installation section
Missing module error¶
Symptom¶
$ dbgen version
ModuleNotFoundError: No module named 'rich'
Possible Cause¶
This error occurs when a dependency of dbgen has not been installed. This can occur if the executable was installed to an incorrect python environment. This occurs when the pip executable from one python installation is used to install dbgen. You can be sure this is the issue if the command python -m dbgen version
does not cause issues, as this forces the use of the specific python installation first in your $PATH.
Uninstalling dbgen with the command pip uninstall modelyst-dbgen
and then reinstalling it in the correct location with python -m pip install modelyst-dbgen
will usually solve this issue. Just make sure to activate the virtual environment prior to running the installation command.