
Sharing data

There are several ways to share data between the pipeline and users:

  1. State vector extraction
    1. Sensitive information
    2. Connect to DB
    3. Extract state vectors
  2. Disk access

State vector extraction

State vector extraction works the same way from a Python script as from a Python notebook. The examples below were developed in a Jupyter notebook even though they read like a script; the information applies equally to either case.

Sensitive information

We first have to define the sensitive information that grants us access to our pipeline. The content will change depending upon the database you are using (PostgreSQL or shelve) and, of course, your choices for usernames, passwords, etc. The best way to achieve this is to:

$ source <repository root>/.docker/.env
$ source <environment profile>
$ source /proj/sdp/ops/db-read-access  # only if desired

These scripts should set all of the variables that you need to make your current environment look like the inside of a Docker container.

If you cannot set the environment before starting your Python environment, which is most often the case with Jupyter notebooks, then you will need to load the variables into os.environ within Python. First, install python-dotenv so that you do not have to rewrite what already exists. The <environment profile>, however, has to be processed manually. Here is an example that should work:

import os

def load_environment_profile(filename: str):
    # parse lines of the form "export KEY=value", expanding ~ and $VARS in the value
    with open(filename, 'rt') as file:
        for line in file.readlines():
            line = line.replace('export ', '').replace('\\\n', '').strip()
            if not line or line.startswith('#') or '=' not in line:
                continue  # skip blank lines and comments
            key, value = line.split('=', 1)
            os.environ[key] = os.path.expandvars(os.path.expanduser(value))

The function will also work for <repository root>/.docker/.env, but it overwrites any variable that is already set, which is why python-dotenv is better for that file. Since the Python mechanism works in either a script or a notebook, we will cover using it:

# define the repository root that will be used later
repository_root = '/home/niessner/Projects/Exoplanet/esp'

# load the base environment variables that docker compose would normally set
load_environment_profile(os.path.join(repository_root, '.docker/.env'))

# override the base environment variables that define personal choices
# in other words, change 'envs/alsMT' below to your <environment profile>
load_environment_profile(os.path.join(repository_root, 'esp/envs/alsMT'))

# load the environment variables that give read access to the ops pipeline
# it should be used if the desire is to communicate with a private pipeline
load_environment_profile('/proj/sdp/ops/db-read-access')
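
As noted earlier, python-dotenv is the gentler choice for <repository root>/.docker/.env because it will not clobber variables that are already set. Here is a minimal sketch, assuming python-dotenv is installed and repository_root is defined as above:

# sketch: load .docker/.env via python-dotenv without overwriting existing variables
import os
from dotenv import load_dotenv

load_dotenv(os.path.join(repository_root, '.docker/.env'), override=False)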

Connect to DB

We have prepared the environment in the previous section with our sensitive information so that dawgie configures itself as it starts. There are two esp-level configurations that we must also satisfy, one of which requires repository_root from the previous section:

# LDTK necessary bit of pain
import os; os.environ["LDTK_ROOT"] = '/proj/sdp/data/ldtk'

# add ESP software to the python path
import sys; sys.path.append(repository_root)

Using our sensitive information, we connect to the database. The first step is to import all of the required DAWGIE elements. The second step is to connect the security system to the desired pipeline. The third and last step is to open a connection to the database. However, since we are not a full pipeline, but rather a script requesting DB access, we should only do a dawgie.db.reopen().

# initialize security and connect to the DB defined through the environment
import dawgie.db
import dawgie.context
import dawgie.security

dawgie.security.initialize(
    path=os.path.expandvars(
        os.path.expanduser(dawgie.context.guest_public_keys)
    ),
    myname=dawgie.context.ssl_pem_myname,
    myself=os.path.expandvars(
        os.path.expanduser(dawgie.context.ssl_pem_myself)
    ),
    system=dawgie.context.ssl_pem_file,
)
dawgie.db.reopen()

Ignore any warnings/errors about “No PGP keys found”.

With the script/notebook connected to a DB via a running pipeline, we can now extract data directly from it.

Extract state vectors

In order to extract data, we need more information: Which run id? Which target? Which task, algorithm, and state vector? Which value of the state vector?

We now need to be more pedantic. The term “state vector” is intentionally nebulous here. Like the word “name”, it can have many meanings all at once. Technically, the state vector is a segment of the full name for any value generated by a pipeline. What in the heck is that supposed to mean? All pipelines generate software elements (dawgie.Value) that are stored and tracked by their full name: run_id.target.task.algorithm.state_vector.value. Just like when we ask what someone's name is, “name” means many things depending on context: informally it is your given name, formally it is your surname. When we say state vector, we usually mean its full name, including all of its values. Hence, “extract a state vector” means we need to know its full name, not just the name given to the software object dawgie.StateVector. That is why we need all this new information.

There are an infinite number of ways to obtain a state vector’s full name. There are, however, two simple methods for obtaining state vector data: one, manually retrieve the full name by interacting with the running pipeline; two, use an algorithm to load the data.

For both of these instances, we will want the latest or most recent data. To do this, we need a run id larger than any that exist. Therefore, we will use runid = 2**30. We will also need to specify a target, and, for documentation, we will use target = 'GJ 1214'. Lastly, we will declare the task.alg.sv name to be system.finalize.parameters.
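
For convenience, these choices could be collected as constants; the variable names here are purely illustrative and not part of esp or dawgie:

# illustrative constants capturing the choices above
RUN_ID = 2**30                              # larger than any real run id, so the latest wins
TARGET = 'GJ 1214'                          # target used throughout this documentation
TASK_ALG_SV = 'system.finalize.parameters'  # task.algorithm.state_vector name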

by full name

The advantage of using the full name is that you can select just the state vector or a single value of a state vector.

The disadvantage to using the full name is that you have to provide a thin dummy algorithm that depends upon the item of interest.

The job of the thin algorithm wrapper is to tie the name to parts of the software so that we can load the pickled data into the appropriate objects. How the name maps to the software is straightforward, but only by convention of the lead developer. Let's break the name into its constituent parts:

run id: omitted because we want the latest
target: GJ 1214
task: system
algorithm: finalize
state vector: parameters
value: needed

Because we are looking for the latest, we would write it as: GJ 1214.system.finalize.parameters.needed.

We now need to set up a dummy algorithm and task to load the data. There are three ways to load the data depending on your needs.

One, load everything the algorithm generated. While this is simplest and equivalent to by algorithm, it is generally overkill and load times can be excessive if you just want a single value.

import excalibur.system
import excalibur.system.algorithms

class DummyAlg(dawgie.Algorithm):
    def __init__(self):
        self.system = excalibur.system.algorithms.Finalize()
    def name(self): return 'dummy'
    def previous(self): return [dawgie.ALG_REF(excalibur.system.task, self.system)]
    def run(self): pass
    def state_vectors(self): return []
    pass

class DummyTsk(dawgie.Task):
    def __init__(self):
        dawgie.Task.__init__(self, 'dummy', 0, 2**30, 'GJ 1214')
    def list(self): return [DummyAlg()]

There is quite a bit of code to digest there. There are only a couple of really interesting parts.

The most important part of all the code is the ALG_REF in previous(). The ALG_REF is the only part that will change as we refine our data collection. The first parameter is the start of resolving the code and is called the factory function. Simply look in the repository to see it. The other parameter is an instance of an object that could be built by the factory.

The second most important part of the code is self.system because it is the object where the loaded data will be stored.

The next to least important part of the code is how dawgie.Task is initialized with our constants.

Now, let's refine our load to more specific data and see what changes.

Two, load a complete state vector. Loading a single state vector is more selective and, therefore, more efficient.

import excalibur.system
import excalibur.system.algorithms

class DummyAlg(dawgie.Algorithm):
    def __init__(self):
        self.system = excalibur.system.algorithms.Finalize()
    def name(self): return 'dummy'
    def previous(self): return [dawgie.SV_REF(excalibur.system.task, self.system,
                                              self.system.sv_as_dict()['parameters'])]
    def run(self): pass
    def state_vectors(self): return []
    pass

class DummyTsk(dawgie.Task):
    def __init__(self):
        dawgie.Task.__init__(self, 'dummy', 0, 2**30, 'GJ 1214')
    def list(self): return [DummyAlg()]

Only ALG_REF changed. It became SV_REF, and we had to add an instance of the state vector (SV) that we want loaded. In this example I used the state vector already contained in the algorithm, which is the correct way to manage it.

You can probably guess that not much will change for a single value as well.

Three, load just the value that you desire. This is the most efficient way to collect data.

import excalibur.system
import excalibur.system.algorithms

class DummyAlg(dawgie.Algorithm):
    def __init__(self):
        self.system = excalibur.system.algorithms.Finalize()
    def name(self): return 'dummy'
    def previous(self): return [dawgie.V_REF(excalibur.system.task, self.system,
                                             self.system.sv_as_dict()['parameters'], 'needed')]
    def run(self): pass
    def state_vectors(self): return []
    pass

class DummyTsk(dawgie.Task):
    def __init__(self):
        dawgie.Task.__init__(self, 'dummy', 0, 2**30, 'GJ 1214')
    def list(self): return [DummyAlg()]

We simply moved from SV_REF to V_REF and added the name of the value that we wanted.

Once the code is in place, you would load the data with:

alg = DummyAlg()
tsk = DummyTsk()
for p in alg.previous():
    dawgie.db.connect (alg, tsk, 'GJ 1214').load(p)

There should be several items that jump out at you. First, you can parameterize some of this to make it more flexible to your needs (less editing for different loads). Second, you can load more than one item at a time because each reference from previous() is loaded. It may not be obvious, but you can load values from all over the esp dependency graph in one list of many and varied references. Finally, you may want to look at the data you loaded. You do that with:

alg.system.state_vectors()

I highly recommend running this example as written against a private pipeline to better understand it. Once comfortable with it, add another element, load it, and show that it worked.
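
Once you are there, the dummy classes can also be parameterized so that a single helper builds the reference from the name pieces and returns the value. This is only a sketch under the same assumptions as above; the helper name, its arguments, and the final dictionary lookup (treating the state vector as a dict of its values) are illustrative, not part of esp or dawgie:

import dawgie
import dawgie.db
import excalibur.system
import excalibur.system.algorithms

def load_value(sv_name, value_name, target='GJ 1214'):
    # illustrative helper: build the dummy wrapper, load one value, and return it
    class _Alg(dawgie.Algorithm):
        def __init__(self):
            self.system = excalibur.system.algorithms.Finalize()
        def name(self): return 'dummy'
        def previous(self): return [dawgie.V_REF(excalibur.system.task, self.system,
                                                 self.system.sv_as_dict()[sv_name],
                                                 value_name)]
        def run(self): pass
        def state_vectors(self): return []
    class _Tsk(dawgie.Task):
        def __init__(self):
            dawgie.Task.__init__(self, 'dummy', 0, 2**30, target)
        def list(self): return [_Alg()]
    alg, tsk = _Alg(), _Tsk()
    for p in alg.previous():
        dawgie.db.connect(alg, tsk, target).load(p)
    # state vectors behave like dictionaries, so index by state vector then value name
    return alg.system.sv_as_dict()[sv_name][value_name]

# usage mirroring the V_REF example above
needed = load_value('parameters', 'needed')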

by algorithm

Sometimes you just want the data used by an algorithm for debugging purposes. Using the same example as by full name, it would look like:

import excalibur.system
import excalibur.system.algorithms

alg = excalibur.system.algorithms.Finalize()
for s in alg.state_vectors():
    dawgie.db.connect (alg,
                       excalibur.system.task('system',0,2**30, 'GJ 1214'),
                       'GJ 1214').load(dawgie.SV_REF(excalibur.system.task, alg, s))

Once again, all of the data can be found with alg.state_vectors().
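
To see what actually arrived, one possibility is to walk the state vectors and print their names and value keys; a small sketch, assuming the load above succeeded and that each state vector behaves like a dict of its values:

# list each loaded state vector and the names of the values it holds
for sv in alg.state_vectors():
    print(sv.name(), list(sv.keys()))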

Summary

Which method you use for extracting data from the database in a notebook or via a script really depends on your use case. The more refinement you need or want, the more you are pushed toward by full name, whereas simpler debugging of an algorithm points toward by algorithm. I suggest playing with both; those that are bored can parameterize by full name to take a string and then return the results.

Disk access

Mostly, the disk layout imprints straight through the container wall, so paths look the same inside and outside the container, which makes it simpler to understand. This was not originally the case, but it is now.

Logically, there are three spaces: pipeline data, user data, and ephemeral data.

One, the pipeline data is all of the data generated by the pipeline during operation. It can be data staged on the way to its store, the database, logs, etc. In the operational system this logical area is /proj/sdp/data. Mostly, it should be left to the pipeline to manage; it will automatically rotate logs, clean up, and back up information there.

Two, the user data is pipeline read-only information. It can be certificates, security tokens, and any other read-only user specific information that helps identify the user.

Three, the ephemeral data is that which survives for only the life of the Docker container. These are things like the matplotlib working directory, the tensor compilation working directory, etc. These tools want a workspace that is not shared with other users, and keeping it in the container is the simplest way to satisfy that single-user need.

While it should never be necessary, there is sometimes the desire to save some information outside of the state vector concept during the pipeline execution. When necessary, dawgie.context.data_stg should be used as the temporary holding ground. Note that the pipeline uses this space as well, which means the user, you, are responsible for selecting unique names that the pipeline would not choose. For the operational pipeline, dawgie.context.data_stg maps to /proj/sdp/data/stg.
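
For example, one way to pick a name the pipeline would not choose is to fold your username into the filename; a small sketch, where the filename scheme is purely illustrative:

import getpass
import os

import dawgie.context

# staging file whose name includes the username to avoid clashing with the pipeline
scratch = os.path.join(dawgie.context.data_stg,
                       'user_{}_gj1214_scratch.pkl'.format(getpass.getuser()))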