I would like to set some target paths as global variables in Luigi.

The reason is that the target paths I'm using are based on the last run of a given numerical weather prediction (NWP), and it takes some time to get that value. Once I have checked which run is the latest, I create a path in which I will put several target files (all in the same parent folder).

I'm currently repeating a similar call to get the value of the parent path for several tasks, and it would be much more efficient to set this path as a global variable. I have tried to define a global variable from within a function (get_target_path) called by a Luigi class, but the global variable doesn't seem to persist once control returns to the Luigi pipeline.

This is roughly what my code looks like:

from datetime import datetime

import luigi


class GetNWP(luigi.Task):
    """
    Download the NWP data.
    """
    product_id = luigi.Parameter()
    date = luigi.Parameter(default=datetime.today().strftime('%Y%m%d'))
    run_hr = luigi.Parameter(default='latest')

    def requires(self):
        return None
    def output(self):
        path = get_target_path(self.product_id, self.date, self.run_hr,
                               type='getNWP')
        return luigi.LocalTarget(path)
    def run(self):
        download_nwp_data(self.product_id, self.date, self.run_hr)


class GetNWP_GFS(luigi.Task):
    """
    GFS data.
    """
    product_id = luigi.Parameter()
    date = luigi.Parameter(default=datetime.today().strftime('%Y%m%d'))
    run_hr = luigi.Parameter(default='latest')

    def requires(self):
        return None
    def output(self):
        path = get_target_path(self.product_id, self.date, self.run_hr,
                               type='getNWP_GFS')
        return luigi.LocalTarget(path)
    def run(self):
        download_nwp_data(self.product_id, self.date, self.run_hr,
                          type='getNWP_GFS')


class Predict(luigi.Task):
    """
    Create forecast.
    """
    product_id = luigi.Parameter(default=None)
    date = luigi.Parameter(default=datetime.today().strftime('%Y%m%d'))
    run_hr = luigi.Parameter(default='latest')
    horizon = luigi.Parameter(default='DA')

    def requires(self):
        return [
                GetNWP_GFS(self.product_id, self.date, self.run_hr),
                GetNWP(self.product_id, self.date, self.run_hr)
                ]
    def output(self):
        path = get_target_path(self.product_id, self.date, self.run_hr,
                               type='predict', horizon=self.horizon)
        return luigi.LocalTarget(path)
    def run(self):
        get_forecast(self.product_id, self.date, self.run_hr)

The function get_target_path defines a target path based on the input parameters. I would like this function to set global variables that would be accessible from Luigi. For example as follows (just the code for the getNWP task):

def get_target_path(product_id, date, run_hr, type=None, horizon='DA'):
    """
    Obtain target path.
    """
    # A name must be declared global before it is used anywhere in the function.
    global path_nwp_model
    if type == 'getNWP_GFS':
        if 'path_nwp_gfs' in globals():
            return path_nwp_gfs
        else:
            ...
    elif type == 'getNWP':
        if 'path_nwp_model' in globals():
            return path_nwp_model
        else:
            filename = f'{nwp_model}_{date}_{run_hr}_{horizon}.{ext}'
            path = Path(db_dflt['app_data']['nwp_folder'])
            create_directory(path)
            path_nwp_model = Path(path) / filename
            return path_nwp_model
    elif type == 'predict':
        if 'path_predict' in globals():
            return path_predict
        else:
            ...

The global variable defined in this function no longer exists once control returns to the Luigi tasks.

Any ideas on how to solve this problem will be appreciated!

2 Answers

1

As there seems to be no built-in method to store the paths of Luigi's targets, I finally decided to create a class that holds all the information related to Luigi's targets/paths. This class is used within Luigi's Tasks when calling external functions that need to know the target paths.

This class is imported in the main Luigi script and instantiated before defining the Tasks:

from .utils import Targets
paths = Targets()

class GetNWP(luigi.Task):
    """Download NWP data required to prepare the prediction."""

    product_id = luigi.Parameter()
    date = luigi.Parameter(default=datetime.today().strftime('%Y%m%d'))
    run_hr = luigi.Parameter(default='latest')

    def requires(self):
        return GetProductInfo(self.product_id)
    def output(self):
        path = paths.getpath_nwp(self.product_id, self.date, self.run_hr)
        path_gfs = paths.getpath_nwp_GFS(self.product_id, self.date, self.run_hr)
        return [luigi.LocalTarget(path),
                luigi.LocalTarget(path_gfs)]
    def run(self):
        download_nwp_data(self.product_id, date=self.date, run_hr=self.run_hr,
                          paths=paths, nwp_model=paths.nwp_model)
        download_nwp_data(self.product_id, date=self.date, run_hr=self.run_hr,
                          paths=paths, nwp_model=paths.gfs_model)   

class Predict(luigi.Task):
    """Create forecast based on the product information and NWP data."""

    product_id = luigi.Parameter()
    date = luigi.Parameter(default=datetime.today().strftime('%Y%m%d'))
    run_hr = luigi.Parameter(default='latest')

    def requires(self):
        return GetNWP(self.product_id, self.date, self.run_hr)
    def output(self):
        path = paths.getpath_predict(self.product_id, self.date, self.run_hr)
        path_gfs = paths.getpath_predict_GFS(self.product_id, self.date,
                                             self.run_hr)
        return [luigi.LocalTarget(path),
                luigi.LocalTarget(path_gfs)]
    def run(self):
        get_forecast(product_id=self.product_id, date=self.date,
                     run_hr=self.run_hr, paths=paths, nwp_model=paths.nwp_model)
        get_forecast(product_id=self.product_id, date=self.date,
                     run_hr=self.run_hr, paths=paths, nwp_model=paths.gfs_model)

where the Targets class has the following structure:

class Targets:
    """Store Luigi's target paths."""

    def __init__(self):
        """Initialize paths and variables."""
        self.prod_id = None
        self.path_1 = None
        self.path_2 = None
        self.path_3 = None

    def update_object(self, product_id, date=None, run_hr=None):
        """Update object based on inputs."""
        # Each path is generated only the first time it is needed.
        if self.prod_id is None:
            self.prod_id = product_id
        if self.path_1 is None:
            self.get_path_1(product_id)
        if self.path_2 is None:
            self.get_path_2(product_id)
        if self.path_3 is None:
            self.get_path_3(product_id)

    def get_path_1(self, product_id, ...):
        """Generate a path 1 for a luigi Task."""
        ...  # define self.path_1

    def get_path_2(self, product_id, ...):
        """Generate a path 2 for a luigi Task."""
        ...  # define self.path_2

    def get_path_3(self, product_id, ...):
        """Generate a path 3 for a luigi Task."""
        ...  # define self.path_3

The main idea is to set the target paths only once and pass them to each Luigi task as input parameters. This makes it possible to:

  • Run the tasks more quickly, and
  • Avoid errors if a target path changes because a new NWP run becomes available.
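
The "resolve once, reuse everywhere" idea can be sketched without Luigi. In this minimal sketch, `find_latest_run` is a hypothetical stand-in for the slow lookup of the latest NWP run, and `PathCache`, the folder layout and the `.nc` file names are assumptions for illustration only, not the actual Targets class:

```python
from pathlib import Path


def find_latest_run(product_id, date):
    """Hypothetical stand-in for the slow 'which run is the latest?' lookup."""
    find_latest_run.calls += 1
    return "12"


find_latest_run.calls = 0  # counts how often the slow lookup actually runs


class PathCache:
    """Resolve each parent folder once and reuse it for every target."""

    def __init__(self, base_dir):
        self.base_dir = Path(base_dir)
        self._parents = {}  # (product_id, date) -> resolved parent folder

    def parent(self, product_id, date):
        key = (product_id, date)
        if key not in self._parents:
            # Only the first request for this key pays for the lookup.
            run_hr = find_latest_run(product_id, date)
            self._parents[key] = self.base_dir / f"{date}_{run_hr}"
        return self._parents[key]

    def target(self, product_id, date, name):
        # Every target for the same product and date shares one parent folder.
        return self.parent(product_id, date) / f"{product_id}_{name}.nc"


paths = PathCache("/tmp/nwp")
nwp_path = paths.target("prodA", "20240101", "nwp")
gfs_path = paths.target("prodA", "20240101", "gfs")
```

Both targets end up in the same parent folder and the slow lookup runs a single time. Keep in mind that an object like this lives in the memory of one Python process, so if tasks run in several worker processes, each process would presumably resolve the path for itself.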
0

You can use mixins if you'd like, but keep in mind that luigi tasks can inherit instance methods and parameters.

import os
from datetime import datetime

import luigi

LUIGI_BASE_PATH = '/path/to/luigi/dir'


class BaseTask(luigi.Task):
    product_id = luigi.Parameter()
    date = luigi.Parameter(default=datetime.today().strftime('%Y%m%d'))
    run_hr = luigi.Parameter(default='latest')

    def get_path_dynamic(self):
        return os.path.join(LUIGI_BASE_PATH, 
                            self.__class__.__name__, 
                            self.product_id,
                            ...)

    def output(self):
        return luigi.LocalTarget(self.get_path_dynamic())


class Predict(BaseTask):
    def run(self):
        ...

The added benefit is that you don't need to redefine the same parameters, and the child task's name (Predict or GetNWP) will be inserted into the output path. I'm not sure how the path1, path2, etc. attributes relate to getpath_nwp() and similar functions, since their definitions aren't included in the example, but you can mimic the same functionality using the @property decorator to define getters and setters.
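
As a rough illustration of the @property suggestion, here is a minimal sketch on a plain class rather than a Luigi task. `PathHolder` and its attributes are hypothetical names chosen for this example; the getter builds a default path on first access and the setter lets a caller override it:

```python
import os


class PathHolder:
    """Hypothetical holder that builds its path lazily via a property."""

    def __init__(self, base_dir, product_id):
        self.base_dir = base_dir
        self.product_id = product_id
        self._path = None  # filled on first access or by the setter

    @property
    def path(self):
        # Build the default path only once, then reuse the stored value.
        if self._path is None:
            self._path = os.path.join(
                self.base_dir, type(self).__name__, self.product_id
            )
        return self._path

    @path.setter
    def path(self, value):
        # Accept str or pathlib.Path and store it as a plain string.
        self._path = os.fspath(value)


holder = PathHolder("base", "prodA")
default_path = holder.path
holder.path = os.path.join("other", "place")
overridden_path = holder.path
```

The same getter could presumably live on a shared base task so that `output()` simply wraps `self.path` in a target.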

cangers