
I am able to run my pipelines using the kedro run command without issue. For some reason, though, I can't access my context and catalog from Jupyter Notebook anymore. When I run kedro jupyter notebook and start a new (or existing) notebook, selecting my project name under "New", I get the following errors:

context

NameError: name 'context' is not defined

catalog.list()

NameError: name 'catalog' is not defined
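
For reference, these variables are normally injected by Kedro's IPython startup script when the notebook is launched through kedro jupyter notebook. As a rough workaround sketch (assuming a Kedro version that still exposes kedro.framework.context.load_context), they can also be built by hand from the project root, though in my case that presumably runs into the same path issue described in the edits below:

# Sketch: build the context and catalog manually from the project root,
# assuming kedro.framework.context.load_context is available in this version.
from kedro.framework.context import load_context

context = load_context("/Users/user_name/Documents/app_name/kedro")  # project root
catalog = context.catalog
catalog.list()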

EDIT:

After running the magic command %kedro_reload I can see that my ProjectContext init_spark_session is looking for files in project_name/notebooks instead of project_name/src. I tried changing the working directory in my Jupyter Notebook session with %cd ../src and os.chdir('../src'), but Kedro still looks in the notebooks folder:

%kedro_reload

java.io.FileNotFoundException: File file:/Users/user_name/Documents/app_name/kedro/notebooks/dist/project_name-0.1-py3.8.egg does not exist

_spark_session.sparkContext.addPyFile() is looking in the wrong place. When I comment out this line from my ProjectContext, the error goes away, but I then get another one about not being able to find my Oracle driver when I try to load a dataset from the catalog:

df = catalog.load('dataset')

java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver
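
In case it helps with the diagnosis, a check along these lines (just a sketch) should show which directory the relative paths are being resolved against and what Spark actually received for spark.jars:

# Sketch: print the notebook kernel's working directory and the spark.jars
# value Spark was configured with; both the egg and the driver jar use
# relative paths, so they resolve against this directory.
import os
from pyspark.sql import SparkSession

print(os.getcwd())
spark = SparkSession.builder.getOrCreate()  # reuses the existing session if one was created
print(spark.sparkContext.getConf().get("spark.jars", "not set"))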

EDIT 2:

For reference:

kedro/src/project_name/context.py

    def init_spark_session(self) -> None:
        """Initialises a SparkSession using the config defined in project's conf folder."""

        # Load the spark configuration in spark.yaml using the config loader
        parameters = self.config_loader.get("spark*", "spark*/**")
        spark_conf = SparkConf().setAll(parameters.items())

        # Initialise the spark session
        spark_session_conf = (
            SparkSession.builder.appName(self.package_name)
            .enableHiveSupport()
            .config(conf=spark_conf)
        )
        _spark_session = spark_session_conf.getOrCreate()
        _spark_session.sparkContext.setLogLevel("WARN")
        _spark_session.sparkContext.addPyFile(f'src/dist/project_name-{__version__}-py3.8.egg')

kedro/conf/base/spark.yml:

# You can define spark specific configuration here.

spark.driver.maxResultSize: 8g
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.pyspark.enabled: true

# https://kedro.readthedocs.io/en/stable/11_tools_integration/01_pyspark.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR

# JDBC driver
spark.jars: drivers/ojdbc8-21.1.0.0.jar

1 Answer


I think a combination of the following might help you:

  • Generally, let's try to avoid manually interfering with the current working directory, so let's remove the os.chdir call from your notebook. Construct absolute paths where possible.
  • In your init_spark_session, pass an absolute path to addPyFile. self.project_path points to the root directory of your Kedro project, so you can use it to construct the path to your PyFile accordingly, e.g. _spark_session.sparkContext.addPyFile(f'{self.project_path}/src/dist/project_name-{__version__}-py3.8.egg') (see the sketch after this list).
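
A minimal sketch of the method with that change (same code as in your question, only the addPyFile path anchored to self.project_path; the pyspark imports and the __version__ import at the top of context.py stay as they are):

    def init_spark_session(self) -> None:
        """Initialises a SparkSession using the config defined in project's conf folder."""

        # Load the spark configuration in spark.yaml using the config loader
        parameters = self.config_loader.get("spark*", "spark*/**")
        spark_conf = SparkConf().setAll(parameters.items())

        # Initialise the spark session
        spark_session_conf = (
            SparkSession.builder.appName(self.package_name)
            .enableHiveSupport()
            .config(conf=spark_conf)
        )
        _spark_session = spark_session_conf.getOrCreate()
        _spark_session.sparkContext.setLogLevel("WARN")

        # self.project_path is the Kedro project root, so the egg resolves
        # correctly regardless of the notebook kernel's working directory
        _spark_session.sparkContext.addPyFile(
            f"{self.project_path}/src/dist/project_name-{__version__}-py3.8.egg"
        )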

Not sure why you would need to add the PyFile though, but maybe you have a specific reason.
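
As for the ClassNotFoundException on the Oracle driver, my guess is it has the same root cause: spark.jars in your spark.yml is also a relative path (drivers/ojdbc8-21.1.0.0.jar), so it won't resolve from the notebooks folder. Assuming the jar lives under the project root, one option (again just a sketch) is to override it with an absolute path right after the SparkConf is built, before it is passed to the builder:

        # Sketch: resolve the JDBC driver jar against the project root as well,
        # instead of relying on the relative spark.jars entry from spark.yml
        spark_conf.set("spark.jars", f"{self.project_path}/drivers/ojdbc8-21.1.0.0.jar")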
