I can run my pipelines with the kedro run command without issue. For some reason, though, I can no longer access my context and catalog from Jupyter Notebook. When I run kedro jupyter notebook and start a new (or existing) notebook, choosing my project name under "New", I get the following errors:
context
NameError: name 'context' is not defined
catalog.list()
NameError: name 'catalog' is not defined
EDIT:
After running the magic command %kedro_reload, I can see that init_spark_session in my ProjectContext is looking for files in project_name/notebooks instead of project_name/src. I tried changing the working directory in my Jupyter Notebook session with %cd ../src and os.chdir('../src'), but kedro still looks in the notebooks folder:
%kedro_reload
java.io.FileNotFoundException: File file:/Users/user_name/Documents/app_name/kedro/notebooks/dist/project_name-0.1-py3.8.egg does not exist
So _spark_session.sparkContext.addPyFile() is looking in the wrong place. When I comment this line out of my ProjectContext, the error goes away, but I get another one about my Oracle driver not being found when I try to load a dataset from the catalog:
df = catalog.load('dataset')
java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver
EDIT 2:
For reference:
kedro/src/project_name/context.py:
def init_spark_session(self) -> None:
    """Initialises a SparkSession using the config defined in project's conf folder."""
    # Load the spark configuration in spark.yaml using the config loader
    parameters = self.config_loader.get("spark*", "spark*/**")
    spark_conf = SparkConf().setAll(parameters.items())

    # Initialise the spark session
    spark_session_conf = (
        SparkSession.builder.appName(self.package_name)
        .enableHiveSupport()
        .config(conf=spark_conf)
    )
    _spark_session = spark_session_conf.getOrCreate()
    _spark_session.sparkContext.setLogLevel("WARN")
    _spark_session.sparkContext.addPyFile(f'src/dist/project_name-{__version__}-py3.8.egg')
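The path passed to addPyFile above is relative, so it is resolved against whatever directory the process was started in. That would explain why it works under kedro run and fails in a notebook launched from notebooks/. The following is a minimal sketch of building the path from the project root instead. The helpers find_project_root and egg_path are hypothetical names, not part of Kedro, and the sketch assumes the project root is the nearest parent folder that contains src/. If the context already exposes the project root (KedroContext appears to have a project_path property, but treat that as an assumption for your version), that value could be used in place of the search.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found.

    Hypothetical helper: assumes the project root is the nearest ancestor
    (or `start` itself) that has a `src` sub-directory.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"no parent of {start} contains a '{marker}' directory")


def egg_path(project_root: Path, package: str, version: str, py_tag: str = "py3.8") -> Path:
    """Build the egg path from the project root so it does not depend on the cwd."""
    return project_root / "src" / "dist" / f"{package}-{version}-{py_tag}.egg"
```

From a notebook started in notebooks/, find_project_root(Path.cwd()) would return the folder that contains src/, and str(egg_path(root, "project_name", __version__)) could then be passed to addPyFile in place of the relative string.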
kedro/conf/base/spark.yml:
# You can define spark specific configuration here.
spark.driver.maxResultSize: 8g
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.pyspark.enabled: true
# https://kedro.readthedocs.io/en/stable/11_tools_integration/01_pyspark.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR
# JDBC driver
spark.jars: drivers/ojdbc8-21.1.0.0.jar
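The spark.jars entry is also a relative path, so the ClassNotFoundException for the Oracle driver may have the same cause as the missing egg: the jar is looked up relative to the notebook's working directory. Below is a rough sketch of rewriting that one setting to an absolute path before the parameters are handed to SparkConf. The function absolutise_jars is a hypothetical helper, and the sketch assumes the drivers/ folder sits directly under the project root.

```python
from pathlib import Path
from typing import Any, Dict


def absolutise_jars(parameters: Dict[str, Any], project_root: Path) -> Dict[str, Any]:
    """Return a copy of `parameters` with relative spark.jars entries made absolute.

    Hypothetical helper: entries that are already absolute or that look like
    URIs (contain '://') are left untouched.
    """
    resolved = dict(parameters)
    jars = resolved.get("spark.jars")
    if not jars:
        return resolved

    entries = []
    # spark.jars is a comma-separated list of jar locations
    for entry in str(jars).split(","):
        entry = entry.strip()
        if not entry:
            continue
        if "://" in entry or Path(entry).is_absolute():
            entries.append(entry)
        else:
            entries.append(str(Path(project_root) / entry))

    resolved["spark.jars"] = ",".join(entries)
    return resolved
```

In init_spark_session this could be applied to the loaded parameters before calling SparkConf().setAll(...), with project_root taken from wherever the project root is known.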