Questions tagged [kedro]

Kedro is an open source Python library that helps you build production-ready data and analytics pipelines

90 questions
0
votes
0 answers

Kedro-mlflow usage - when to use it from notebooks, and when from kedro pipeline?

I'm a bit confused - what is the common practice for kedro-mlflow usage? It seems slightly uncomfortable to use it only from kedro pipelines, but kedro's intention is fully reproducible research. At the same time rather rare tutorials on…
0
votes
1 answer

How to use Chunk Size for kedro.extras.datasets.pandas.SQLTableDataSet in the kedro pipeline?

I am using kedro.extras.datasets.pandas.SQLTableDataSet and would like to use the chunksize argument from pandas. However, when running the pipeline, the table gets treated as a generator instead of a pd.DataFrame. How would you use the…
0
votes
1 answer

Failed while loading data from data set SQLQueryDataSet

I am receiving this error: DataSetError: Failed while loading data from data set SQLQueryDataSet(load_args={}, sql=select * from table) when I run (within kedro jupyter…
0
votes
1 answer

Adding stream_results=True (execution_options) to kedro.extras.datasets.pandas.SQLQueryDataSet

Is it possible to add execution_options to kedro.extras.datasets.pandas.SQLQueryDataSet? For example, I would like to add stream_results=True to the connection string. engine = create_engine( "postgresql://postgres:pass@localhost/example" ) conn =…
0
votes
0 answers

Kedro: Save logging messages by namespace in the pipeline

Intro: I am working on a project where I have several different target variables and we utilize the same modeling framework in Kedro to peg a pipeline to each of the target variables. Each pipeline is defined with its own namespace. I have a…
tabris
  • 1
0
votes
2 answers

Parquet file larger than memory consumption of pandas DataFrame

I am storing two different pandas DataFrames as parquet files (through kedro). Both DataFrames have identical dimensions and dtypes (float32) before getting written to disk. Also, their memory consumption in RAM is…
Nils Blum-Oeste
  • 5,588
  • 4
  • 21
  • 25
0
votes
2 answers

Kedro Conditional Pipes (or alternatives)

I am currently examining different design pattern options for our pipelines. The Kedro framework seems like a good option (allowing a modular design pattern, visualization methods, etc.). The pipelines should be created out of many modules that are…
Jumpman
  • 35
  • 5
0
votes
3 answers

What does this Python function signature mean in the Kedro tutorial?

I am looking at the Kedro library, as my team is looking into using it for our data pipeline. While going through the official tutorial (Spaceflight), I came across this function: def preprocess_companies(companies: pd.DataFrame) ->…
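For readers unfamiliar with the syntax in the excerpt above, here is a minimal sketch of Python type annotations. The function `double_all` is hypothetical and not part of Kedro or its tutorial; it only illustrates that `name: Type` annotates a parameter and `-> Type` annotates the return value.

```python
from typing import get_type_hints


def double_all(values: list[float]) -> list[float]:
    """Return a new list with every value doubled.

    `values: list[float]` documents the expected argument type and
    `-> list[float]` documents the return type.
    """
    return [v * 2 for v in values]


# Annotations are stored as metadata on the function object.
hints = get_type_hints(double_all)

# Normal call that matches the annotations.
result = double_all([1.0, 2.5])

# Python does not enforce annotations at runtime: a tuple of ints
# is accepted even though the annotation says list[float].
unchecked = double_all((1, 2))
```

Annotations are hints for readers and for static tools such as type checkers; they do not change how the function runs.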
0
votes
1 answer

TemplatedConfigLoader in register_config_loader not replacing patterns in catalog.yml (kedro)

I am using kedro to manage some data, for which I have a number of dataset CSVs in the same location. As described here, I should be able to store the filepath to this location in a globals.yml file, and use the ${...} syntax in my catalog, but I…
0
votes
0 answers

Kedro 0.17 Override global.yml with extra params

I'm currently not able to update the globals.yml file with extra params passed at run time as I previously did with Kedro 0.16.x. I run kedro through run.py. @hook_impl def register_config_loader(self, conf_paths: Iterable[str]) ->…
0
votes
1 answer

SQLAlchemy Oracle - InvalidRequestError: could not retrieve isolation level

I am having problems accessing tables in an Oracle database over a SQLAlchemy connection. Specifically, I am using Kedro catalog.load('table_name') and getting the error message Table table_name not found. So I decided to test my connection using…
Pierre Delecto
  • 342
  • 1
  • 3
  • 19
0
votes
1 answer

Parallelism for Entire Kedro Pipeline

I am working on a project where we are processing very large images. The pipeline has several nodes, where each produces output necessary for the next node to run. My understanding is that the ParallelRunner is running the nodes in parallel. It is…
0
votes
2 answers

Kedro install - Cannot uninstall `terminado`

When running kedro install I get the following error: Attempting uninstall: terminado Found existing installation: terminado 0.8.3 ERROR: Cannot uninstall 'terminado'. It is a distutils installed project and thus we cannot accurately determine…
zeh
  • 765
  • 1
  • 7
  • 24
0
votes
1 answer

Specify Kedro data version within DataCatalog?

Is it possible to define data version with Kedro type: pandas.CSVDataSet filepath: data/01_raw/company/cars.csv versioned: True load_version: $USER_DEFINED_VERSION # Wanted to do this Currently, Kedro supports using a CLI to specify load…
mediumnok
  • 101
  • 1
  • 6
0
votes
1 answer

How do I reproduce experiments or specify the nodes execution order in Kedro?

Since kedro determines the execution graph based on the nodes' inputs/outputs, the order of execution is non-deterministic. It can vary between runs. Even when I set a seed I may sample different data in different runs. Let's say I have 3 nodes that…
mediumnok
  • 101
  • 1
  • 6
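The idea in the last question, that execution order follows from each node's inputs and outputs, can be sketched with the standard library alone. This is not Kedro's API or its actual scheduler: the node names, dataset names, and the `execution_order` helper are all hypothetical. The sketch assumes that making one node consume another node's output is one way to pin their relative order.

```python
from graphlib import TopologicalSorter


def execution_order(nodes):
    """Order nodes so each runs after the nodes that produce its inputs.

    `nodes` maps a node name to a pair (inputs, outputs) of dataset names.
    """
    # Map each dataset to the node that produces it.
    producer = {out: name for name, (_, outs) in nodes.items() for out in outs}
    # A node depends on whichever nodes produce its inputs.
    graph = {
        name: {producer[i] for i in ins if i in producer}
        for name, (ins, _) in nodes.items()
    }
    return list(TopologicalSorter(graph).static_order())


# sample_a and sample_b both read only "raw", so nothing in the graph
# constrains which of the two runs first.
free = {
    "sample_a": (["raw"], ["a"]),
    "sample_b": (["raw"], ["b"]),
    "combine": (["a", "b"], ["out"]),
}

# Making sample_b also consume "a" forces it to run after sample_a.
forced = {
    "sample_a": (["raw"], ["a"]),
    "sample_b": (["raw", "a"], ["b"]),
    "combine": (["a", "b"], ["out"]),
}

free_order = execution_order(free)
forced_order = execution_order(forced)
```

In the first graph only the position of `combine` is fixed by the dependencies; in the second, the extra input leaves a single valid order.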