Questions tagged [kedro]

Kedro is an open source Python library that helps you build production-ready data and analytics pipelines.

90 questions
5 votes, 1 answer

DataBricks + Kedro Vs GCP + Kubeflow Vs Server + Kedro + Airflow

We are deploying a data consortium between more than 10 companies. We will deploy several machine learning models (in general, advanced analytics models) for all the companies, and we will administer all of the models. We are looking for a solution…
5 votes, 1 answer

How to process huge datasets in kedro

I have a pretty big (~200 GB, ~20M lines) raw JSONL dataset. I need to extract the important properties from it and store the intermediate dataset in CSV for further conversion into something like HDF5, Parquet, etc. Obviously, I can't use JSONDataSet…
eawer
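A common way to handle data of this size is to stream the JSONL file in chunks with pandas instead of loading it all at once, either in a node or in a one-off preparation script. A minimal sketch, assuming placeholder file names and columns (pandas supports lines=True together with chunksize in read_json):

    import pandas as pd

    # Sketch only: stream the raw JSONL file in chunks and append the extracted
    # properties to a single CSV. "raw.jsonl", "intermediate.csv" and the column
    # list are placeholders, not taken from the question.
    reader = pd.read_json("raw.jsonl", lines=True, chunksize=100_000)
    for i, chunk in enumerate(reader):
        subset = chunk[["id", "timestamp", "label"]]  # keep only the needed properties
        subset.to_csv("intermediate.csv", mode="a", header=(i == 0), index=False)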
4 votes, 1 answer

Where to perform the saving of a node output in Kedro?

In Kedro, we can pipeline different nodes and partially run some of them. When we partially run some nodes, we need to save some of the node inputs somewhere so that, when another node is run, it can access the data that the previous node has…
Baenka
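In recent Kedro versions the usual answer is that nothing is saved inside the node itself: any node output whose name is registered in the data catalog is persisted automatically once the node finishes, and unregistered outputs only live in memory. A minimal sketch using the Python equivalent of a catalog.yml entry (the dataset name and file path are placeholders):

    from kedro.io import DataCatalog

    # Sketch only: because "preprocessed_data" is declared in the catalog, Kedro
    # writes it to disk as soon as the node that outputs it finishes, and a later
    # partial run can load it again without re-running the earlier node.
    catalog = DataCatalog.from_config({
        "preprocessed_data": {
            "type": "pandas.CSVDataSet",
            "filepath": "data/02_intermediate/preprocessed_data.csv",
        }
    })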
3 votes, 2 answers

Override nested parameters using kedro run CLI command

I am using nested parameters in my parameters.yml and would like to override these using runtime parameters for the kedro run CLI command:

    train:
      batch_size: 32
      train_ratio: 0.9
      epochs: 5

The following doesn't seem to work: kedro run…
evolved
3 votes, 1 answer

How can I read/write data from/to network attached storage with kedro?

In the API docs for kedro.io and kedro.contrib.io I could not find any info about how to read/write data from/to network-attached storage, such as a FritzBox NAS.
thinwybk
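One hedged option: Kedro datasets ultimately read and write through file paths, so a NAS share that is mounted on the machine running Kedro (e.g. via SMB/CIFS) behaves like any other local directory. A minimal sketch with placeholder paths, assuming a recent Kedro version where the pandas datasets live under kedro.extras.datasets:

    from kedro.extras.datasets.pandas import CSVDataSet

    # Sketch only: /mnt/nas is assumed to be an already-mounted network share.
    data_set = CSVDataSet(filepath="/mnt/nas/projects/raw_data.csv")
    df = data_set.load()
    data_set.save(df)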
3 votes, 1 answer

How to write a list of dataframes into multiple sheets of ExcelLocalDataSet?

The input is a list of dataframes. How can I save it into an ExcelLocalDataSet where each dataframe is a separate sheet?
James Wong
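If ExcelLocalDataSet in your version does not support multiple sheets, a hedged workaround is to do the writing with plain pandas inside a node (or a small custom dataset). A minimal sketch with placeholder names:

    import pandas as pd

    # Sketch only: write each dataframe in the list to its own sheet of one file.
    # "report.xlsx" and the sheet-name pattern are placeholders.
    def save_sheets(dataframes: list) -> None:
        with pd.ExcelWriter("report.xlsx") as writer:
            for i, df in enumerate(dataframes):
                df.to_excel(writer, sheet_name=f"sheet_{i}", index=False)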
3 votes, 2 answers

Pipeline can't find nodes in kedro

I was following the pipelines tutorial, created all the needed files, and started Kedro with kedro run --node=preprocessing_data, but got stuck with this error message: ValueError: Pipeline does not contain nodes named ['preprocessing_data']. If I run kedro…
eawer
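This error usually means no node actually carries that name: kedro run --node=<name> matches the name= argument passed to node(), not the pipeline name or the function name. A minimal sketch, with all identifiers as placeholders:

    from kedro.pipeline import Pipeline, node

    def preprocess_data(raw_data):
        # placeholder preprocessing logic
        return raw_data

    def create_pipeline(**kwargs):
        return Pipeline([
            node(
                func=preprocess_data,
                inputs="raw_data",
                outputs="preprocessed_data",
                name="preprocessing_data",   # this is what --node= refers to
            ),
        ])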
3 votes, 1 answer

Kedro with MongoDB and other document databases?

What's the best practice for using kedro with MongoDB or other document databases? MongoDB, for example, doesn't have a query language analogous to SQL. Most Mongo "queries" in Python (using PyMongo) will look something like this: from pymongo…
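One hedged pattern is to wrap the PyMongo calls in a custom dataset, so nodes only receive and return lists of documents while the connection details live in the catalog. A minimal sketch built on kedro.io.AbstractDataSet, with placeholder names throughout:

    from typing import Any, Dict, List

    from kedro.io import AbstractDataSet
    from pymongo import MongoClient

    # Sketch only: a custom dataset that loads/saves documents for one Mongo
    # collection. URI, database, collection and query are placeholders meant to
    # come from catalog.yml / credentials.yml.
    class MongoCollectionDataSet(AbstractDataSet):
        def __init__(self, uri: str, database: str, collection: str, query: Dict = None):
            self._uri = uri
            self._database = database
            self._collection = collection
            self._query = query or {}

        def _load(self) -> List[Dict[str, Any]]:
            client = MongoClient(self._uri)
            return list(client[self._database][self._collection].find(self._query))

        def _save(self, data: List[Dict[str, Any]]) -> None:
            client = MongoClient(self._uri)
            client[self._database][self._collection].insert_many(data)

        def _describe(self) -> Dict[str, Any]:
            return dict(database=self._database, collection=self._collection, query=self._query)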
3 votes, 1 answer

How to convert Spark data frame to Pandas and back in Kedro?

I'm trying to understand the optimal way in Kedro to convert a Spark DataFrame coming out of one node into a pandas DataFrame required as input for another node, without creating a redundant conversion step.
Dmitry Deryabin
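Two hedged options, depending on the Kedro version: a tiny dedicated conversion node, or catalog transcoding (declaring the same dataset under names like dataset@spark and dataset@pandas), which removes the extra node entirely. A minimal sketch of the node variant, with placeholder dataset names:

    from kedro.pipeline import node

    # Sketch only: the conversion lives in its own node, so the producing and
    # consuming nodes stay unaware of each other's dataframe flavour.
    def spark_to_pandas(spark_df):
        return spark_df.toPandas()

    conversion_node = node(spark_to_pandas, inputs="spark_output", outputs="pandas_input")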
3 votes, 1 answer

How to change the process count of the ParallelRunner in Kedro?

My pipeline makes a lot of HTTP requests. Since it's not a CPU-heavy operation, I'd like to spin up more processes than the number of CPU cores. How can I change this?
921Kiyo
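ParallelRunner itself accepts a max_workers argument, so one hedged option (assuming Kedro >= 0.17 and a placeholder project name) is to run the pipeline programmatically instead of through the CLI default:

    from kedro.framework.session import KedroSession
    from kedro.runner import ParallelRunner

    # Sketch only: raise the process count above the CPU-core default. For
    # I/O-bound work such as HTTP requests, kedro.runner.ThreadRunner may also be
    # worth a look where it is available.
    with KedroSession.create("my_project") as session:   # "my_project" is a placeholder
        session.run(runner=ParallelRunner(max_workers=32))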
3 votes, 1 answer

How to run the nodes in sequence as declared in kedro pipeline?

In a Kedro pipeline, nodes (something like Python functions) are declared sequentially. In some cases, the input of one node is the output of the previous node. However, sometimes, when the kedro run API is called on the command line, the nodes are not run…
Baenka
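Kedro derives the execution order from data dependencies rather than from the order in which nodes are listed, so the usual way to force an ordering is to make the later node consume something the earlier node produces. A minimal sketch with placeholder names:

    from kedro.pipeline import Pipeline, node

    def first():
        return "done"

    def second(first_result):
        # runs only after `first`, because it needs first_result
        return f"ran after {first_result}"

    pipeline = Pipeline([
        node(first, inputs=None, outputs="first_result", name="first"),
        node(second, inputs="first_result", outputs="second_result", name="second"),
    ])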
2 votes, 1 answer

Kedro context and catalog missing from Jupyter Notebook

I am able to run my pipelines using the kedro run command without issue. For some reason, though, I can't access my context and catalog from Jupyter Notebook anymore. When I run kedro jupyter notebook and start a new (or existing) notebook using my…
Pierre Delecto
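A hedged first thing to try, assuming the notebook was started with kedro jupyter notebook from the project root: reload Kedro's IPython integration, which recreates the context and catalog variables in the notebook namespace.

    # Sketch only, run inside a notebook cell. Depending on the Kedro version this
    # is exposed as a line magic or as a plain function injected at startup:
    %reload_kedro          # newer versions
    # reload_kedro()       # older versions

    catalog.list()         # should list the datasets from catalog.yml again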
2 votes, 1 answer

PartitionedDataSet not found when Kedro pipeline is run in Docker

I have multiple text files in an S3 bucket which I read and process. So, I defined a PartitionedDataSet in the Kedro data catalog, which looks like this:

    raw_data:
      type: PartitionedDataSet
      path: s3://reads/raw
      dataset: pandas.CSVDataSet
      load_args: …
mendo
2 votes, 1 answer

How to use tf.data.Dataset with kedro?

I am using tf.data.Dataset to prepare a streaming dataset which is used to train a tf.keras model. With Kedro, is there a way to create a node and return the created tf.data.Dataset to use it in the next training node? The MemoryDataset will…
evolved
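One hedged option: tf.data.Dataset objects generally can't be deep-copied, and Kedro's in-memory dataset deep-copies by default, so telling it to hand the object over as-is may be enough. A minimal sketch (the entry name is a placeholder; the same thing can be expressed in catalog.yml with copy_mode: assign):

    from kedro.io import DataCatalog, MemoryDataSet

    # Sketch only: pass the tf.data.Dataset between nodes by reference instead of
    # letting MemoryDataSet try to deep-copy it.
    catalog = DataCatalog({"training_dataset": MemoryDataSet(copy_mode="assign")})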
2 votes, 1 answer

How to catalog datasets & models by S3 URI, but keep a local copy?

I'm trying to figure out how to store intermediate Kedro pipeline objects both locally AND on S3. In particular, say I have a dataset on S3:

    my_big_dataset.hdf5:
      type: kedro.extras.datasets.pandas.HDFDataSet
      filepath: …
crypdick
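As far as I know there is no single built-in dataset for this, so one hedged workaround is to have the producing node return the object twice and register one output under an s3:// filepath and the other under a local filepath in catalog.yml. A minimal sketch with placeholder names:

    from kedro.pipeline import node

    def train_model(my_big_dataset):
        model = ...  # placeholder training logic
        # returned twice so Kedro saves it to both catalog entries
        return model, model

    training_node = node(
        train_model,
        inputs="my_big_dataset",
        outputs=["model_s3", "model_local"],   # one S3-backed entry, one local entry
    )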