Kedro is an open-source Python library that helps you build production-ready data and analytics pipelines.
Questions tagged [kedro]
90 questions
5 votes, 1 answer
DataBricks + Kedro Vs GCP + Kubeflow Vs Server + Kedro + Airflow
We are deploying a data consortium between more than 10 companies. We will deploy several machine learning models (advanced analytics models in general) for all the companies, and we will administer all the models. We are looking for a solution…
Erick Translateur
5 votes, 1 answer
How to process huge datasets in kedro
I have a pretty big (~200 GB, ~20M lines) raw jsonl dataset. I need to extract important properties from it and store the intermediate dataset as CSV for further conversion into something like HDF5, Parquet, etc. Obviously, I can't use JSONDataSet…
eawer
4 votes, 1 answer
Where to perform the saving of a node output in Kedro?
In Kedro, we can pipeline different nodes and partially run some of them. When we partially run some nodes, we need to save some outputs from the nodes somewhere so that when another node is run it can access the data that the previous node has…
Baenka
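In Kedro, saving is declarative rather than something a node does itself: if a node's output name has a catalog entry, the framework persists it after the node runs, and a later partial run loads it from the same entry. A sketch of such a catalog entry (the path and dataset type are illustrative, and type names vary by Kedro version):

```yaml
preprocessed_data:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/preprocessed.csv
```

Outputs without a catalog entry stay in memory only and are lost between runs.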
3 votes, 2 answers
Override nested parameters using kedro run CLI command
I am using nested parameters in my parameters.yml and would like to override these using runtime parameters for the kedro run CLI command:
train:
  batch_size: 32
  train_ratio: 0.9
  epochs: 5
The following doesn't seem to work:
kedro run…
evolved
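Whether `kedro run --params` accepts dotted keys for nested values depends on the Kedro version, so check the CLI docs for your release. Conceptually, the override has to walk the nested dict and set a leaf, which the following hypothetical helper (not part of Kedro's API) illustrates:

```python
def apply_override(params, dotted_key, value):
    """Set a nested key like 'train.batch_size' in a params dict,
    mimicking what a runtime parameter override must achieve."""
    keys = dotted_key.split(".")
    node = params
    for k in keys[:-1]:
        # Descend, creating intermediate dicts if they are missing.
        node = node.setdefault(k, {})
    node[keys[-1]] = value
    return params
```

Note that only the addressed leaf changes; sibling keys such as `train.epochs` keep their values from parameters.yml.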
3 votes, 1 answer
How can I read/write data from/to network attached storage with kedro?
In the API docs about kedro.io and kedro.contrib.io I could not find info about how to read/write data from/to network-attached storage such as a FritzBox NAS.
thinwybk
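One workable approach (an assumption on my part, not an official FritzBox integration): mount the NAS share on the local filesystem, e.g. via SMB, and point an ordinary catalog entry at the mount. The path and dataset type below are illustrative, and type names vary by Kedro version:

```yaml
nas_data:
  type: pandas.CSVDataSet
  filepath: /mnt/fritzbox_nas/data/input.csv
```

From Kedro's perspective this is just a local file, so no special dataset class is needed.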
3 votes, 1 answer
How to write a list of dataframes into multiple sheets of ExcelLocalDataSet?
The input is a list of dataframes. How can I save it into an ExcelLocalDataSet where each dataframe is a separate sheet?
James Wong
3 votes, 2 answers
Pipeline can't find nodes in kedro
I was following the pipelines tutorial, created all the needed files, and started Kedro with kedro run --node=preprocessing_data, but got stuck with this error message:
ValueError: Pipeline does not contain nodes named ['preprocessing_data'].
If I run kedro…
eawer
3 votes, 1 answer
Kedro with MongoDB and other document databases?
What's the best practice for using kedro with MongoDB or other document databases? MongoDB, for example, doesn't have a query language analogous to SQL. Most Mongo "queries" in Python (using PyMongo) will look something like this:
from pymongo…
Benjamin Jack
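Since Mongo queries are imperative PyMongo calls rather than something a catalog entry can declare, one common pattern (a sketch, not an official Kedro integration) is to run the query inside the node body and let Kedro handle only the returned records as an in-memory dataset:

```python
def fetch_recent(collection, since):
    """Hypothetical node body: query a PyMongo-style collection and return
    plain records that downstream nodes can consume."""
    return list(collection.find({"timestamp": {"$gte": since}}))
```

The connection itself would typically come from project configuration or a hook, keeping credentials out of node code.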
3 votes, 1 answer
How to convert Spark data frame to Pandas and back in Kedro?
I'm trying to understand the optimal way in Kedro to convert a Spark dataframe coming out of one node into the Pandas dataframe required as input for another node, without creating a redundant conversion step.
Dmitry Deryabin
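Kedro supports "transcoding" for exactly this case: the same file is registered twice with an `@` suffix, so one node can save it via Spark and the next can load it via pandas, with no explicit conversion node. A sketch (filepath and dataset type names are illustrative and vary by Kedro version):

```yaml
my_dataframe@spark:
  type: spark.SparkDataSet
  filepath: data/02_intermediate/df.parquet
  file_format: parquet

my_dataframe@pandas:
  type: pandas.ParquetDataSet
  filepath: data/02_intermediate/df.parquet
```

Nodes reference `my_dataframe@spark` or `my_dataframe@pandas`, and Kedro treats both entries as the same dataset for dependency resolution.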
3 votes, 1 answer
How to change the process count of the ParallelRunner in Kedro?
My pipeline makes a lot of HTTP requests. Since this is not a CPU-heavy operation, I'd like to spin up more processes than the number of CPU cores. How can I change this?
921Kiyo
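When constructed programmatically, Kedro's `ParallelRunner` accepts a worker-count argument (CLI support varies by version, so check the docs for your release). The underlying principle, that IO-bound work tolerates far more workers than CPU cores, can be shown with a generic stdlib sketch where all names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(items, fetch, max_workers=32):
    """Run an IO-bound `fetch` over `items` with more workers than cores;
    threads spend most of their time waiting, not computing."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, items))
```

For HTTP-heavy pipelines, threads (or async IO) inside a single node are often a simpler fit than multiplying processes.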
3 votes, 1 answer
How to run the nodes in sequence as declared in kedro pipeline?
In a Kedro pipeline, nodes (something like Python functions) are declared sequentially. In some cases, the input of one node is the output of the previous node. However, sometimes, when the kedro run API is called on the command line, the nodes are not run…
Baenka
2 votes, 1 answer
Kedro context and catalog missing from Jupyter Notebook
I am able to run my pipelines using the kedro run command without issue. For some reason, though, I can't access my context and catalog from a Jupyter Notebook anymore. When I run kedro jupyter notebook and start a new (or existing) notebook using my…
Pierre Delecto
2 votes, 1 answer
PartitionedDataSet not found when Kedro pipeline is run in Docker
I have multiple text files in an S3 bucket which I read and process. So, I defined a PartitionedDataSet in the Kedro data catalog, which looks like this:
raw_data:
  type: PartitionedDataSet
  path: s3://reads/raw
  dataset: pandas.CSVDataSet
  load_args:
    …
mendo
2 votes, 1 answer
How to use tf.data.Dataset with kedro?
I am using tf.data.Dataset to prepare a streaming dataset which is used to train a tf.keras model. With Kedro, is there a way to create a node and return the created tf.data.Dataset to use it in the next training node?
The MemoryDataset will…
evolved
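The usual obstacle is that MemoryDataSet deep-copies data between nodes by default, which fails for objects like a tf.data.Dataset. In recent Kedro versions the copy mode can be relaxed so the object is passed by reference; a sketch of such a catalog entry:

```yaml
train_dataset:
  type: MemoryDataSet
  copy_mode: assign
```

With `assign`, downstream nodes receive the same object rather than a copy, which suits non-copyable streaming datasets.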
2 votes, 1 answer
How to catalog datasets & models by S3 URI, but keep a local copy?
I'm trying to figure out how to store intermediate Kedro pipeline objects both locally AND on S3. In particular, say I have a dataset on S3:
my_big_dataset.hdf5:
  type: kedro.extras.datasets.pandas.HDFDataSet
  filepath:…
crypdick