Datasets documentation
You are viewing v2.6.1 version. A newer version v4.8.4 is available.
Beam Datasets
Some datasets are too large to be processed on a single machine. Instead, you can process them with Apache Beam, a library for parallel data processing. The processing pipeline is executed on a distributed processing backend such as Apache Flink, Apache Spark, or Google Cloud Dataflow.
We have already created Beam pipelines for some of the larger datasets, such as wikipedia and wiki40b. You can load these as usual with load_dataset(). If you want to run your own Beam pipeline with Dataflow, follow these steps:
- Specify the dataset and configuration you want to process:
DATASET_NAME=your_dataset_name # ex: wikipedia
CONFIG_NAME=your_config_name # ex: 20220301.en
- Input your Google Cloud Platform information:
PROJECT=your_project
BUCKET=your_bucket
REGION=your_region
- Specify your Python requirements:
echo "datasets" > /tmp/beam_requirements.txt
echo "apache_beam" >> /tmp/beam_requirements.txt
- Run the pipeline:
datasets-cli run_beam datasets/$DATASET_NAME \
--name $CONFIG_NAME \
--save_info \
--cache_dir gs://$BUCKET/cache/datasets \
--beam_pipeline_options=\
"runner=DataflowRunner,project=$PROJECT,job_name=$DATASET_NAME-gen,"\
"staging_location=gs://$BUCKET/binaries,temp_location=gs://$BUCKET/temp,"\
"region=$REGION,requirements_file=/tmp/beam_requirements.txt"

When you run your pipeline, you can adjust the parameters to change the runner (Flink or Spark), output location (S3 bucket or HDFS), and the number of workers.
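Because the pipeline options are passed as one comma-separated string, it can help to assemble that string separately and inspect it before launching a job. The following is a minimal sketch, not part of datasets-cli: `join_beam_options` is a hypothetical helper, the variable values are placeholders, and `num_workers` is assumed here to be the Apache Beam worker option you want to set (check the options supported by your runner).

```shell
# Hypothetical helper (not part of datasets-cli): joins its arguments with
# commas, the format used for --beam_pipeline_options above.
join_beam_options() {
  local IFS=,
  echo "$*"
}

# Placeholder values; replace them with your own settings.
DATASET_NAME=wikipedia
PROJECT=your_project
BUCKET=your_bucket
REGION=your_region

BEAM_OPTIONS=$(join_beam_options \
  "runner=DataflowRunner" \
  "project=$PROJECT" \
  "job_name=$DATASET_NAME-gen" \
  "staging_location=gs://$BUCKET/binaries" \
  "temp_location=gs://$BUCKET/temp" \
  "region=$REGION" \
  "requirements_file=/tmp/beam_requirements.txt" \
  "num_workers=4")   # assumed Beam worker option; adjust or drop

# Inspect the assembled string before launching a job.
echo "$BEAM_OPTIONS"
```

The resulting string can then be passed to the command above as `--beam_pipeline_options="$BEAM_OPTIONS"`. Keeping one option per line makes it easier to swap the runner or change the worker count without editing a long quoted string.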