
I am very new to Google Cloud and cloud servers, and I am stuck on a very basic question.

I would like to bulk download ~60,000 csv.gz files from an internet server (with permission). I have compiled the curl commands, each piping into a gsutil upload to my bucket, into a single .sh file that looks like the following.

curl http://internet.address/csvs/file1.csv.gz | gsutil cp - gs://my_bucket/file1.csv.gz
curl http://internet.address/csvs/file2.csv.gz | gsutil cp - gs://my_bucket/file2.csv.gz
...
curl http://internet.address/csvs/file60000.csv.gz | gsutil cp - gs://my_bucket/file60000.csv.gz

However, this would take ~10 days if I ran it from my machine, so I'd like to run it from the cloud directly. I don't know the best way to do this: the process is too long to run in Cloud Shell, and I'm not sure which other Google Cloud service is best suited to running an .sh script that downloads into a Cloud Storage bucket, or whether this kind of script is even the most efficient way to bulk download files from the internet on Google Cloud.

I've seen some advice to use the Cloud SDK, which I've installed on my local machine, but I don't even know where to start with that.

Any help with this is greatly appreciated!

morepenguins

2 Answers


gcloud and Cloud Storage don't offer a way to grab objects from the internet and copy them directly into a bucket without an intermediary (a computer, server, or cloud application).

As for which Cloud service can run a bash script for you, you can use a GCE always-free f1-micro VM instance (one free instance per billing account).
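
If it helps, here is a minimal sketch of creating and connecting to such an instance with the gcloud CLI; the instance name curl-worker, the zone, and the Debian image are illustrative assumptions, not requirements:

# create an always-free-eligible f1-micro VM with read/write access to Cloud Storage
gcloud compute instances create curl-worker \
    --machine-type=f1-micro \
    --zone=us-central1-a \
    --image-family=debian-11 \
    --image-project=debian-cloud \
    --scopes=storage-rw

# SSH into the instance to run the script there
gcloud compute ssh curl-worker --zone=us-central1-a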

To speed up the uploads to the bucket, you can use GNU parallel to run multiple curl commands at the same time and shorten the total time for the task.

To install parallel on Ubuntu/Debian, run this command:

sudo apt-get install parallel

For example, you can create a file called downloads containing the commands you want to parallelize (every curl command must be written into the file; a loop for generating it is sketched after the listing).

downloads file

curl http://internet.address/csvs/file1.csv.gz | gsutil cp - gs://my_bucket/file1.csv.gz
curl http://internet.address/csvs/file2.csv.gz | gsutil cp - gs://my_bucket/file2.csv.gz
curl http://internet.address/csvs/file3.csv.gz | gsutil cp - gs://my_bucket/file3.csv.gz
curl http://internet.address/csvs/file4.csv.gz | gsutil cp - gs://my_bucket/file4.csv.gz
curl http://internet.address/csvs/file5.csv.gz | gsutil cp - gs://my_bucket/file5.csv.gz
curl http://internet.address/csvs/file6.csv.gz | gsutil cp - gs://my_bucket/file6.csv.gz
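
Rather than writing all 60,000 lines by hand, you could generate the downloads file with a small loop; this sketch assumes the files really are named file1.csv.gz through file60000.csv.gz as in the question:

# emit one "curl | gsutil cp" line per file into the downloads file
for i in $(seq 1 60000); do
  echo "curl http://internet.address/csvs/file${i}.csv.gz | gsutil cp - gs://my_bucket/file${i}.csv.gz"
done > downloads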

After that, you simply need to run the following command:

parallel --jobs 2 < downloads

This will run up to 2 curl commands in parallel until every command in the file has been executed.

Another improvement you can apply to your routine is to use gsutil mv instead of gsutil cp: mv deletes the local file after a successful upload, which saves space on your hard drive. (Note that this only helps if you download each file to disk first; when streaming through a pipe as above, there is no local file to delete.)
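
A quick sketch of that download-to-disk variant (the temporary path is an illustrative assumption):

# download to a local temporary file, then upload and delete it in one step
curl -o /tmp/file1.csv.gz http://internet.address/csvs/file1.csv.gz
gsutil mv /tmp/file1.csv.gz gs://my_bucket/file1.csv.gz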

Jan Hernandez
  • Thank you @JAHernandez! This answer is super helpful, and I really appreciate your tip about the parallelization! – morepenguins Sep 30 '20 at 01:25
  • Or simply make GNU Parallel build the command lines from a template: `seq 60000 | parallel 'curl http://internet.address/csvs/file{}.csv.gz | gsutil cp - gs://my_bucket/file{}.csv.gz'` – Ole Tange Oct 02 '20 at 10:15

If you have the MD5 hashes of each CSV file, you could use the Storage Transfer Service, which supports copying a list of files (that must be publicly accessible via HTTP[S] URLs) to your desired GCS bucket. See the Transfer Service docs on URL lists.
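
For reference, a URL list is a tab-separated file whose first line is the format header and whose subsequent lines hold each file's URL, size in bytes, and base64-encoded MD5; the sizes and hashes below are placeholders, not real values:

TsvHttpData-1.0
http://internet.address/csvs/file1.csv.gz	1048576	PLACEHOLDERBASE64MD5ONE=
http://internet.address/csvs/file2.csv.gz	2097152	PLACEHOLDERBASE64MD5TWO=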

mhouglum