
I'm using the pytube library to download videos. Locally, I pointed the `output_path` argument of pytube's `download()` function at the directory where I wanted the videos saved. However, now that I'm using an AWS Free Tier EC2 instance, which comes with almost no storage, I don't know how to download videos directly to an S3 bucket. Providing the S3 URI as the output path does not work, because pytube treats it as a location on the local file system.

I tried mounting the S3 bucket and setting the `output_path` variable to the mount location, but it still did not download the videos. The script executed without any errors, yet no videos appeared.

Is there a way to download YouTube videos with pytube directly to an S3 bucket? Any solution is much appreciated!

Past code (saves files locally):

import os
import re
import unicodedata

from pytube import YouTube

# Strip characters that are not allowed in NTFS filenames
def safe_filename(s: str, max_length: int = 255) -> str:
    # Characters in range 0x00-0x1F are not allowed in NTFS filenames
    ntfs_characters = [chr(i) for i in range(0, 32)]
    characters = [
        r'"', r"\#", r"\$", r"\%", r"'", r"\*", r"\,", r"\.", r"\/", r"\:",
        r"\;", r"\<", r"\>", r"\?", r"\\", r"\^", r"\|", r"\~", r"\\\\",
        r"\(", r"\)",
    ]
    pattern = "|".join(ntfs_characters + characters)
    regex = re.compile(pattern, re.UNICODE)
    filename = regex.sub("", s)
    return filename[:max_length].rsplit(" ", 0)[0]

# Transliterate accented characters to plain ASCII
def simplify(text):
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    # Decompose accents, then drop anything that is not ASCII
    text = unicodedata.normalize("NFD", text).encode("ascii", "ignore").decode("utf-8")
    return text

def downloader(lnks, threadlabel):
    failed = []  # holds links that failed so we can retry them later
    for link in lnks:
        try:
            yt = YouTube(link)
            #print(yt.title)
            name = safe_filename(yt.title).strip()
            name = simplify(name.encode("utf-8"))
            name = re.sub(r"[^A-Za-z0-9]+", "", name)  # keep only alphanumerics
            print(name)
            out_dir = 'Videos'  # the existence check and the download must use the same directory
            os.makedirs(out_dir, exist_ok=True)
            if name + '.mp4' not in os.listdir(out_dir):
                print(yt.streams.order_by('resolution').desc())
                print('----------------------------------------------------------------------------------------')
                stream = yt.streams.filter(progressive=True, file_extension='mp4', type='video').order_by('resolution').desc().first()
                print(stream)
                stream.download(output_path=out_dir, filename=name)
            else:
                print(name, 'already downloaded')
        except Exception as exc:
            print('----------------------', link, 'not downloaded:', exc, '----------------------')
            failed.append(link)
            continue  # keep going so one bad link does not abort the whole batch
    if failed:
        print(failed)
        with open('failed_' + threadlabel + '.txt', 'w') as f:
            f.write('\n'.join(failed))

downloader(['https://www.youtube.com/watch?v=Pk4tE6Jfakg'], 'test')

Code with the output path set to an S3 URI (fails, because pytube treats it as a local path):

yt.streams.filter(progressive=True, file_extension='mp4', type="video").order_by('resolution').desc().first().download(output_path=S3_URI, filename= name)

Code with the output path set to the mounted S3 location (runs without errors, but no file appears):

yt.streams.filter(progressive=True, file_extension='mp4', type="video").order_by('resolution').desc().first().download(output_path='mnt/bucket/folder', filename= name)
abalmumcu
  • Why do you think an AWS Free Tier EC2 instance has no storage? – jarmod May 25 '21 at 17:32
  • Does it work if you download to the local disk on the instance (eg the ec2-user home directory, or even /tmp)? – John Rotenstein May 26 '21 at 00:07
  • @jarmod, it has 8 GB of storage, and I have 100 GB of data to download. My script downloads multiple videos in parallel and needs somewhere to store them. – user16028265 May 26 '21 at 05:04
  • @JohnRotenstein Yes, it works if I download to home, /tmp, or any other location on the instance. – user16028265 May 26 '21 at 05:05
  • If the instance doesn’t have enough storage by default then you can add an EBS volume though it may exceed your Free Tier. Alternatively, investigate if your download library can stream the data to you in which case you can use boto3 to stream it into S3, or see if your download library can work with a PUT url as the target instead of local file system then you may be able to use a pre-signed PUT URL for the target S3 object. – jarmod May 26 '21 at 11:39
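The streaming idea in the last comment can be sketched roughly as follows. This is an untested sketch, not a confirmed solution: it assumes your pytube version provides `Stream.stream_to_buffer`, it holds the entire video in memory (so it only suits files that fit in RAM), and the function name `stream_link_to_s3` is hypothetical.

```python
def stream_link_to_s3(link: str, bucket: str, key: str) -> None:
    """Download a YouTube stream into memory and upload it straight to S3,
    never touching the local disk."""
    import io
    import boto3                # AWS SDK for Python
    from pytube import YouTube

    yt = YouTube(link)
    stream = (yt.streams
                .filter(progressive=True, file_extension="mp4", type="video")
                .order_by("resolution").desc().first())
    buf = io.BytesIO()
    stream.stream_to_buffer(buf)   # assumption: available in recent pytube versions
    buf.seek(0)
    boto3.client("s3").upload_fileobj(buf, bucket, key)
```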

1 Answer


I recommend that you:

  1. Use your existing (working) code to download the file to the local disk, then
  2. Upload the file to Amazon S3 using the boto3 library (the AWS SDK for Python)

Even if you mount S3 as a disk (which is generally not a good idea) or use a Python library that presents S3 objects as files, the data still has to be downloaded to the instance and then uploaded to S3. It is simpler and more reliable to do this in two explicit steps than to rely on complex utilities or libraries that mimic S3 as a local storage device.
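The two steps above can be sketched as follows. The bucket name, key layout, and `/tmp` staging directory are assumptions to replace with your own values; `boto3` must be installed and AWS credentials configured on the instance.

```python
import os

BUCKET = "my-video-bucket"          # assumption: your bucket name

def s3_key_for(name: str, prefix: str = "videos") -> str:
    """Build the S3 object key for a downloaded video."""
    return f"{prefix}/{name}.mp4"

def upload_and_remove(local_path: str, bucket: str, key: str) -> None:
    """Upload a finished download to S3, then delete the local copy
    so the small EC2 root volume does not fill up."""
    import boto3                    # AWS SDK for Python
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key)
    os.remove(local_path)

# In the question's download loop, after pytube writes the file:
#   stream.download(output_path="/tmp", filename=name)
#   upload_and_remove(os.path.join("/tmp", name + ".mp4"),
#                     BUCKET, s3_key_for(name))
```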

John Rotenstein