Downloading multiple files in Node.js

Recently I had to download 70+ files from web URLs that I only had as raw text strings. There was no way I’d copy-paste the URLs 70+ times into the browser address bar. How could I automate this simple task?

Background

Working on an internal presentation, I found this deck about the Atlassian Design team via Slideshare. Lucky me, I thought, that there was even a big Download Now button with a kind note saying "download to read offline." Unfortunately, albeit predictably, it didn’t allow me to download unless I signed up for a subscription.

While they did offer a 30-day free trial, I was required to provide credit card info up front to activate it. I know they’re running a business and need to pay their server bills and whatnot, but I didn’t like the idea of giving away my credit card info (or a PayPal payment) just to download a reference document, knowing that I’d probably never use the service on a regular basis.

Process

1. Look at the markup and determine the file source

Fine, I won’t be able to download the source PDF or PPT without paying. Then what about the in-browser preview images? Wouldn’t there be standard image files for the web?

Inspecting via dev tools, I found that the HTML markup for the slides was pretty straightforward.

html
<!-- Removed irrelevant data attributes and other details for brevity -->

<div id="slide-container">
  <div class="slide" id="slide-0" data-index="0">...</div>
  <div class="slide" id="slide-1" data-index="1">...</div>
  <div class="slide" id="slide-2" data-index="2">...</div>
  <div class="slide" id="slide-3" data-index="3">...</div>
  <div class="slide current" id="slide-4" data-index="4">
    <picture>
      <source srcset="https://path.to/file-title-5-size.jpg 2048w" ... />
      <img
        class="slide-image"
        src="https://path.to/file-title-5-size.jpg"
        ...
      />
    </picture>
  </div>
  <div class="slide" id="slide-5" data-index="5">...</div>
  <div class="slide" id="slide-6" data-index="6">...</div>
  <!-- this goes on until data-index="70" -->
</div>

Inside slide-container is a series of slides, each with a data-index indicating the slide number and each containing a picture element pointing to a corresponding CDN address. The image file naming convention also looked very clear: a hyphen-separated slide-title-##-#### format, in which the last two parts of the string indicate the slide number and the image width.

To test, I manually copy-pasted a few image file URLs for different slides and confirmed that I could indeed access these individual slides as JPG assets.
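To make the naming convention concrete, here is a minimal sketch that splits an image URL into its parts. The helper name parseSlideUrl and the regular expression are illustrative assumptions based on the file names shown above, not something taken from the site itself.

```javascript
// Hypothetical helper: split a slide image URL into the parts described above.
// Assumes the file name ends in "-<slide number>-<width>.jpg".
const parseSlideUrl = (url) => {
  const filename = new URL(url).pathname.split("/").pop();
  const match = filename.match(/^(.+)-(\d+)-(\d+)\.jpg$/);
  if (!match) return null; // does not follow the assumed convention
  const [, title, slide, width] = match;
  return { title, slide: Number(slide), width: Number(width) };
};

console.log(parseSlideUrl("https://path.to/file-title-5-2048.jpg"));
```

A URL that does not match the assumed pattern returns null, which makes it easy to spot if the convention turns out to be less regular than it looks.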

2. Try downloading directly inside a browser

To automate around these assets, I needed the full list. Now that I knew the naming convention, this part was fairly easy. I started with a plain for loop in the browser console.

javascript
for (let i = 1; i < 72; i++) {
  console.log(`https://path.to/file-title-${i}-2048.jpg`);
}

console
https://path.to/file-title-1-2048.jpg
https://path.to/file-title-2-2048.jpg
https://path.to/file-title-3-2048.jpg
https://path.to/file-title-4-2048.jpg
...

The snippet above sort of works. The URLs logged in the console are directly clickable, and each opens a new tab with the target image in it. There were huge pain points, though:

  1. I'd need to click the links 71 times to open each file.
  2. I'd need to hit Cmd+S 71 times to save them individually.
  3. Chrome automatically saved the images in WebP format, but I needed JPG.

Pathetically ignorant of the Node environment, I would rather have stuck to the browser as much as possible, but it now felt inevitable that I would have to do this outside the browser.

3. Use core Node.js modules

For a throwaway task like this, I didn't want any dependencies, so the obvious first step was to create a plain JavaScript file.

bash
touch download_slides.js

Of the several resources that Google pointed me to, this article and this blog piece were the most succinct and practical for my needs.

I started by loading the modules.

javascript
const fs = require("fs");
const https = require("https");

Then I generated the URL strings and stored them in a variable.

javascript
// 71 slides, numbered from 1 in the file names
const files = new Array(71)
  .fill("")
  .map((item, index) => `https://path.to/file-title-${index + 1}-2048.jpg`);

Next, I created a function that wraps a single HTTPS request in a Promise. Each response is saved locally via the fs.createWriteStream() method.

javascript
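const download = (url, destPath) => {
  return new Promise((resolve, reject) => {
    https
      .get(url, (res) => {
        // Reject on non-200 responses instead of saving an error page
        if (res.statusCode !== 200) {
          res.resume(); // discard the body to free up the socket
          return reject(new Error(`${url} responded with ${res.statusCode}`));
        }
        const file = fs.createWriteStream(destPath);
        res.pipe(file);
        // Resolve only once the file has been fully written
        file.on("finish", () => resolve(true));
        file.on("error", reject);
      })
      .on("error", reject); // network-level failures
  });
};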

With that function in place, I created an array of Promises covering the HTTPS requests for all assets.

javascript
const createDownloadRequests = (urls) => {
  const requests = [];
  for (const url of urls) {
    // Use the last segment of the URL path as the local file name
    const urlObj = new URL(url);
    const parts = urlObj.pathname.split("/");
    const filename = parts[parts.length - 1];
    requests.push(download(url, filename));
  }
  return requests;
};
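As an aside, the file name lookup in the loop above could also lean on the core path module. The sketch below is one possible variant, not what the script uses. The helper name filenameFromUrl and the fallback name "download.bin" are made up for illustration.

```javascript
const path = require("path");

// Hypothetical helper: derive a local file name from a URL.
// Falls back to a made-up default when the URL path has no file component.
const filenameFromUrl = (url, fallback = "download.bin") => {
  // pathname excludes the query string and hash
  const { pathname } = new URL(url);
  // path.posix keeps "/" as the separator regardless of the OS
  const name = path.posix.basename(pathname);
  return name || fallback;
};

console.log(filenameFromUrl("https://path.to/file-title-1-2048.jpg?cache=1"));
console.log(filenameFromUrl("https://path.to/"));
```

Because the name is taken from pathname, a query string on the URL does not end up in the saved file name.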

As a final step, I used Promise.all() to carry out all downloads.

javascript
(async () => {
  try {
    const requests = createDownloadRequests(files);
    // Wait for every download; rejects as soon as any one of them fails
    await Promise.all(requests);
  } catch (err) {
    console.error(err);
  }
})();
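One thing to note: every request is already in flight by the time Promise.all() is called, so all 71 downloads start at once. That was fine here, but for longer lists it may be kinder to the server to cap concurrency. Below is a minimal sketch of one way to do that with a small worker pool. The name runWithLimit is hypothetical, and the tasks are stand-ins so the example runs without network access.

```javascript
// Run async tasks with at most `limit` of them in flight at a time.
// `tasks` is an array of functions that each return a Promise.
const runWithLimit = async (tasks, limit) => {
  const results = new Array(tasks.length);
  let next = 0;
  const worker = async () => {
    while (next < tasks.length) {
      const index = next++; // claim the next unstarted task
      results[index] = await tasks[index]();
    }
  };
  const workers = Array.from({ length: Math.min(limit, tasks.length) }, worker);
  await Promise.all(workers);
  return results; // same order as `tasks`
};

// Stand-in tasks; in the script each one would be
// something like () => download(url, filename).
const fakeTasks = [1, 2, 3, 4, 5].map(
  (n) => () => new Promise((resolve) => setTimeout(() => resolve(n * 2), 10))
);
runWithLimit(fakeTasks, 2).then((results) => console.log(results));
```

The key difference from the script is that tasks are passed as functions rather than as already-started Promises, so a request only begins when a worker picks it up.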

The full code looks like this:

javascript
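const fs = require("fs");
const https = require("https");

// Build the 71 slide image URLs from the naming convention
const files = new Array(71)
  .fill("")
  .map((item, index) => `https://path.to/file-title-${index + 1}-2048.jpg`);

// Download a single URL and save it to destPath
const download = (url, destPath) => {
  return new Promise((resolve, reject) => {
    https
      .get(url, (res) => {
        // Reject on non-200 responses instead of saving an error page
        if (res.statusCode !== 200) {
          res.resume(); // discard the body to free up the socket
          return reject(new Error(`${url} responded with ${res.statusCode}`));
        }
        const file = fs.createWriteStream(destPath);
        res.pipe(file);
        // Resolve only once the file has been fully written
        file.on("finish", () => resolve(true));
        file.on("error", reject);
      })
      .on("error", reject); // network-level failures
  });
};

// Create one download Promise per URL, named after the last path segment
const createDownloadRequests = (urls) => {
  const requests = [];
  for (const url of urls) {
    const urlObj = new URL(url);
    const parts = urlObj.pathname.split("/");
    const filename = parts[parts.length - 1];
    requests.push(download(url, filename));
  }
  return requests;
};

(async () => {
  try {
    const requests = createDownloadRequests(files);
    await Promise.all(requests);
  } catch (err) {
    console.error(err);
  }
})();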

4. Run the script

bash
node download_slides.js

This was the most obvious part. Running the script brought down all 71 JPGs I wanted. I ended up referencing and using only two of those slides. You might rightfully say I could’ve simply downloaded those two files rather than going through this hassle. But, hey, I learned a little something about Node in the process. 🤷‍♂️

What I’d do differently

The way it’s coded, this script is fairly limited in functionality and overly specific to this one use case.

  • It does not provide any meaningful feedback. Like:
    • what the status of each file is
    • which asset is currently being downloaded
    • whether the whole download process is completed
    • whether there were any specific errors, etc.
  • The asset URL strings are mostly hard-coded and cannot be reused for other contexts. Also it requires that I make sense of the file naming convention before running this script.
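
As a rough idea of what the missing feedback could look like, here is a minimal sketch built on Promise.allSettled(), which waits for every request instead of stopping at the first failure. The function name summarize and the demo inputs are hypothetical. It covers per-file status, errors, and overall completion, but not live progress.

```javascript
// Report the outcome of each download and return the URLs that failed.
// `urls` and `requests` are assumed to be in the same order.
const summarize = async (urls, requests) => {
  const outcomes = await Promise.allSettled(requests);
  const failed = [];
  outcomes.forEach((outcome, i) => {
    if (outcome.status === "fulfilled") {
      console.log(`done:   ${urls[i]}`);
    } else {
      const reason = outcome.reason?.message ?? outcome.reason;
      console.error(`failed: ${urls[i]} (${reason})`);
      failed.push(urls[i]);
    }
  });
  console.log(`${urls.length - failed.length}/${urls.length} files downloaded`);
  return failed;
};

// Stand-in Promises so the sketch runs without network access
const demoUrls = ["https://path.to/a.jpg", "https://path.to/b.jpg"];
const demoRequests = [Promise.resolve(true), Promise.reject(new Error("HTTP 404"))];
summarize(demoUrls, demoRequests);
```

Returning the failed URLs leaves room for a retry pass, which Promise.all() alone would not make easy.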

If I were to face similar needs in the future, I might add an explicit feedback UI and try to automate harvesting the target file names. Oh, and write it in TypeScript. Perhaps in Deno.

In any case, the presentation went well and I can come back to this blog post when/if I need to.

© 2023 by Bumhan Yu