What happens when you type a URL into your browser?
This is something a lot of us do every day. We sit down at our desktop or laptop, type something into the browser, and expect to see the landing page we’ve grown accustomed to seeing from Instagram or Google. It’s quite a complex process that makes this all happen, and it’s really incredible that we can do something as simple as type an English word into the browser and have the computer understand what we want and how to get us there. We’ll use the example of typing in https://www.holbertonschool.com as a reference as we dive into what’s happening behind the scenes.
Let’s split up the phrase we type into our browser. This phrase is known as the URL: Uniform Resource Locator. A URL is one type of Uniform Resource Identifier (URI), which is the general name for anything that identifies a resource on the World Wide Web. It starts with https. You may or may not type this part in, right? You can just type ‘google.com’ and, once you get to the page, you’ll see the browser has filled in the implicit information that came with that domain name. I’ll go into how that works later. For now, whether you type it in or not, when you go to holbertonschool.com, as in our example, the address will be completed with either https:// or http://.
HTTP
HTTP stands for HyperText Transfer Protocol. HTTP defines how messages should be sent and received, and how web servers and browsers should respond to commands. Think of HTTP as a set of rules the web server, the browser and the World Wide Web live by to communicate with each other. This first part of the URL is called the protocol identifier (also known as the scheme). In our case, we’re actually using https, but we’ll talk about the s later.
://
The colon slash slash separates the protocol identifier from the rest of the URL, which names the resource being accessed. It’s mostly just a syntactical choice the initial designers made. A web address begins with a protocol identifier (e.g. HTTP or HTTPS), followed by a colon and two forward slashes, and then the resource name. This is the format browsers and servers expect.
www
This part of the URL is known as a subdomain. You might’ve at some point noticed strange behavior with subdomains. For some websites you can type in example.com and it works fine. Other times, you may need to specify www.example.com or info.example.com. Subdomains can change which site you end up on, depending on how the website is configured. For example, west.example.com and hello.example.com would both be subdomains of the example.com domain. Depending on how the website is configured, there can be lots of subdomains that all redirect to the main page. www is the most common subdomain.
holbertonschool
This is the domain name. Together with the subdomain and the TLD, it forms the host name that gets converted to an IP address.
.com
Com is what is known as a TLD, or top-level domain. .com is the largest TLD, but there’s also .org, .edu, .net, etc. The TLD tells the Domain Name System (DNS) where to look for the IP if it doesn’t already know it. One of the places the DNS will look is the TLD servers, which point it toward where the IP can be found.
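To make the pieces concrete, here’s a minimal sketch that pulls our example URL apart with Python’s standard urllib.parse module. Splitting the host name on dots is my own simplification for this example; it works for www.holbertonschool.com but would not be right for every domain (think of endings like .co.uk).

```python
from urllib.parse import urlsplit

url = "https://www.holbertonschool.com"
parts = urlsplit(url)

scheme = parts.scheme      # the protocol identifier, before "://"
hostname = parts.hostname  # everything between "://" and the path

# Naive split on dots: fine for this three-label example only.
subdomain, domain, tld = hostname.split(".")

print(scheme)     # https
print(subdomain)  # www
print(domain)     # holbertonschool
print(tld)        # com
```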
Now that we know what each individual piece is, let’s dive into a little more detail about how this actually works.
When you type https://www.holbertonschool.com into your browser and hit Enter, the first thing that happens is the browser and the operating system (OS) check whether they already know the IP. The IP is a string of digits that serves as a unique ID for every machine on the network. If neither has the IP in its cache, they ask the Domain Name System. The Domain Name System, or DNS, translates names (holbertonschool.com) into numbers.
Resolving Name Server
The DNS utilizes a tool known as the Resolving Name Server (RNS) to look up the IP of holbertonschool.com. If the RNS doesn’t already know it, it has a series of places it will check until it finds it. First it goes to the root servers. There’s an invisible dot at the end of every domain name. Really, the name looks like this, trailing dot included: www.holbertonschool.com. Did you catch that dot at the end? That’s the root. The root servers don’t know the IP themselves, but they know where the Top Level Domain servers are and point the RNS to them. This is the .com we were talking about. The TLD servers keep track of the domains under .com, .net, etc. They don’t hold the IP either, so they point the RNS to the Authoritative Name Servers (ANS) for holbertonschool.com, which the TLD servers know about through the Domain Name Registrar. This is the furthest down the road you can get to find the IP address, so at this point the ANS (hopefully) has the IP and returns it to the RNS, which returns the IP to the browser. The browser will cache this IP so that next time it has it immediately on hand.
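Here’s a toy sketch of that lookup chain, with plain dictionaries standing in for the root, TLD and authoritative servers. Everything in it is made up for illustration: the server names are hypothetical and the IP comes from a range reserved for documentation, not the real address of holbertonschool.com. No network traffic is involved.

```python
# Hypothetical stand-ins for the three levels of name servers.
ROOT = {"com": "tld-com"}                                # root knows the TLD servers
TLD = {"tld-com": {"holbertonschool.com": "ns-holberton"}}  # TLD knows the authoritative servers
AUTH = {"ns-holberton": {"www.holbertonschool.com": "203.0.113.10"}}  # made-up IP

cache = {}  # what the resolver remembers between lookups


def resolve(hostname):
    """Return (ip, steps) where steps lists the places that were asked."""
    name = hostname.rstrip(".")  # drop the trailing root dot if present
    if name in cache:
        return cache[name], ["cache"]
    labels = name.split(".")
    tld_server = ROOT[labels[-1]]                         # ask the root
    name_server = TLD[tld_server][".".join(labels[-2:])]  # ask the TLD servers
    ip = AUTH[name_server][name]                          # ask the authoritative server
    cache[name] = ip
    return ip, ["root", "tld", "authoritative"]


first = resolve("www.holbertonschool.com.")
second = resolve("www.holbertonschool.com")
print(first)
print(second)
```

The second call never leaves the cache, which is the same shortcut the browser takes. In a real program you would not walk the chain yourself; a call like socket.getaddrinfo hands the whole job to the OS and its resolver.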
HTTP Request
When the IP is passed back to the browser, it will build an HTTP (HyperText Transfer Protocol) request. This is when I should probably mention that little ‘s’ at the end of https in our example of holbertonschool.com.
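For a feel of what such a request looks like, here’s a minimal sketch that builds the text of a simple GET request. The helper name build_get_request is my own, and a real browser sends many more headers than this; the sketch only shows the basic shape of the message.

```python
def build_get_request(host, path="/"):
    """Build the text of a bare-bones HTTP/1.1 GET request (hypothetical helper)."""
    lines = [
        f"GET {path} HTTP/1.1",  # method, resource and protocol version
        f"Host: {host}",         # which site we want on that IP
        "Connection: close",     # ask the server to close when done
        "",                      # blank line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines)    # HTTP separates lines with CRLF


request = build_get_request("www.holbertonschool.com")
print(request)
```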
What is HTTPS?
That little s at the end of HTTP is pretty important. It signifies that it’s a secure version of HTTP, ‘s’ standing for Secure. HTTPS means that all communications between your browser and the website are encrypted. It’s often used to protect highly confidential online transactions like online banking and online shopping order forms.
SSL
HTTPS typically uses SSL or TLS, which are protocols for encrypting the connection (TLS is the modern successor to SSL). The website will send its SSL certificate to your browser. The certificate contains the public key needed to begin the secure session, which is set up in a ‘handshake’ between the browser and the website. When a trusted SSL certificate is used during HTTPS, you’ll typically see a padlock icon in the URL bar.
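Here’s a small sketch of the client side of that handshake using Python’s standard ssl module. The function fetch_certificate_subject is a name I made up for this example, and it is defined but not called, because calling it needs a live network connection. The part that does run just shows that the default settings insist on a trusted certificate that matches the host name.

```python
import socket
import ssl

# Default client settings: verify the certificate against trusted authorities.
context = ssl.create_default_context()


def fetch_certificate_subject(host, port=443):
    """Do the TLS handshake with host and return the certificate's subject.

    Hypothetical helper; it needs network access, so it is not called here.
    """
    with socket.create_connection((host, port), timeout=5) as raw_sock:
        # The handshake happens inside wrap_socket.
        with context.wrap_socket(raw_sock, server_hostname=host) as tls_sock:
            return tls_sock.getpeercert().get("subject")


print(context.verify_mode == ssl.CERT_REQUIRED)  # a certificate is mandatory
print(context.check_hostname)                    # and it must match the host name
```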
Why HTTPS?
The reason the secure version of HTTP is so important is that HTTP communications are all sent in plain text. This means if you enter your credit card information, for example, into an unsecured website, your credit card number, expiration date, name and security code are all readable by any hacker who breaks into the connection between the browser and the website. If the website uses an SSL certificate, all of your information is encrypted, so even if a hacker broke into the connection, they wouldn’t be able to read any of the data.
So we left off with the IP being returned to the browser. Now that the browser has the IP, it opens a connection using the Transmission Control Protocol (TCP)/Internet Protocol (IP).
TCP/IP
TCP/IP is the suite of communication protocols that computers on the Internet, and most other networks, use to communicate. It specifies how data is exchanged over the internet, including how it should be broken into packets, addressed, transmitted, routed, and received at the destination. It’s designed to make networks reliable, and it can recover automatically from the failure of a device on the network. To break the two up slightly: TCP defines how applications can create channels of communication across a network, and IP defines how to address and route each packet to make sure it reaches the right destination.
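To picture the "broken into packets" part, here’s a toy sketch that chops a message into numbered chunks, delivers them out of order, and puts them back together. It is only an illustration of the idea: real TCP numbers bytes rather than chunks and also handles acknowledgements and retransmission, none of which is modeled here. The function names are my own.

```python
def to_packets(data, size):
    """Split data into (sequence_number, chunk) pairs of at most size bytes."""
    return [
        (seq, data[offset:offset + size])
        for seq, offset in enumerate(range(0, len(data), size))
    ]


def reassemble(packets):
    """Put the chunks back in order using their sequence numbers."""
    ordered = sorted(packets, key=lambda packet: packet[0])
    return b"".join(chunk for _, chunk in ordered)


message = b"GET / HTTP/1.1"
packets = to_packets(message, 4)
arrived = list(reversed(packets))  # pretend the network delivered them backwards
restored = reassemble(arrived)

print(packets)
print(restored == message)
```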
Server
We’ve gotten this far without talking about servers. Nothing on the World Wide Web would work without servers. So what is a server? A server is a computer, typically without a keyboard, mouse or screen, that is accessible only over the network. It can be physical or virtual, it’s configured and managed remotely, and it runs an OS. A server is composed of multiple pieces, including an application server, a web server, a database, HTML and source code. The server uses the web server and source code to communicate with the computer of the user requesting the website. Let’s dive into what each of these does.
Application Server
An application server’s job is to handle all application operations between the users and the backend. It serves dynamic content: it runs code written in programming languages such as Ruby, Python, or PHP and generates HTML as output.
Web Server
A web server’s job is to serve web pages. It serves HTML to the browser, whether that content is static or dynamic. Web servers can’t run languages such as Ruby, Python or PHP like the application server can, so for dynamic content they hand the request to the application server and serve back the HTML it generates. Static content like images, HTML and CSS they can serve on their own.
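Here’s a toy sketch of that division of labor. The paths, file contents and function names are all invented for the example; the point is only that static files are answered directly while everything else is handed to application code that generates HTML.

```python
# Made-up static files the web server can hand out as they are.
STATIC_FILES = {
    "/style.css": b"body { margin: 0 }",
    "/index.html": b"<html><body>Welcome</body></html>",
}


def application_server(path):
    """Stand-in for Ruby/Python/PHP code that generates HTML on the fly."""
    return f"<html><body>Hello from {path}</body></html>".encode()


def web_server(path):
    """Serve static content directly, pass everything else along."""
    if path in STATIC_FILES:
        return "static", STATIC_FILES[path]
    return "dynamic", application_server(path)


print(web_server("/style.css"))
print(web_server("/mentors"))
```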
Database
The database is a structured set of data held in a computer. It’s organized to be easily accessed, managed and updated. In the case of holbertonschool.com, a database might store the images of mentors, instructors and the school.
All servers are configured differently, but there are some problems with having just one server. What if something happens to your database? What would happen if Facebook had just one server? For one, they would very quickly run out of storage space. Secondly, what if that server went down? Without a backup, this would affect millions of people. For these reasons, and many more, you want to have multiple servers. Bigger companies run hundreds and hundreds of them. One common way of managing multiple servers is using what is called a load balancer.
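As a small illustration, here’s a sketch using Python’s built-in sqlite3 module with an in-memory database. The table, the names and the file paths are all hypothetical, and storing a path to each photo rather than the image itself is my own assumption for the example.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mentors (id INTEGER PRIMARY KEY, name TEXT, photo_path TEXT)"
)

# Made-up rows for illustration.
conn.executemany(
    "INSERT INTO mentors (name, photo_path) VALUES (?, ?)",
    [("Ada", "/img/ada.png"), ("Linus", "/img/linus.png")],
)

# Structured data is easy to query back out.
rows = conn.execute(
    "SELECT name, photo_path FROM mentors ORDER BY name"
).fetchall()
print(rows)
```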
Load Balancer
A load balancer distributes the workload to multiple individual systems to reduce the amount of stress on any individual system. This increases the reliability, efficiency, and availability of your application. A load balancer can split traffic between servers, and you can even have multiple load balancers all balancing multiple servers so no single node gets overloaded.
There are multiple distribution algorithms for managing traffic between servers. One of them is Round Robin, which passes each new connection to the next server in line, eventually distributing connections evenly across the machines being load balanced. Another algorithm is Weighted Round Robin, which gives each machine a weight and sends it a share of connections proportional to that weight, so more capable machines receive more connections over time.
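Both algorithms fit in a few lines. This is a minimal sketch, assuming fixed weights chosen up front; the server names and weights are made up, and real load balancers usually interleave the weighted picks more smoothly than this simple repetition does.

```python
from itertools import cycle


def round_robin(servers):
    """Yield servers one after another, forever."""
    return cycle(servers)


def weighted_round_robin(weights):
    """Yield each server as many times per round as its weight says."""
    expanded = [
        server
        for server, weight in weights.items()
        for _ in range(weight)
    ]
    return cycle(expanded)


rr = round_robin(["web-01", "web-02", "web-03"])
rr_picks = [next(rr) for _ in range(6)]
print(rr_picks)

# web-01 is assumed to be three times as capable as web-02.
wrr = weighted_round_robin({"web-01": 3, "web-02": 1})
wrr_picks = [next(wrr) for _ in range(8)]
print(wrr_picks)
```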
Load Balancer Configuration
One way to set up a load balancer is Active-Active; another is Active-Passive. In Active-Active, both nodes can service an application at any given time. In Active-Passive, one node is on standby, ready to take over as backup if the primary node is unable to serve. No matter which way the load balancer is configured, there should be two nodes that are virtually identical. If you’re wondering why they should be identical, it’s because if one node goes down and the other isn’t the same, you risk losing data and the content and configuration that were otherwise present.
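Here’s a toy sketch of the Active-Passive idea. The Node class, the node names and the way a failure is signaled are all invented for the example; real setups detect failure with health checks rather than by catching an error on a live request.

```python
class Node:
    """A hypothetical node that can be healthy or down."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def active_passive(primary, standby, request):
    """Use the primary; fall back to the standby only if the primary fails."""
    try:
        return primary.handle(request)
    except ConnectionError:
        return standby.handle(request)


primary, standby = Node("lb-01"), Node("lb-02")
before = active_passive(primary, standby, "GET /")

primary.healthy = False  # simulate the primary going down
after = active_passive(primary, standby, "GET /")

print(before)
print(after)
```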
Database Configuration
An option for database configuration is Primary-Replica (Master-Slave). The way this works is that one device has unidirectional control over one or more other devices. The ‘master’ logs the updates, which then ripple through to the ‘slaves’. Each slave outputs a message saying it received the update successfully, which allows the update to be sent on to the next server until it has gone through all the ‘slave’ databases.
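A rough sketch of that flow, assuming the primary pushes each logged update to its replicas one at a time and collects an acknowledgement from each before moving on. The class and method names are my own, and real databases replicate in more sophisticated ways.

```python
class Replica:
    """A read-only copy that applies updates sent by the primary."""

    def __init__(self):
        self.data = {}

    def apply(self, entry):
        key, value = entry
        self.data[key] = value
        return True  # acknowledgement: update received successfully


class Primary:
    """The only node that accepts writes."""

    def __init__(self, replicas):
        self.data = {}
        self.log = []  # ordered record of every update
        self.replicas = replicas

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))
        # Send the update to each replica in turn and collect the acks.
        return [replica.apply((key, value)) for replica in self.replicas]


replicas = [Replica(), Replica()]
primary_db = Primary(replicas)
acks = primary_db.write("mentor:1", "Ada")

print(acks)
print(all(replica.data == primary_db.data for replica in replicas))
```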
Have you heard of a firewall? This is an additional element you can add to your server configuration for an extra layer of security.
Firewalls
The reason to add a firewall is to prevent unauthorized users from accessing private networks connected to the Internet. All messages that pass through the firewall are examined, and it blocks any messages that don’t meet the specified security criteria. Firewalls can be software or hardware, but ideally you use a combination of both.
Along with firewalls, monitoring tools can be added to your server configuration in order to monitor the data being collected and passed through your server. They can be customized to give you alerts or display information specific to your needs.
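To show what "specified security criteria" can look like, here’s a toy rule checker built on Python’s standard ipaddress module. The rules themselves are invented for the example, and the addresses come from ranges reserved for documentation. The first matching rule wins, and anything that matches no rule is blocked.

```python
from ipaddress import ip_address, ip_network

# (action, source network, port) where a port of None means "any port".
# Hypothetical rules: block one network outright, allow web traffic from anyone.
RULES = [
    ("deny", ip_network("198.51.100.0/24"), None),
    ("allow", ip_network("0.0.0.0/0"), 443),
    ("allow", ip_network("0.0.0.0/0"), 80),
]


def check(source_ip, port):
    """Return the action of the first rule that matches, else deny."""
    addr = ip_address(source_ip)
    for action, network, rule_port in RULES:
        if addr in network and rule_port in (None, port):
            return action
    return "deny"  # default: block whatever no rule explicitly allows


print(check("203.0.113.5", 443))   # HTTPS from an ordinary address
print(check("198.51.100.7", 443))  # HTTPS from the blocked network
print(check("203.0.113.5", 22))    # a port no rule allows
```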
After the IP is sent back…
So, back to our original question: what happens after the DNS finds the IP of holbertonschool.com and it’s sent back to the browser? From here, the browser opens a connection to that IP over TCP/IP, validates the SSL certificate, completes the handshake with the website so the connection is secure and encrypted, and sends its request, which arrives at the load balancer (assuming there is one). The load balancer will then decide which server to send the request to, depending on the algorithm it’s configured with. The request gets passed down to a server, and the server will utilize the application server (depending on whether it’s serving static or dynamic content, which in our case is dynamic) to run the source code (Python, PHP, etc.), which grabs information from the database and generates HTML. The application server generates this HTML page from everything it gathered, including static and dynamic content, and passes it to the web server, which serves the page to your browser.
There are a lot of steps and pieces involved in serving web page content. It’s pretty amazing that all we have to do as the user is type in https://www.holbertonschool.com and press Enter. I hope this was helpful in beginning to understand the complexity of the World Wide Web and how much we take URLs for granted.
See you next time! Happy coding!