Xtreem Geek

Geeking around The Web

Full Reading : The Akamai Network

Posted by shiva on September 17, 2011

This is a summary of what I got out of this paper on the Akamai network.

Overview

  • Akamai pioneered the concept of CDN (content delivery network).
  • 61,000 servers in 1000 networks across 70 countries.
  • Akamai delivers around 15% to 20% of all web traffic worldwide.
  • Beyond content delivery, Akamai also provides web and IP acceleration, EdgeComputing, delivery of live/on-demand HD media, high availability storage, analytics and authoritative DNS services.

Internet Delivery Challenges

  • Peering points, where internet traffic exchange occurs, lack sophistication due to lack of investment.  This makes peering points as bottlenecks causing packet loss and latency.
  • Border Gateway Protocol (BGP) is inefficient as it mainly relies upon hop count to decide routes without considering topologies, latencies and congestion.  Also, BGP is slow to react to outages.
  • Networks are unreliable.  Cable cuts, DDoS attacks, misconfigured routers, power outages, natural disasters etc. contribute to this.
  • Inefficient communication protocols.  TCP, designed for reliability and congestion-avoidance,  carries significant overhead and performs suboptimally for routes with high latency or packet loss.
  • Scalability of internet applications requires installing enough server capacity to handle peak traffic which would sit underutilized for most of the time.  In addition to origin scalability, there should be adequate bandwidth available at all points between the origin and the user in order to achieve a good user experience.  This is a serious problem for internet video.
  • Application limitations and slow rate of change adoption also hinders introducing new techniques / protocols to overcome the performance challenges.

Anatomy of Delivery Network

  • DNS mapping system: converts domain names to a nearby edge server IP address.  This mapping relies upon a huge amount of historical and current data about global network conditions.
  • Edge server: part of edge server platform and is responsible for processing requests from nearby users.  The edge server platform comprises of a large deployment of edge servers distributed across thousands of sites around the world.
  • Transport system: used to connect to the origin server (if required) with high reliability and performance in order to serve user requests.
  • Communication and control system: used to disseminate status information, control messages and configuration updates in a fault tolerant and timely fashion.
  • Data collection and analysis system: collects and analyses data from various sources such as server logs, client logs, network and server information.  Collected data can be used for monitoring, alerting, analytics, reporting and billing.
  • Management portal: helps enterprise customers to have fine grained control of how their content is served to end users.  This portal also helps enterprise customers gain visibility on how end users are interacting with their application and content, including reports on user demographics and traffic metrics.

Design Principles

  • Design for reliability: multiple levels of fault tolerance, protocols such as PAXOS for decentralized leader election.
  • Design for scalability: More than 60,000 machines across the globe.  Need to handle more traffic, content and customers.  Analyzing increasingly large volumes of log data.  Making the communication, control and mapping systems to scale to the ever increasing number of distributed machines.
  • Limit human management: Design for automatic failure recovery.  Helps keep cost low.  Currently only 60 operations personnel to manage 60,000 machines across the globe.
  • Design for performance: having fewer machines handle more traffic would help save energy costs.

High Performance Streaming

  • The edge servers are placed not only in large Tier 1 and Tier 2 data centers, but also in large number of end user ISPs.
  • Bandwidth usage for Obama inauguration was around 2 Tbps.  In a few years, the peak bandwidth usage for such one-time events is expected to reach 100 Tbps.  A single well connected data center or even a bunch of such data centers cannot handle such huge traffic.  On the other than, thousands of edge servers can serve tens of Gbps and together can easily handle such traffic.
  • Stream delivery quality is measured by Akamai using the following metrics: stream availability, startup time, frequency and duration of interruptions, effective bandwidth.  Akamai has deployed monitoring/measurement ‘agents’ around the world to simulate users playing video streams and testing the quality.
  • Tiered distribution helps reduce the load on the origin server.
  • Tiered distribution combined with overlay networks (edge server reaching out to multiple parent servers on disjoint network routes) helps enhance live streaming performance with minimal packet loss.

High Performance Application Delivery

  • Path optimization (between the origin and edge servers) helps to reduce latency.  Performance data from Akamai’s mapping system is used to decide the path(s).
  • Packet loss reduction is also achieved by path optimization.
  • TCP is tailored for high performance data transfer between Akamai servers (persistent connections, optimal TCP window size, intelligent retransmission).
  • Application level optimization like prefetching of HTML resources while the main page is served.  Compression of resource content.
  • Application protocols supported include HTTP, SSH, FTP, RDP, SSL VPN etc.

Distributing applications to the Edge

  • Running the applications themselves in the edge servers would offer the ultimate boost in performance.
  • While highly transactional applications that need to chat with origin databses may not be the best choices for edge computing, applications that do content aggregation/transformation, that deal with relatively static databases, that do just data collection are ideal candidates.
  • Even applications that are quite complex but which can minimize the DB interactions by means of caching can be pushed to the edge.

Platform Components

  • Edge server platform is highly configurable via the metadata configuration mechanism.  Configurable capabilities include: origin server url, cache control parameters, cache indexing (case sensitive, query parameters etc.), authentication/authorization, origin server failure response, edge computing, performance optimization etc.
  • Mapping system uses real-time as well as historic data about the health of the internet to decide the edge clusters for end users.  Within an edge cluster, the end user is mapped to a specific edge server based on factors including the likelihood of cache hit for requested content in that machine.  Hardware and network faults in edge servers are monitored and the failing servers are suspended till the problem gets fixed.  Mapping system itself is fault tolerant distributed platform that can survive multiple data center failures.
  • Communication and Control systems take care of real time distribution of status and control information, RPC and web services, dynamic configuration updates, key management and software and machine configuration management.
  • Data collection and analysis system takes care of log collection, real time data collection, monitoring, analytics and reporting.
  • Akamai has a global deployment of highly available and fault tolerant authoritative DNS servers.  These servers are primarily used for mapping end user IPs to edge clusters that can best handle the requests.  Further, the DNSs servers can serve customer zone records also; this is achieved by fetching the customer DNS zone records in a secure fashion.
  • Monitoring agents are deployed globally to monitor network and website performance.  Monitoring tests are configured both by mapping system (for real time network analysis) and customers (to analyze site performance).
  • Global Traffic Manager (GTM) is a DNS mapping service provided to customers who have origin servers deployed at multiple geographies.  An agent runs at customer origin servers to analyze internet performance parameters and feed into GTM, based on which end users as well as Akamai edge servers are routed to the most suitable origin server.
  • A high availability storage system takes care of storing the many content types (static files, media etc.).

Tidbits

  • Content that is less than 4.2 KB in size doesn’t benefit from compression since it would be small enough to fit into 3 data packets, which is the default size of initial TCP congestion window.  This content can be sent without any TCP ACKs.

 

Posted in Uncategorized | Tagged: , | Leave a Comment »

Full Reading : Bigtable

Posted by shiva on July 11, 2011

This is a summary of what I grasped out of the paper on Google Bigtable.

Overview

  • Bigtable stores petabytes of data and spans across thousands of machines.
  • At the time of writing of the paper, Bigtable was being used in Google Analytics, Google Earth and sixty other products.
  • Bigtable resembles a database and shares many implementation strategies, but it does not support full relational model.
  • Bigtable provides clients with simple data model with dynamic control over layout and format.  It also allows indexing using row and column names which are allowed to be arbitrary strings.  Data are treated as un-interpreted strings.
  • Clients can also dynamically control whether Bigtable serves the data from memory or from disk.

Data Model

  • A table in Bigtable is a sparse, distributed, persistent multidimensional sorted map.
  • A data cell is identified by the (row, column, timestamp) tuple.
  • Rows are stored in lexicographic order.  Transactions are only allowed per row and not across rows.  A set of lexicographically nearby rows together form a tablet – which gets stored in physical proximity.
  • There is no limit on number of Columns.  However, columns are segregated into sets called column families and the number of column families supported is finite / limited.
  • Columns are named using the convention column_family_name:column_name.  Some examples of column family names while storing web documents are anchor, language etc.
  • Timestamp is used to store the version.  Timestamp can be incremental numbers or exact time of creation of the cell.  Bigtable also holds configuration data per column family to indicate when to expire old cells (based on timestamp).
  • As an example, in case of web documents, the row names are URL of the document, the column names will be like anchor:cnnsi.com, anchor:my.look.ca, the timestamp can be arbitrary numbers.  The cells corresponding to the anchor family of columns will contain the anchor text.

Building Blocks

  • Bigtable depends upon Google Cluster Management Service for scheduling jobs, resource management on sharded machines, machine status monitoring and dealing with machine failures.
  • Bigtable uses GFS to store data and log files.
  • Bigtable uses SSTable immutable-file format.  SSTable provides a persistent, ordered immutable map from keys to values, where keys and values can be arbitrary byte strings.  SSTable provides API to do key based lookup and also iterating over the ordered set of keys between a given key range.
  • Bigtable uses Chubby as its distributed lock service provider.  Chubby is used for variety of purposes like to store bootstrap location of Bigtable data, to discover tablet servers, to finalize tablet server deaths and to store Bigtable schemas.

Implementation

  • Bigtable has 3 major components namely the client library, single master server and many tablet servers.
  • Master is responsible for assigning tablets to tablet servers, detecting tablet server addition/expiration, balancing tablet-server load and garbage collecting files in GFS.
  • Tablet location metadata is stored as a 3 level index.  The root level is stored in Chubby itself.
  • Each tablet is assigned to at most one tablet server.  The master keep track of the set of live tablet servers and current assignment of tablets to tablet servers, including which tablets are unassigned.  Whenever a tablet is to be loaded, the master identifies a free tablet server and sends the load request to it.
  • Chubby service is used to track live tablet servers.  Each tablet server when it comes up creates and holds an exclusive lock to a unique file in the Chubby service.  The master server communicates with the tablet servers to ensure that they continue to hold on to their corresponding locks.  If the tablet server has lost its lock or the master is unable to reach the tablet server, then the master server tries to get exclusive lock on the corresponding file in chubby service and deletes the lock.
  • When a new master comes up, it first acquires the master lock in Chubby service.  Then it communicates with all live tablet servers to get to know the list of tablets that they are currently service.  Also, the master reads the METADATA (SSTable) to know all the tablets.
  • Tablets are maintained as a group of SSTables and one memtable.  The SSTables are immutable while the memtable is where the latest updates go.  The latest updates are also writing into commit logs.   While loading a tablet, the server loads all the SSTables and reconstructs the memtable from the commit logs.
  • From time to time, the memtable would be frozen into a SSTable and a new memtable would get created.  Further, at times multiple SSTables and the memtabe would get compacted into a single SSTable – this is called minor compaction.  Also, all SSTables can be combined into a single SSTable – this is called major compaction.  Deleted rows would get removed during compaction phase.

Refinements

  • Column families can be associated to specific locality group in order to optimize lookup.  For e.g., in case of web documents, the metadata like anchors, language etc. would be kept in a separate locality group compared to web document content data.
  • Also, schema hints can be provided to keep column family data (SSTable) in memory for faster lookup.
  • Compression can be enabled for specific locality groups.  Two pass compression scheme is used – first pass compresses long common strings across a large window and second pass looks for repetitions in a small 16KB window of data.  Compression is done at block level so that any block can be uncompressed independently as required.  Rows are ordered in a way where data that are likely to be similar come closer.  For e.g., in case of web document storage, all documents from a single host appear in consecutive host thereby compressing better due to repeated boiler plate.
  • Caching is done at two levels.  Scan Cache is used to cache key value data for which lookups were done.  Block Cache stores SSTable blocks that were read from GFS.
  • Bloom filters can be enabled for column families.  Bloom filters help answer the question as to whether a given key/value pair might be present in a SSTable or not, thereby help avoid loading or lookup in a SSTable.
  • Commit Log file is maintained per tablet server.  During recovery, commit log is sorted in a way that would keep commit entries corresponding to nearby rows to appear closer.  Also, two commit logs are maintained per tablet server so that if one commit file is encountering hiccups due to GFS issues, the tablet server would switch to the other commit log.
  • Immutability of SSTable structure helps avoid need for locks/concurrency-checks.  It also helps easily achieve tablet splitting.  Further, deletion of rows from SSTable is handled via a mark & sweep garbage collection as a background job.

Posted in Uncategorized | Tagged: | Leave a Comment »

Singular / Plural Conversion for English words in .Net

Posted by shiva on April 21, 2011

Singular / Plural conversion services for English words is now baked into the .Net framework itself (3.5 and higher).  We should refer to System.Data.Entity.Design assembly.  However, note that we cannot expect perfection from these APIs.  The results are at best an informed guess – should be acceptable for most scenarios though.  A code sample is provided below…

PluralizationService service = PluralizationService.CreateService(CultureInfo.GetCultureInfo(1033));
Console.WriteLine(service.Pluralize("Product"));
Console.WriteLine(service.Singularize("Products"));
Console.WriteLine(service.IsSingular("Products"));
Console.WriteLine(service.IsPlural("Products"));

Also the service can be extended to add custom mappings.  Will come handy in case you want to override the default conversion rules (which may possibly be erroneous in some cases).  Below is a sample to achieve this…

PluralizationService service = PluralizationService.CreateService(CultureInfo.GetCultureInfo(1033));
((ICustomPluralizationMapping)service).AddWord("Food", "Food");
Console.WriteLine(service.Pluralize("Food"));
Console.WriteLine(service.Singularize("Food"));

Posted in Uncategorized | Tagged: | Leave a Comment »

OData Addressing Scheme

Posted by shiva on April 15, 2011

OData Addressing Scheme

Full information can be found here.

Posted in Uncategorized | Tagged: | Leave a Comment »

The Workflow Way

Posted by shiva on April 14, 2011

This is a good introduction by David Chappell on Workflow Foundation available in .Net 3.5 and .Net 4.0.  The article talks about what difference workflows make compared to regular sequential programming style.  Also, it explains how certain classes of problems naturally fit into workflows.  The article also contains ample examples to drive home the point.  A good read.

Posted in Uncategorized | Tagged: | Leave a Comment »

 
Follow

Get every new post delivered to your Inbox.