Log Structured File System – II

 

PDF : 

13-lfs

Notes :

inode : In a Unix-style file system, the inode is a data structure used to represent a filesystem object, which can be one of various things including a file or a directory. Each inode stores the attributes and disk block location(s) of the filesystem object’s data.

Log FS

write sequentially : never overwrite data

originally designed for disks

inode map is on disk as well; a fixed location (the checkpoint region) records where it is, so there is always one known place to seek to
4 bytes are required per inode in the map

garbage collection : identified through the inode. When a block is updated, the inode points to the updated block, not the original block.
So the original block is now garbage. It can be cleaned up and reused for new writes.

But this leaves the disk fragmented, with live data scattered around. This is taken care of too (segment cleaning, below).

Log FS is a great fit for flash, which also favors sequential writes over in-place overwrites

Log Structured File System – I

The driving force behind Ousterhout and Rosenblum’s Log-structured File System was (is) the mechanical limitations of the disk drive.  Unlike the processor or memory, disk drives have mechanical moving parts and are governed by the laws of Newtonian physics.  To read or write to disk, the arm first has to move to the desired track, then there’s a rotational delay until the disk spins to the relevant sector.  This access time is in the milliseconds, which is an eternity compared to memory speed or processor cycles.  Access time overhead is exacerbated when the workload is frequent, small reads and writes.  More (relative) time is spent moving the disk head around than on actual data transfer.

[Aside. Slow disk drives are one of the reasons I prefer to develop on desktops and not laptops.  You get a fancy new MacBook Pro with the latest processor and a shit load of RAM only to be bounded by I/O.  Money is better spent on the fastest disk drive you can buy.]

The situation for reads is “easily” solved with the file cache.  More memory, bigger caches, better hit rates, fewer read requests going to disk.  But more memory does not help as much with writes.  File systems can buffer more writes in memory before flushing to disk, but the flushes still need to be frequent to avoid data loss; and the writes still involve accessing random parts of the disk.

To see this clearly, below is a diagram of a traditional Unix File System involving writing two single-block files in two directories.

[Figure: the eight writes a traditional Unix FS makes for two single-block files in two directories]

 

Unix FS involves 8 random, non-sequential writes (numbered, but not in that order): 4 to the inodes and 4 to the data blocks (2 directories, 2 files).  Half of these are synchronous writes to avoid leaving the file system in an inconsistent state.  The other half can be done with an asynchronous delayed write-back.  Newer file systems have many optimizations to help with performance, like keeping inodes and data blocks closer together, but the point remains that these types of file systems suffer from the limitation of disk access time.

Ousterhout and Rosenblum’s log-structured file system gets around this by avoiding random, non-sequential writes altogether.  Writes are done asynchronously in one large sequential transfer.  This minimizes access-time latency and allows the file system to operate closer to the disk’s maximum throughput rate.  As the diagram shows, the same information is written to disk: 4 inodes and 4 data blocks (2 directories, 2 files).  But it’s written sequentially by appending to the log.  Data (both metadata like inodes and the actual file data) is never overwritten in place, just appended to the log.

This is clever and all, but how do we get the data back?!?  In the traditional Unix FS the inodes are at fixed location(s).  Given inode number 123 it’s easy to calculate its disk location with a little math, and once we have the inode location we can get the data blocks.  This doesn’t work with LFS since inodes are not fixed: they’re appended to the log just like the data blocks.  Easy enough, create an inode map that maps inodes to their locations.  Wait a second, how can we then find the location of the inode maps?  Finally, it’s time to write to a fixed location, the checkpoint region.

[Figure: the checkpoint region pointing to the inode maps in the log]

 

The checkpoint region knows the location of the active inode maps.  At startup we read in the checkpoint region, load the locations of the inode maps into memory, then load the inode maps into memory.  From then on, it’s all in-memory.  The checkpoint region is periodically written back to disk (hence “checkpoint”).  Once we have the inode maps, read requests behave much like the traditional Unix FS: look up the inode, perform access control, get the data blocks.
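
To make the lookup chain concrete, here is a minimal sketch in Python. The dict-based “disk”, the block addresses, and the field names are all invented for illustration; this is not the actual Sprite LFS layout.

Code:
# Toy in-memory model of the LFS read path described above.
disk = {}
CHECKPOINT_ADDR = 0          # the one fixed location on "disk"

# Build a tiny "log": data block, inode, inode map, checkpoint region.
disk[10] = b"hello world"                  # data block
disk[11] = {"blocks": [10]}                # inode for file 123
disk[12] = {123: 11}                       # inode map: inum -> inode addr
disk[CHECKPOINT_ADDR] = {"imap_addr": 12}  # checkpoint region

def read_file(inum):
    checkpoint = disk[CHECKPOINT_ADDR]            # 1. fixed location
    inode_map = disk[checkpoint["imap_addr"]]     # 2. find the inode map
    inode = disk[inode_map[inum]]                 # 3. find the inode
    return b"".join(disk[a] for a in inode["blocks"])  # 4. data blocks

print(read_file(123))   # b'hello world'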

In summary, read requests don’t change much and we can leverage file cache to improve performance.  Write requests, however, show dramatic improvements, especially for frequent, small write requests, since we always write sequentially in large chunks.

But the story doesn’t end quite yet.  If we always append and never overwrite in place, we will eventually run out of space unless we can reclaim free space.  Reclaiming free space sounds like memory garbage collection in programming languages; and that’s exactly what the LFS does: garbage collect.

[Figure: segment cleaning, with live blocks from segments 5 and 6 compacted into a free segment]

Imagine that segments 5 and 6 have both live and dead blocks (files that have been deleted).  The segment cleaner (garbage collector) can compact segments 5 and 6 by copying only the live blocks into an available free segment.  Each segment has a segment summary block (not shown) with information about itself to help in this process (which blocks are dead, etc.).  Then it’s just a matter of moving the links in the segment linked list to restore the order.  I’m of course hand waving here as things are more involved.  Like memory garbage collection, it’s the details and optimizations that determine whether the system is performant.  Issues like garbage collecting long-lived objects (data), when to run the collector, etc. emerge.
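
Here is a toy sketch of that compaction step, assuming liveness is already known (in a real LFS it would come from the segment summary blocks):

Code:
# Copy live blocks out of two dirty segments into a free segment,
# then recycle the old segments. Liveness is just a flag here.
seg5 = [{"data": "A", "live": True}, {"data": "x", "live": False}]
seg6 = [{"data": "B", "live": True}, {"data": "y", "live": False}]
free_seg = []

for seg in (seg5, seg6):
    free_seg.extend(blk for blk in seg if blk["live"])  # compact live data
    seg.clear()                                         # segment is free again

print([blk["data"] for blk in free_seg])   # ['A', 'B']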

 

Reference :

  1. http://work.tinou.com/2012/03/log-structured-file-system-for-dummies.html

 

What is File Level Storage vs. Block Level Storage?

The two most popular storage system technologies are file level storage and block level storage. File level storage is seen and deployed in Network Attached Storage (NAS) systems. Block level storage is seen and deployed in Storage Area Network (SAN) storage. The article below explains the major differences between file level storage and block level storage.

File Level Storage – This storage technology is most commonly used for storage systems found in hard drives, NAS systems and so on. In file level storage, the storage disk is configured with a protocol such as NFS or SMB/CIFS, and files are stored on and accessed from it in bulk.

  • File level storage is simple to use and implement.
  • It stores files and folders, and the visibility is the same to the clients accessing it and to the system which stores it.
  • It is inexpensive to maintain compared with its counterpart, block level storage.
  • Network attached storage systems usually depend on file level storage.
  • File level storage can handle access control, integrate with corporate directories, and so on.
  • “Scale-out NAS” is a type of file level storage that incorporates a distributed file system that can scale a single volume with a single namespace across many nodes. Scale-out NAS solutions can scale up to several petabytes while handling thousands of clients. As capacity is scaled out, performance is scaled up.

Block Level Storage – In block level storage, raw volumes of storage are created, and each block can be controlled as an individual hard drive. These blocks are controlled by server-based operating systems, and each block can be individually formatted with the required file system.

  • Block level storage is usually deployed in SAN (storage area network) environments.
  • It supports booting the systems connected to it.
  • Block level storage can be used to store files and can serve as storage for special applications like databases, virtual machine file systems and so on.
  • Block level data transport is efficient and reliable.
  • Block level volumes can be individually formatted with the file system required by the application, such as NTFS (Windows) or VMFS (VMware).
  • Each storage volume can be treated as an independent disk drive, controlled by an external server operating system.
  • Block level storage uses protocols such as iSCSI and FCoE for data transfer, with SCSI commands acting as the communication interface between the initiator and the target.

 

Block level storage

Anyone who has used a Storage Area Network (SAN) has probably used block level storage before. Block level storage presents itself to servers using industry standard Fibre Channel and iSCSI connectivity mechanisms. In its most basic form, think of block level storage as a hard drive in a server except the hard drive happens to be installed in a remote chassis and is accessible using Fibre Channel or iSCSI.

When it comes to flexibility and versatility, you can’t beat block level storage. In a block level storage device, raw storage volumes are created, and then the server-based operating system connects to these volumes and uses them as individual hard drives. This makes block level storage usable for almost any kind of application, including file storage, database storage, virtual machine file system (VMFS) volumes, and more. You can place any kind of file system on block level storage. So, if you’re running Windows, your volumes will be formatted with NTFS; VMware servers will use VMFS.

Although file level storage devices are often used to share files with users, block level storage can serve files too: by creating a block-based volume, installing an operating system, and attaching to that volume, you can share files out using that native operating system. Remember, when you use a block-based volume, you’re basically using a blank hard drive with which you can do anything.

When it comes to backup, many storage devices include replication-type capabilities, but you still need to think about how to protect your workloads. With this type of storage, it’s not unusual for an organization to be able to use operating system native backup tools or third-party backup tools such as Data Protection Manager (DPM) to back up files. Since the storage looks and acts like a normal hard drive, special backup steps don’t need to be taken.

With regard to management complexity, block-based storage devices tend to be more complex than their file-based counterparts; this is the tradeoff you get for the added flexibility. Block storage device administrators must:

  • Carefully manage and dole out storage on a per server basis.
  • Manage storage protection levels (i.e., RAID).
  • Track storage device performance to ensure that performance continues to meet server and application needs.
  • Manage and monitor the storage communications infrastructure (generally iSCSI or Fibre Channel).

From a use case standpoint, there are a lot of applications that make use of this block-level shared storage, including:

  • Databases. This is especially true when you want to cluster databases, since clustered databases need shared storage.
  • Exchange. Although Microsoft has made massive improvements to Exchange, the company still does not support file level or network-based (as in, CIFS or NFS) storage. Only block level storage is supported.
  • VMware. Although VMware can use file level storage via Network File System (NFS), it’s very common to deploy VMware servers that use shared VMFS volumes on block level storage.
  • Server boot. With the right kind of storage device, servers can be configured to boot from block level storage.

File level storage

Although block level storage is extremely flexible, nothing beats the simplicity of file level storage when all that’s needed is a place to dump raw files. After all, simply having a centralized, highly available, and accessible place to store files and folders remains the most critical need in many organizations. These file level devices — usually Network Attached Storage (NAS) devices — provide a lot of space at what is generally a lower cost than block level storage.

File level storage is usually accessible using common file level protocols such as SMB/CIFS (Windows) and NFS (Linux, VMware). In the block level world, you need to create a volume, deploy an OS, and then attach to the created volume; in the file level world, the storage device handles the files and folders on the device. This also means that, in many cases, the file level storage device or NAS needs to handle user access control and permissions assignment. Some devices will integrate into existing authentication and security systems.

On the backup front, file level storage devices sometimes require special handling since they might run non-standard operating systems, so keep that in mind if you decide to go the file level route.

With the caveat that you may need to take some steps with regard to authentication, permissions, and backup, file level-only devices are usually easier to set up than block level devices. In many cases, the process can be as simple as walking through a short configuration tool and moving forward.

If you’re looking for storage that screams — that is, if you need high levels of storage performance — be very careful with the file level option. In most cases, if you need high levels of performance, you should look at the block level options. Block level devices are generally configurable for capacity and performance. Although file-level devices do have a performance component, capacity is usually the bigger consideration.

File level use cases are generally:

  • Mass file storage. When your users simply need a place to store files, file-level devices can make a lot of sense.
  • VMware (think NFS). VMware hosts can connect to storage presented via NFS in addition to using block level storage.

Convergence of block and file storage

The block and file worlds are converging. Some new storage devices include both block and file-level capabilities. So if you are torn about whether to go with block or file, a hybrid/converged device might fit your needs.

 

References :

  1. http://www.iscsi.com/resources/File-Level-Storage-vs-Block-Level-Storage.asp
  2. http://www.techrepublic.com/blog/the-enterprise-cloud/block-level-storage-vs-file-level-storage-a-comparison/

 

iNodes

The “inode” is sometimes referred to as an index node. But what is it? Basically, it is a data structure on a file system. Put more simply, it is a “database” of all file information except the file contents and the file name.

In a file system, inodes consume roughly 1% of the total disk space, whether it is a whole storage unit (hard disk, thumb drive, etc.) or a partition on a storage unit. The inode space is used to “track” the files stored on the hard disk. The inode entries store metadata about each file, directory or object, but only point to these structures rather than storing the data. Each entry is 128 bytes in size. The metadata for each structure can include the following:

  • Inode number
  • Access Control List (ACL)
  • Extended attribute
  • Direct/indirect disk blocks
  • Number of blocks
  • File access, change and modification time
  • File deletion time
  • File generation number
  • File size
  • File type
  • Group
  • Number of links
  • Owner
  • Permissions
  • Status flags

NOTE: the metadata does not include the file’s name.

Rather than the name, each file, directory or object is referenced through a pointer: a unique number usually referred to as the inode number. For example, to list a file's inode number, use the following command:

Code:
$ ls -i filename

You can use the “stat” command to get more information than the inode number:

Code:
$ stat filename

A sample output is shown for both commands:

Code:
ls -i Journal.rtf
buse@Buse-PC:/media/buse/Norton$ ls -i ./Journal.rtf
160 ./Journal.rtf
Code:
stat Journal.rtf
buse@Buse-PC:/media/buse/Norton$ stat ./Journal.rtf
File: ‘./Journal.rtf’
Size: 22661 Blocks: 48 IO Block: 4096 regular file
Device: 811h/2065d Inode: 160 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/ buse) Gid: ( 1000/ buse)
Access: 2013-05-26 00:00:00.000000000 -0400
Modify: 2013-05-26 17:58:04.000000000 -0400
Change: 2013-05-26 17:58:02.180000000 -0400
Birth: -

In these cases, you can see that the inode number for the Journal.rtf file is 160 in both commands. An inode number only changes if the file is moved.

For example, the previous output came from the file Journal.rtf. If the file is moved to a different directory and the commands are executed again, the inode numbers are different:

Code:
ls -i Journal.rtf
buse@Buse-PC:/media/buse/Norton/test$ ls -i ./Journal.rtf
372 ./Journal.rtf
Code:
stat Journal.rtf
buse@Buse-PC:/media/buse/Norton/test$ stat ./Journal.rtf
File: ‘./Journal.rtf’
Size: 22661 Blocks: 48 IO Block: 4096 regular file
Device: 811h/2065d Inode: 372 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/ buse) Gid: ( 1000/ buse)
Access: 2013-05-26 00:00:00.000000000 -0400
Modify: 2013-05-26 17:58:04.000000000 -0400
Change: 2013-05-26 17:58:02.180000000 -0400
Birth: -

Now, the inode number is 372 rather than 160.

If a file exists whose name contains special characters that are not present on the keyboard, it can be difficult to remove when not using a graphical user interface (GUI). Use the command “ls -i directory” to get a listing of all the files in a directory along with their inode numbers. To delete the file using its inode number, use the following command:

Code:
find ./ -inum number -exec rm -i {} \;

Replace number with the inode number you found. The file is then deleted by its inode number rather than by its filename. Be sure to type ‘y’ (without the quotes) to confirm that the file should be deleted.

You may be asking, how are the file name and inode number associated with each other? A directory listing maps each file name it contains to an inode number. When an application needs a file, it exchanges the file name for the inode number from the directory listing. After that, the application uses the inode number as its reference to the file. The only time an application does not use the inode number is when file names are displayed on the screen.

To find the inode numbers of directories, you can use the command “tree -a -L 1 --inodes /”. The command will output something similar to the following:

Code:

buse@Buse-PC:/media/collier/Norton$ tree -a -L 1 --inodes /
/
├── [2097153] bin
├── [ 524289] boot
├── [3932161] cdrom
├── [ 1026] dev
├── [2883585] etc
├── [3801089] home
├── [ 524291] initrd.img -> boot/initrd.img-3.8.0-23-generic
├── [ 524291] initrd.img.old -> /boot/initrd.img-3.8.0-23-generic
├── [1310721] lib
├── [ 786433] lib32
├── [1966081] lib64
├── [ 11] lost+found
├── [1572865] media
├── [3014657] mnt
├── [2752513] opt
├── [ 1] proc
├── [2359297] root
├── [ 198] run
├── [3145729] sbin
├── [1048577] selinux
├── [2621441] srv
├── [ 1] sys
├── [3407873] tmp
├── [1703937] usr
├── [ 917505] var
├── [ 524547] vmlinuz -> boot/vmlinuz-3.8.0-23-generic
└── [ 524547] vmlinuz.old -> boot/vmlinuz-3.8.0-23-generic
 
23 directories, 4 files

You can see the inode number for each directory in the brackets. To see files and subdirectories within a specific directory, the ls command is modified slightly: rather than listing a file name as was done previously, the directory name is used instead, like the following:
ls -i directory
Simply replace directory with the name of the directory you wish to view. The subdirectories and files will be listed with their inode numbers. Just as with files, you can also delete directories by their inode numbers.

The inodes do not manage data redundancy, but when inode data is lost, the file is “moved” to the “lost+found” directory of the storage unit. The filename is moved from the present directory listing to the lost+found directory listing while the inode number is changed, if needed, and matched to the file name in the new directory.

Keep in mind that “moving” a file is done by changing the file name in the directory list and updating the inode number if needed. The inode number is only in two places, the inode structure and the directory listing.

Besides inode corruption, another concern is that the drive space can fill up, or, more seriously, the inode structure can fill up. The inode structure has limited space and can fill before the data portion of the storage unit does. For example, a storage unit can contain numerous small files. Each file takes up 128 bytes of the inode structure. If the inode structure fills up before the data storage of the disk, no more files can be copied to the disk. Once inode storage is freed in the structure, the storage unit can have files written to it again.

Inode numbers are unique, but you may have noticed that some file name and inode number listings do show some files with the same number. The duplication is caused by hard links. Hard links are made when a file name is linked into multiple directories, so the same file appears in various directories on the same storage unit. The directory listings show two names with the same number, which links them to the same physical file on the storage unit. Hard links allow the same file to “exist” in multiple directories while only one physical file exists, saving space on the storage unit. For example, if a one megabyte file is linked into two different directories, the space used on the storage unit is one megabyte, not two megabytes.
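
A quick way to see this for yourself, sketched in Python (POSIX systems only; the files are created in a temporary directory):

Code:
# Two hard-linked names share one inode.
import os, tempfile

d = tempfile.mkdtemp()
a = os.path.join(d, "a.txt")
b = os.path.join(d, "b.txt")
with open(a, "w") as f:
    f.write("hello")
os.link(a, b)                                   # second name, same file
print(os.stat(a).st_ino == os.stat(b).st_ino)   # True: same inode number
print(os.stat(a).st_nlink)                      # 2: two directory entries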

Deleting a file causes the size and direct/indirect block entries to be zeroed, and the physical space on the storage unit is marked as unused. To undelete the file, the metadata is restored from the journal if one is used (see the Journal article). Once the metadata is restored, the file is once again accessible, unless the physical data has been overwritten on the storage unit.

The direct block entries point to the first blocks of the physical file. The indirect blocks hold a listing of every additional block that contains a portion of the file. You can imagine the list can be extensive for very large files, but this problem can be resolved with extents (see the Extents article).
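
As a rough worked example, assuming ext2-style parameters (12 direct pointers, 4-byte block pointers, 4 KB blocks; these numbers are illustrative, not universal):

Code:
# Back-of-the-envelope capacity math for direct vs. indirect addressing.
BLOCK = 4096
PTR = 4
direct = 12 * BLOCK                        # 49,152 bytes via direct blocks
single_indirect = (BLOCK // PTR) * BLOCK   # 1024 pointers -> 4 MB more
print(direct, single_indirect)             # 49152 4194304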


NTFS vs FAT

NTFS

Short for New Technology File System, NTFS is a proprietary file system developed by Microsoft. It is a file organizational system that stores and accesses information on the Microsoft Windows NT, Windows 2000, and Windows XP operating systems. NTFS offers better methods of data protection and file recovery than the earlier FAT file system versions.

NTFS is the preferred file system for modern versions of Windows. It has many benefits over the earlier FAT32 file system, including:

  • The capability to recover from some disk-related errors automatically, which FAT32 cannot.

  • Improved support for larger hard disks.

  • Better security because you can use permissions and encryption to restrict access to specific files to approved users.

FAT

Short for File Allocation Table, FAT is a method of keeping track of the contents of a hard drive used by early Microsoft operating systems. The table is a chart of numbers that correspond to cluster addresses on the hard drive. One of its later variants, FAT32, is described below.

FAT32

An enhanced File Allocation Table utilizing 28-bit cluster addressing, first used in Windows 95 OSR2 and Windows 98, that saves disk space by using 4 KB clusters.

Today, later versions of Microsoft Windows, such as Windows XP, Vista, and 7, use NTFS rather than FAT.

Reference :

  1. http://www.computerhope.com/jargon/f/fat.htm
  2. http://windows.microsoft.com/en-us/windows-vista/comparing-ntfs-and-fat-file-systems
  3. http://www.howtogeek.com/177529/htg-explains-why-are-removable-drives-still-using-fat32-instead-of-ntfs/

 

Cache Consistency Problem in DFS

The concept of caching is simple. If the data needed to satisfy the access request are not already cached, then a copy of those data is brought from the server to the client system. Accesses are performed on the cached copy. The idea is to retain recently accessed disk blocks in the cache, so that repeated accesses to the same information can be handled locally, without additional network traffic. A replacement policy (for example, the least-recently-used algorithm) keeps the cache size bounded. No direct correspondence exists between accesses and traffic to the server. Files are still identified with one master copy residing at the server machine, but copies (or parts) of the file are scattered in different caches. When a cached copy is modified, the changes need to be reflected on the master copy to preserve the relevant consistency semantics. The problem of keeping the cached copies consistent with the master file is the cache-consistency problem. DFS caching could just as easily be called network virtual memory.

Cache Consistency

A client machine is sometimes faced with the problem of deciding whether a locally cached copy of data is consistent with the master copy (and hence can be used). If the client machine determines that its cached data are out of date, it must cache an up-to-date copy of the data before allowing further accesses.

There are two approaches to verifying the validity of cached data:

1. Client-initiated approach.

The client initiates a validity check, in which it contacts the server and checks whether the local data are consistent with the master copy. The frequency of the validity checking is the crux of this approach and determines the resulting consistency semantics. It can range from a check before every access to a check only on first access to a file (on file open, basically). Every access coupled with a validity check is delayed, compared with an access served immediately by the cache. Alternatively, checks can be initiated at fixed time intervals. Depending on its frequency, the validity check can load both the network and the server.
2. Server-initiated approach.

The server records, for each client, the files (or parts of files) that it caches. When the server detects a potential inconsistency, it must react. A potential for inconsistency occurs when two different clients cache a file in conflicting modes. If UNIX semantics is implemented, we can resolve the potential inconsistency by having the server play an active role. The server must be notified whenever a file is opened, and the intended mode (read or write) must be indicated for every open. The server can then act when it detects that a file has been opened simultaneously in conflicting modes by disabling caching for that particular file. Actually, disabling caching results in switching to a remote-service mode of operation.
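
A minimal sketch of the client-initiated approach, with version stamps standing in for whatever consistency metadata a real DFS would compare over an RPC:

Code:
# The "server" here is a local dict; a real DFS would make a remote call.
server = {"fileA": {"version": 3, "data": b"new contents"}}
cache = {"fileA": {"version": 2, "data": b"old contents"}}

def read(name):
    if name in cache and cache[name]["version"] == server[name]["version"]:
        return cache[name]["data"]        # cached copy is still valid
    cache[name] = dict(server[name])      # refetch the up-to-date copy
    return cache[name]["data"]

print(read("fileA"))   # b'new contents' -- the stale copy was replaced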

Cache Location

Where should the cached data be stored—on disk or in main memory? Disk caches have one clear advantage over main-memory caches: they are reliable. Modifications to cached data are lost in a crash if the cache is kept in volatile memory. Moreover, if the cached data are kept on disk, they are still there during recovery, and there is no need to fetch them again. Main-memory caches have several advantages of their own, however:

• Main-memory caches permit workstations to be diskless.
• Data can be accessed more quickly from a cache in main memory than from one on a disk.
• Technology is moving toward larger and less expensive memory. The resulting performance speedup is predicted to outweigh the advantages of disk caches.
• The server caches (used to speed up disk I/O) will be in main memory regardless of where user caches are located; if we use main-memory caches on the user machine, too, we can build a single caching mechanism for use by both servers and users.

The NFS protocol and most implementations do not provide disk caching.

Cache-Update Policy

The policy used to write modified data blocks back to the server’s master copy has a critical effect on the system’s performance and reliability. The simplest policy is to write data through to disk as soon as they are placed in any cache. The advantage of a write-through policy is reliability: little information is lost when a client system crashes. However, this policy requires each write access to wait until the information is sent to the server, so it causes poor write performance. Caching with write-through is equivalent to using remote service for write accesses and exploiting caching only for read accesses.

An alternative is the delayed-write policy, also known as write-back caching,where we delay updates to the master copy. Modifications are written to the cache and then are written through to the server at a later time. This policy has two advantages over write-through. First, because writes are made to the cache, write accesses complete much more quickly. Second, data may be overwritten before they are written back, in which case only the last update needs to be written at all. Unfortunately, delayed-write schemes introduce
reliability problems, since unwritten data are lost whenever a user machine crashes.

Variations of the delayed-write policy differ in when modified data blocks are flushed to the server. One alternative is to flush a block when it is about to be ejected from the client’s cache. This option can result in good performance, but some blocks can reside in the client’s cache a long time before they are written back to the server. A compromise between this alternative and the write-through policy is to scan the cache at regular intervals and to flush blocks that have been modified since the most recent scan, just as UNIX scans its local cache. Sprite uses this policy with a 30-second interval. NFS uses the policy for file data, but once a write is issued to the server during a cache flush, the write must reach the server’s disk before it is considered complete. NFS treats metadata (directory data and file-attribute data) differently. Any metadata changes are issued synchronously to the server. Thus, file-structure loss and directory-structure corruption are avoided when a client or the server crashes.

Yet another variation on delayed write is to write data back to the server when the file is closed. This write-on-close policy is used in OpenAFS. In the case of files that are open for short periods or are modified rarely, this policy does not significantly reduce network traffic. In addition, the write-on-close policy requires the closing process to delay while the file is written through, which reduces the performance advantages of delayed writes. For files that are open for long periods and are modified frequently, however, the performance advantages of this policy over delayed write with more frequent flushing are
apparent.
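
The interval-flush flavor of delayed write can be sketched in a few lines; the dictionaries below stand in for the client cache and the server's disk, and the one-second interval is arbitrary:

Code:
# Write-back caching with a periodic flush, in the spirit of the
# interval-scan policy described above.
import threading, time

dirty = {}          # block -> data not yet sent to the "server"
server_disk = {}

def write(block, data):
    dirty[block] = data                   # completes fast: memory only

def flusher(interval=1.0):
    while True:
        time.sleep(interval)
        for block in list(dirty):
            server_disk[block] = dirty.pop(block)   # flush modified blocks

threading.Thread(target=flusher, daemon=True).start()
write(7, b"abc")    # returns immediately; reaches server_disk on next scan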

Hadoop & HDFS

Hadoop Distributed File System – HDFS

The Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS is a distributed file system that provides high-performance access to data across Hadoop clusters. Like other Hadoop-related technologies, HDFS has become a key tool for managing pools of big data and supporting big data analytics applications.

Because HDFS typically is deployed on low-cost commodity hardware, server failures are common. The file system is designed to be highly fault-tolerant, however, by facilitating the rapid transfer of data between compute nodes and enabling Hadoop systems to continue running if a node fails. That decreases the risk of catastrophic failure, even in the event that numerous nodes fail.

When HDFS takes in data, it breaks the information down into separate pieces and distributes them to different nodes in a cluster, allowing for parallel processing. The file system also copies each piece of data multiple times and distributes the copies to individual nodes, placing at least one copy on a different server rack than the others. As a result, the data on nodes that crash can be found elsewhere within a cluster, which allows processing to continue while the failure is resolved.

HDFS is built to support applications with large data sets, including individual files that reach into the terabytes. It uses a master/slave architecture, with each cluster consisting of a single NameNode that manages file system operations and supporting DataNodes that manage data storage on individual compute nodes.

An example of HDFS

Think of a file that contains the phone numbers for everyone in the United States; the people with a last name starting with A might be stored on server 1, B on server 2, and so on. In a Hadoop world, pieces of this phonebook would be stored across the cluster, and to reconstruct the entire phonebook, your program would need the blocks from every server in the cluster. To achieve availability as components fail, HDFS replicates these smaller pieces onto two additional servers by default. (This redundancy can be increased or decreased on a per-file basis or for a whole environment; for example, a development Hadoop cluster typically doesn’t need any data redundancy.) This redundancy offers multiple benefits, the most obvious being higher availability.

In addition, this redundancy allows the Hadoop cluster to break work up into smaller chunks and run those jobs on all the servers in the cluster for better scalability. Finally, you get the benefit of data locality, which is critical when working with large data sets.
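
A toy sketch of the split-and-replicate step, assuming simple round-robin placement (the real NameNode policy, including rack awareness, is more involved):

Code:
# Split data into fixed-size blocks and give each block 3 replicas
# (the HDFS default), spread across consecutive nodes.
def place_blocks(data, block_size, num_nodes, replicas=3):
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    placement = {}
    for n in range(len(blocks)):
        placement[n] = [(n + r) % num_nodes for r in range(replicas)]
    return placement

print(place_blocks(b"x" * 10, block_size=4, num_nodes=5))
# {0: [0, 1, 2], 1: [1, 2, 3], 2: [2, 3, 4]}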

What is Apache Hadoop ?

Hadoop, formally called Apache Hadoop, is an Apache Software Foundation project and open source software platform for scalable, distributed computing. Hadoop can provide fast and reliable analysis of both structured data and unstructured data. Given its capabilities to handle large data sets, it’s often associated with the phrase big data.

The Apache Hadoop software library is essentially a framework that allows for the distributed processing of large datasets across clusters of computers using a simple programming model. Hadoop can scale up from single servers to thousands of machines, each offering local computation and storage.

The Apache Hadoop software library can detect and handle failures at the application layer, so it can deliver a highly-available service on top of a cluster of computers, each of which may be prone to failures.

As listed on the Apache Hadoop project website, the Apache Hadoop project includes a number of subprojects, including:

Hadoop Map Reduce

Hadoop MapReduce (Hadoop Map/Reduce) is a software framework for distributed processing of large data sets on compute clusters of commodity hardware. It is a sub-project of the Apache Hadoop project. The framework takes care of scheduling tasks, monitoring them and re-executing any failed tasks.

According to The Apache Software Foundation, the primary objective of Map/Reduce is to split the input data set into independent chunks that are processed in a completely parallel manner. The Hadoop MapReduce framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system.

What is MapReduce?

MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce task takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task is always performed after the map job.

The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to use the MapReduce model.
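
The canonical illustration is word count. Here is a single-process sketch in Python, with the shuffle step reduced to grouping by key; Hadoop would distribute the mapper calls across nodes:

Code:
from collections import defaultdict

def mapper(line):
    return [(word, 1) for word in line.split()]    # emit (key, value) tuples

def reducer(pairs):
    totals = defaultdict(int)
    for word, count in pairs:                      # grouped after "shuffle"
        totals[word] += count
    return dict(totals)

lines = ["the cat sat", "the cat ran"]
pairs = [kv for line in lines for kv in mapper(line)]
print(reducer(pairs))   # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}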

The Algorithm

  • Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
  • A MapReduce program executes in three stages: the map stage, the shuffle stage, and the reduce stage.
    • Map stage : The map or mapper’s job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
    • Reduce stage : This stage is the combination of the Shuffle stage and the Reduce stage. The reducer’s job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
  • During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.
  • The framework manages all the details of data-passing such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
  • Most of the computing takes place on nodes with the data on local disks, which reduces network traffic.
  • After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.
[Figure: MapReduce algorithm]

  • NameNode – Node that manages the Hadoop Distributed File System (HDFS).
  • DataNode – Node where the data is presented in advance, before any processing takes place.
  • MasterNode – Node where the JobTracker runs and which accepts job requests from clients.
  • SlaveNode – Node where the Map and Reduce programs run.

Reference :

  1. https://www-01.ibm.com/software/data/infosphere/hadoop/hdfs/
  2. http://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm

 

 

Buffer Vs Cache

Buffer :

A buffer, of course, is a memory area that stores data being transferred between two devices or between a device and an application. Buffering is done for three
reasons. One reason is to cope with a speed mismatch between the producer and
consumer of a data stream. Suppose, for example, that a file is being received
via modem for storage on the hard disk. The modem is about a thousand
times slower than the hard disk. So a buffer is created in main memory to
accumulate the bytes received from the modem. When an entire buffer of data
has arrived, the buffer can be written to disk in a single operation. Since the
disk write is not instantaneous and the modem still needs a place to store
additional incoming data, two buffers are used. After the modem fills the first
buffer, the disk write is requested. The modem then starts to fill the second
buffer while the first buffer is written to disk. By the time the modem has filled
the second buffer, the disk write from the first one should have completed,
so the modem can switch back to the first buffer while the disk writes the
second one. This double buffering decouples the producer of data from the
consumer, thus relaxing timing requirements between them.
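
A sketch of this double-buffering scheme using two reusable buffers and a pair of queues (the producer stands in for the modem, the consumer for the disk):

Code:
import queue, threading

empty, full = queue.Queue(), queue.Queue()
for _ in range(2):
    empty.put(bytearray(4096))            # exactly two buffers, as above

def producer(chunks):
    for chunk in chunks:
        buf = empty.get()                 # block until a buffer is free
        buf[:len(chunk)] = chunk
        full.put((buf, len(chunk)))
    full.put(None)                        # end-of-stream marker

def consumer():
    while (item := full.get()) is not None:
        buf, n = item                     # "write" buf[:n] to disk here
        empty.put(buf)                    # recycle the buffer

threading.Thread(target=consumer, daemon=True).start()
producer([b"data1", b"data2", b"data3"])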

A second use of buffering is to provide adaptations for devices that
have different data-transfer sizes. Such disparities are especially common in
computer networking, where buffers are used widely for fragmentation and
reassembly of messages. At the sending side, a large message is fragmented into small network packets. The packets are sent over the network, and the
receiving side places them in a reassembly buffer to form an image of the
source data.

A third use of buffering is to support copy semantics for application I/O.
An example will clarify the meaning of “copy semantics.” Suppose that an
application has a buffer of data that it wishes to write to disk. It calls the
write() system call, providing a pointer to the buffer and an integer specifying
the number of bytes to write. After the system call returns, what happens if
the application changes the contents of the buffer? With copy semantics, the
version of the data written to disk is guaranteed to be the version at the
time of the application system call, independent of any subsequent changes
in the application’s buffer. A simple way in which the operating system can
guarantee copy semantics is for the write() system call to copy the application
data into a kernel buffer before returning control to the application. The disk
write is performed from the kernel buffer, so that subsequent changes to the
application buffer have no effect. Copying of data between kernel buffers and
application data space is common in operating systems, despite the overhead
that this operation introduces, because of the clean semantics. The same effect
can be obtained more efficiently by clever use of virtual memory mapping and
copy-on-write page protection.
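
The essence of the kernel-buffer approach in a few lines of Python: copy the caller's buffer at write() time, so later mutations do not matter.

Code:
pending = []

def write(buf):
    pending.append(bytes(buf))    # kernel-style copy into a private buffer

data = bytearray(b"version 1")
write(data)
data[:] = b"version 2"            # application reuses its buffer
print(pending[0])                 # b'version 1' -- the copy is unaffected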

Caching

A cache is a region of fast memory that holds copies of data. Access to the cached
copy is more efficient than access to the original. For instance, the instructions of the currently running process are stored on disk, cached in physical memory,
and copied again in the CPU’s secondary and primary caches.

The difference between a buffer and a cache is that a buffer may hold the only existing copy of a data item, whereas a cache, by definition, holds a copy on faster storage of
an item that resides elsewhere.

Caching and buffering are distinct functions, but sometimes a region of memory can be used for both purposes. For instance, to preserve copy semantics and to enable efficient scheduling of disk I/O, the operating system uses buffers in main memory to hold disk data. These buffers are also used as a cache, to improve the I/O efficiency for files that are shared by applications or that are being written and reread rapidly. When the kernel receives a file I/O request, the kernel first accesses the buffer cache to see whether that region of the file is already available in main memory. If it is, a physical disk I/O can be avoided or deferred. Also, disk writes are accumulated in the buffer cache for several seconds, so that large transfers are gathered to allow efficient write schedules.

A spool is a buffer that holds output for a device, such as a printer, that cannot
accept interleaved data streams. Although a printer can serve only one job
at a time, several applications may wish to print their output concurrently,
without having their output mixed together. The operating system solves this
problem by intercepting all output to the printer. Each application’s output
is spooled to a separate disk file. When an application finishes printing, the
spooling system queues the corresponding spool file for output to the printer.
The spooling system copies the queued spool files to the printer one at a time. In
some operating systems, spooling is managed by a system daemon process. In
others, it is handled by an in-kernel thread. In either case, the operating system
provides a control interface that enables users and system administrators to
display the queue, remove unwanted jobs before those jobs print, suspend
printing while the printer is serviced, and so on.
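
A minimal spooler sketch: jobs are queued whole, and a daemon thread feeds them to the single printer one at a time.

Code:
import queue, threading

spool = queue.Queue()

def submit_job(name, pages):
    spool.put((name, pages))      # queued only once the output is complete

def printer_daemon():
    while True:
        name, pages = spool.get() # one job at a time, never interleaved
        for page in pages:
            pass                  # send each page to the printer device
        spool.task_done()

threading.Thread(target=printer_daemon, daemon=True).start()
submit_job("report", ["page 1", "page 2"])
spool.join()                      # wait until the job has printed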

File System Concepts

File Concept :

  • The operating system abstracts from the physical properties of its storage devices to define a logical storage unit, the file.
  • Files are mapped by the operating system onto physical devices. These storage devices are usually nonvolatile, so the contents are persistent between system reboots.
  • A file is a named collection of related information that is recorded on
    secondary storage
  • A file has a certain defined structure, which depends on its type. A text file is a sequence of characters organized into lines (and possibly pages). A source file is a sequence of functions, each of which is further organized as declarations followed by executable statements. An executable file is a series of code sections that the loader can bring into memory and execute.
  • A file system can be created on each of these parts of the disk. Any entity containing a file system is generally known as a volume. The volume may be a subset of a device, a whole device, or multiple devices linked together into a RAID set. Each volume can be thought of as a virtual disk.
  • Volumes can also store multiple operating systems, allowing a system to boot and run more than one operating system.
  • Path names can be of two types: absolute and relative. An absolute path
    name begins at the root and follows a path down to the specified file, giving
    the directory names on the path. A relative path name defines a path from the
    current directory.
  • The first file-sharing method implemented involves manually transferring files between machines via programs like ftp. The second major method uses a distributed file system in which remote directories are visible from a local machine. The third method is through the WWW.
  • Once the remote file system is mounted, file operation requests are sent on behalf of the user across the network to the server via the DFS protocol.

File Access Mechanisms

File access mechanism refers to the manner in which the records of a file may be accessed. There are several ways to access files:

  • Sequential access
  • Direct/Random access
  • Indexed sequential access

Sequential access

Sequential access is that in which the records are accessed in some sequence, i.e., the information in the file is processed in order, one record after the other. This access method is the most primitive one. Example: compilers usually access files in this fashion.

Direct/Random access

  • Random access file organization provides access to records directly.
  • Each record has its own address in the file, by means of which it can be directly accessed for reading or writing.
  • The records need not be in any sequence within the file, and they need not be in adjacent locations on the storage medium.

Indexed sequential access

  • This mechanism is built on top of sequential access.
  • An index is created for each file which contains pointers to various blocks.
  • Index is searched sequentially and its pointer is used to access the file directly.
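
A small sketch of the indexed sequential lookup just described, with a three-entry index over three blocks:

Code:
index = [("a", 0), ("h", 1), ("p", 2)]     # (first key in block, block no.)
blocks = [["apple", "fig"], ["hat", "kite"], ["pear", "zebra"]]

def lookup(key):
    target = 0
    for first_key, blkno in index:         # sequential scan of the index
        if key >= first_key:
            target = blkno
    return [rec for rec in blocks[target] if rec == key]  # direct access

print(lookup("kite"))   # ['kite']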

Space Allocation

Files are allocated disk space by the operating system. Operating systems deploy the following three main ways to allocate disk space to files.

  • Contiguous Allocation
  • Linked Allocation
  • Indexed Allocation

Contiguous Allocation

  • Each file occupies a contiguous address space on disk.
  • Assigned disk addresses are in linear order.
  • Easy to implement.
  • External fragmentation is a major issue with this type of allocation technique.

Linked Allocation

  • Each file carries a list of links to disk blocks.
  • The directory contains a link (pointer) to the first block of a file.
  • No external fragmentation.
  • Effective for sequential access files.
  • Inefficient for direct access files.
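
A miniature of the chained-block scheme just described; the dict maps block numbers to (data, next-block) pairs:

Code:
blocks = {7: ("data-A", 12), 12: ("data-B", 30), 30: ("data-C", None)}

def read_file(first_block):
    out, cur = [], first_block
    while cur is not None:
        data, nxt = blocks[cur]            # follow the chain
        out.append(data)
        cur = nxt
    return out

print(read_file(7))   # ['data-A', 'data-B', 'data-C']
# Random access to block k requires walking k links, hence the
# inefficiency for direct access noted above.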

Indexed Allocation

  • Provides a solution to the problems of contiguous and linked allocation.
  • An index block is created, containing all the pointers to the file's blocks.
  • Each file has its own index block which stores the addresses of disk space occupied by the file.
  • Directory contains the addresses of index blocks of files.

 Distributed Information Systems

To make client-server systems easier to manage, distributed information systems, also known as distributed naming services, provide unified access to the information needed for remote computing. The domain name system provides host-name-to-network-address translations for the entire Internet.

Summary from Text Book :
A file is an abstract data type defined and implemented by the operating
system. It is a sequence of logical records. A logical record may be a byte, a line
(of fixed or variable length), or a more complex data item. The operating system
may specifically support various record types or may leave that support to the
application program.
The major task for the operating system is to map the logical file concept
onto physical storage devices such as magnetic disk or tape. Since the physical
record size of the device may not be the same as the logical record size, it may
be necessary to order logical records into physical records. Again, this task may
be supported by the operating system or left for the application program.
Each device in a file system keeps a volume table of contents or a device
directory listing the location of the files on the device. In addition, it is useful
to create directories to allow files to be organized. A single-level directory
in a multiuser system causes naming problems, since each file must have a
unique name. A two-level directory solves this problem by creating a separate
directory for each user’s files. The directory lists the files by name and includes
the file’s location on the disk, length, type, owner, time of creation, time of last
use, and so on.
The natural generalization of a two-level directory is a tree-structured
directory. A tree-structured directory allows a user to create subdirectories
to organize files. Acyclic-graph directory structures enable users to share
subdirectories and files but complicate searching and deletion. A general graph
structure allows complete flexibility in the sharing of files and directories but
sometimes requires garbage collection to recover unused disk space.
Disks are segmented into one or more volumes, each containing a file
system or left “raw.” File systems may be mounted into the system’s naming structures to make them available. The naming scheme varies by operating
system. Once mounted, the files within the volume are available for use. File
systems may be unmounted to disable access or for maintenance.
File sharing depends on the semantics provided by the system. Files may
have multiple readers, multiple writers, or limits on sharing. Distributed file
systems allow client hosts to mount volumes or directories from servers, as long
as they can access each other across a network. Remote file systems present
challenges in reliability, performance, and security. Distributed information
systems maintain user, host, and access information so that clients and servers
can share state information to manage use and access.
Since files are the main information-storage mechanism in most computer
systems, file protection is needed. Access to files can be controlled separately
for each type of access—read, write, execute, append, delete, list directory,
and so on. File protection can be provided by access lists, passwords, or other
techniques.
To improve I/O efficiency, I/O transfers between memory and disk are
performed in units of blocks. Each block has one or more sectors. Depending on the disk drive, sector size varies from 32 bytes to 4,096 bytes; the usual size
is 512 bytes.
The basic file system needs only to issue generic commands to the
appropriate device driver to read and write physical blocks on the disk. Each
physical block is identified by its numeric disk address (for example, drive 1,
cylinder 73, track 2, sector 10).

File System – OS Concepts

 

Reference :

  1. https://www.scribd.com/doc/76762983/Operating-Systems-Files-Concept-implementing-File-Systems#scribd
  2. https://www.scribd.com/user/18149086/Mukesh