SlideShare uma empresa Scribd logo
1 de 4
Baixar para ler offline
Part 0 – Answer important argument
                          Art of Distributed
                       ‘Why Distributed Computing’
                          Haytham ElFadeel - Hfadeel@acm.org




Abstract
Day after day distributed systems tend to appear in mainstream systems. In spite of ma-
turity of distributed frameworks and tools, a lot of distributed technologies still ambi-
guous for a lot of people, especially if we talks about complicated or new stuff.
Lately I have been getting a lot of questions about distributed computing, especially dis-
tributed computing models, and MapReduce, such as: What is MapReduce? Can Ma-
pReduce fit in all situations? How can we compares it with other technologies such as
Grid Computing? And what is the best solution to our situation? So I decided to write
about distributed computing article into two parts. The first, about distributed compu-
ting models and the differences between them. The second part, I will discuss reliability
of the distrusted system and distributed storage systems. So let’s start…

Introduction:
After publish the part 1 of this article that entitled ‘Rethinking about the Distributed
Computing’, I received a few message doubting about Distributed Computing. So I de-
cide to go back and write Part 0 – Answer important argument ‘Why Distributed Com-
puting’.

Why Distributed Computing:
A log of challenges require involving of huge amount of Data, Image if you want to
crawling the whole Web, perform Monte Carlo simulation, or render movie animation,
etc. Such challenges and many others cannot perform in one PC even.
Usually we resort to Distributed Computing when we want deal with huge amount of
data. So the question is what is the big data?




Art of Distributed - Part 0: Why Distributed Computing.
Haytham ElFadeel - Hfadeel@acm.org
The Big Data - Experience with single machine:
Today computers have several Gigabytes or a few Terabytes of storage system capacity.
But what is ‘big data’ anyway? Gigabytes? Terabytes? Petabytes?
A database on the order of 100 GB would not be considered trivially small even today,
although hard drives capable of storing 10 times as much can be had for less than $100
at any computer store. The U.S. Census database included many different datasets of
varying sizes, but let’s simplify a bit: 100 gigabytes is enough to store at least the basic
demographic information: age, sex, income, ethnicity, language, religion, housing status,
and location, packed in a 128-bit record—for every living human being on this planet.
This would create a table of 6.75 billion rows and maybe 10 columns. Should that still be
considered ‘big data’? It depends, on what you’re trying to do with it. More important-
ly, any competent programmer could in a few hours write a simple, unoptimized appli-
cation on a $500 desktop PC with minimal CPU and RAM that could crunch through that
dataset and return answers to simple aggregation queries such as ‘what is the median
age by sex for each country?’ with perfectly reasonable performance.
To demonstrate this, I tried it with fake data, a file consisting of 6.75 billion 16-byte (i.e.
128-bit) records containing uniformly distributed random data. Since a seven-bit age
field allows a maximum of 128 possible values, one bit for sex allows only two, and eight
bits for country allows up to 256 option, we can calculate the median age by using a
counting strategy: simply create 65,536 buckets-one for each combination of age, sex,
and country-and count how many records fall into each. We find the median age by de-
termining, for each sex and country group, the cumulative count over the 128 age buck-
ets: the median is the bucket where the count reaches half of the total. [See figure 1].




 In my tests, this algorithm was limited primarily by the speed of disk fetching: In aver-
age 90-megabyte-per-second sustained read speed, shamefully underutilizing the CPU
the whole time.



Art of Distributed - Part 0: Why Distributed Computing.
Haytham ElFadeel - Hfadeel@acm.org
In fact, our table will fit in the memory of a single, $15,000 Dell server with 128-GB
RAM. Running off in-memory data, my simple median-age-by-sex-and-country program
completed in less than a minute. By such measures, I would hesitate to call this ‘big da-
ta’. This application will need in average between 725-second, and 1000-second to per-
form the above quire. But what if the data bigger than this, you want perform more
complex quires, or what if this data stored into Database system which means every
record will take only 128-bit??? Actually if you add more data or make the query more
complex the time will likely to get exponential or at least multiply!
So the question answer is: It’s depends on what you’re trying to do with the Data.

Hard Limit:
Hardware and software failures are important limit, there is no way to predict or protect
your system from the failures. Even if you use a High-end Server or expensive hardware
the failure is likely to occur. So in single machine you likely are going to lose your whole
work! But if you distribute the work across many machines you will lose one part of the
work and you can resume it into another machine. Actually the modern distributed
strategies show its ability and reliable to handle the hardware and software failures,
Systems and Computing Model such as: MapReduce, Dryad, Hadoop, Dynamo, Hadoop
all designed to handle the unexpected failures.
Of course, hardware—chiefly memory and CPU limitations—is often a major factor in
software limits on dataset size. Many applications are designed to read entire datasets
into memory and work with them there; a good example of this is the popular statistical
computing environment R.7 Memory-bound applications naturally exhibit higher per-
formance than disk-bound ones (at least insofar as the data-crunching they carry out
advances beyond single-pass, purely sequential processing), but requiring all data to fit
in memory means that if you have a dataset larger than your installed RAM, you’re out
of luck. So in this case do you will wait until the next generation of the hardware come?!
On most hardware platforms, there’s a much harder limit on memory expansion than
disk expansion: the motherboard has only so many slots to fill.
The problem often goes further than this, however. Like most other aspects of comput-
er hardware, maximum memory capacities increase with time; 32 GB is no longer a rare
configuration for a desktop workstation, and servers are frequently configured with far
more than that. There is no guarantee, however, that a memory-bound application will
be able to use all installed RAM. Even under modern 64-bit operating systems, many




Art of Distributed - Part 0: Why Distributed Computing.
Haytham ElFadeel - Hfadeel@acm.org
applications today (e.g., R under Windows) have only 32-bit executables and are limited
to 4-GB address spaces—this often translates into a 2- or 3-GB working set limitation.
Finally, even where a 64-bit binary is available—removing the absolute address space
limitation—all too often relics from the age of 32-bit code still pervade software, partic-
ularly in the use of 32-bit integers to index array elements. Thus, for example, 64-bit
versions of R (available for Linux and Mac) use signed 32-bit integers to represent
lengths, limiting data frames to at most 231-1, or about 2 billion rows. Even on a 64-bit
system with sufficient RAM to hold the data, therefore, a 6.75-billion-row dataset such
as the earlier world census example ends up being too big for R to handle.




Art of Distributed - Part 0: Why Distributed Computing.
Haytham ElFadeel - Hfadeel@acm.org

Mais conteúdo relacionado

Destaque

P0 n 02 re 01 recepción de materias primas y productos semielaborados (2) (1)
P0 n 02 re 01 recepción de materias primas y productos semielaborados (2) (1)P0 n 02 re 01 recepción de materias primas y productos semielaborados (2) (1)
P0 n 02 re 01 recepción de materias primas y productos semielaborados (2) (1)transportemultisolucione
 
Do not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar textDo not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar textGeorge Ang
 
Wrapper induction construct wrappers automatically to extract information f...
Wrapper induction   construct wrappers automatically to extract information f...Wrapper induction   construct wrappers automatically to extract information f...
Wrapper induction construct wrappers automatically to extract information f...George Ang
 
Huffman coding
Huffman codingHuffman coding
Huffman codingGeorge Ang
 
Opinion mining and summarization
Opinion mining and summarizationOpinion mining and summarization
Opinion mining and summarizationGeorge Ang
 
Resultados de las elecciones en Bolivia
Resultados de las elecciones en BoliviaResultados de las elecciones en Bolivia
Resultados de las elecciones en BoliviaJesús Alanoca
 

Destaque (6)

P0 n 02 re 01 recepción de materias primas y productos semielaborados (2) (1)
P0 n 02 re 01 recepción de materias primas y productos semielaborados (2) (1)P0 n 02 re 01 recepción de materias primas y productos semielaborados (2) (1)
P0 n 02 re 01 recepción de materias primas y productos semielaborados (2) (1)
 
Do not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar textDo not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar text
 
Wrapper induction construct wrappers automatically to extract information f...
Wrapper induction   construct wrappers automatically to extract information f...Wrapper induction   construct wrappers automatically to extract information f...
Wrapper induction construct wrappers automatically to extract information f...
 
Huffman coding
Huffman codingHuffman coding
Huffman coding
 
Opinion mining and summarization
Opinion mining and summarizationOpinion mining and summarization
Opinion mining and summarization
 
Resultados de las elecciones en Bolivia
Resultados de las elecciones en BoliviaResultados de las elecciones en Bolivia
Resultados de las elecciones en Bolivia
 

Semelhante a Art Of Distributed P0

Big Data
Big DataBig Data
Big DataNGDATA
 
Distributed Computing & MapReduce
Distributed Computing & MapReduceDistributed Computing & MapReduce
Distributed Computing & MapReducecoolmirza143
 
Big Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformBig Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformGeekNightHyderabad
 
Waters Grid & HPC Course
Waters Grid & HPC CourseWaters Grid & HPC Course
Waters Grid & HPC Coursejimliddle
 
Development of resource-intensive applications in Visual C++
Development of resource-intensive applications in Visual C++Development of resource-intensive applications in Visual C++
Development of resource-intensive applications in Visual C++Andrey Karpov
 
Grid computing dis
Grid computing disGrid computing dis
Grid computing disgopishna09
 
Development of resource-intensive applications in Visual C++
Development of resource-intensive applications in Visual C++Development of resource-intensive applications in Visual C++
Development of resource-intensive applications in Visual C++PVS-Studio
 
Big Data - An Overview
Big Data -  An OverviewBig Data -  An Overview
Big Data - An OverviewArvind Kalyan
 
Databases benoitg 2009-03-10
Databases benoitg 2009-03-10Databases benoitg 2009-03-10
Databases benoitg 2009-03-10benoitg
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Blades for HPTC
Blades for HPTCBlades for HPTC
Blades for HPTCGuy Coates
 
hadoop seminar training report
hadoop seminar  training reporthadoop seminar  training report
hadoop seminar training reportSarvesh Meena
 
http://www.hfadeel.com/Blog/?p=151
http://www.hfadeel.com/Blog/?p=151http://www.hfadeel.com/Blog/?p=151
http://www.hfadeel.com/Blog/?p=151xlight
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 

Semelhante a Art Of Distributed P0 (20)

Seminar
SeminarSeminar
Seminar
 
Big Data
Big DataBig Data
Big Data
 
Distributed Computing & MapReduce
Distributed Computing & MapReduceDistributed Computing & MapReduce
Distributed Computing & MapReduce
 
Big data business case
Big data   business caseBig data   business case
Big data business case
 
Big Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformBig Data - Need of Converged Data Platform
Big Data - Need of Converged Data Platform
 
Waters Grid & HPC Course
Waters Grid & HPC CourseWaters Grid & HPC Course
Waters Grid & HPC Course
 
Development of resource-intensive applications in Visual C++
Development of resource-intensive applications in Visual C++Development of resource-intensive applications in Visual C++
Development of resource-intensive applications in Visual C++
 
Grid computing dis
Grid computing disGrid computing dis
Grid computing dis
 
Development of resource-intensive applications in Visual C++
Development of resource-intensive applications in Visual C++Development of resource-intensive applications in Visual C++
Development of resource-intensive applications in Visual C++
 
Big Data - An Overview
Big Data -  An OverviewBig Data -  An Overview
Big Data - An Overview
 
Vectorization whitepaper
Vectorization whitepaperVectorization whitepaper
Vectorization whitepaper
 
Databases benoitg 2009-03-10
Databases benoitg 2009-03-10Databases benoitg 2009-03-10
Databases benoitg 2009-03-10
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Computer architecture
Computer architectureComputer architecture
Computer architecture
 
bigdata 2.pptx
bigdata 2.pptxbigdata 2.pptx
bigdata 2.pptx
 
Blades for HPTC
Blades for HPTCBlades for HPTC
Blades for HPTC
 
hadoop seminar training report
hadoop seminar  training reporthadoop seminar  training report
hadoop seminar training report
 
http://www.hfadeel.com/Blog/?p=151
http://www.hfadeel.com/Blog/?p=151http://www.hfadeel.com/Blog/?p=151
http://www.hfadeel.com/Blog/?p=151
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
bigdata.pptx
bigdata.pptxbigdata.pptx
bigdata.pptx
 

Mais de George Ang

腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道George Ang
 
腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化George Ang
 
腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间George Ang
 
腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨George Ang
 
腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站George Ang
 
腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程George Ang
 
腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagementGeorge Ang
 
腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享George Ang
 
腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍George Ang
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍George Ang
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍George Ang
 
腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享George Ang
 
腾讯大讲堂17 性能优化不是仅局限于后台(qzone)
腾讯大讲堂17 性能优化不是仅局限于后台(qzone)腾讯大讲堂17 性能优化不是仅局限于后台(qzone)
腾讯大讲堂17 性能优化不是仅局限于后台(qzone)George Ang
 
腾讯大讲堂18 让我们戴上有色眼镜--qzone前台架构的优化分享
腾讯大讲堂18 让我们戴上有色眼镜--qzone前台架构的优化分享腾讯大讲堂18 让我们戴上有色眼镜--qzone前台架构的优化分享
腾讯大讲堂18 让我们戴上有色眼镜--qzone前台架构的优化分享George Ang
 
腾讯大讲堂19 系统优化的方向
腾讯大讲堂19 系统优化的方向腾讯大讲堂19 系统优化的方向
腾讯大讲堂19 系统优化的方向George Ang
 
腾讯大讲堂13 soso访问速度优化
腾讯大讲堂13 soso访问速度优化腾讯大讲堂13 soso访问速度优化
腾讯大讲堂13 soso访问速度优化George Ang
 
腾讯大讲堂21 搜索引擎优化(seo)简介
腾讯大讲堂21 搜索引擎优化(seo)简介腾讯大讲堂21 搜索引擎优化(seo)简介
腾讯大讲堂21 搜索引擎优化(seo)简介George Ang
 
腾讯大讲堂24 qq show2.0重构历程
腾讯大讲堂24 qq show2.0重构历程腾讯大讲堂24 qq show2.0重构历程
腾讯大讲堂24 qq show2.0重构历程George Ang
 
腾讯大讲堂25 企业级搜索托管平台介绍
腾讯大讲堂25 企业级搜索托管平台介绍腾讯大讲堂25 企业级搜索托管平台介绍
腾讯大讲堂25 企业级搜索托管平台介绍George Ang
 
腾讯大讲堂26 带宽优化之道
腾讯大讲堂26 带宽优化之道腾讯大讲堂26 带宽优化之道
腾讯大讲堂26 带宽优化之道George Ang
 

Mais de George Ang (20)

腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道
 
腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化
 
腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间
 
腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨
 
腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站
 
腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程
 
腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement
 
腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享
 
腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
 
腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享
 
腾讯大讲堂17 性能优化不是仅局限于后台(qzone)
腾讯大讲堂17 性能优化不是仅局限于后台(qzone)腾讯大讲堂17 性能优化不是仅局限于后台(qzone)
腾讯大讲堂17 性能优化不是仅局限于后台(qzone)
 
腾讯大讲堂18 让我们戴上有色眼镜--qzone前台架构的优化分享
腾讯大讲堂18 让我们戴上有色眼镜--qzone前台架构的优化分享腾讯大讲堂18 让我们戴上有色眼镜--qzone前台架构的优化分享
腾讯大讲堂18 让我们戴上有色眼镜--qzone前台架构的优化分享
 
腾讯大讲堂19 系统优化的方向
腾讯大讲堂19 系统优化的方向腾讯大讲堂19 系统优化的方向
腾讯大讲堂19 系统优化的方向
 
腾讯大讲堂13 soso访问速度优化
腾讯大讲堂13 soso访问速度优化腾讯大讲堂13 soso访问速度优化
腾讯大讲堂13 soso访问速度优化
 
腾讯大讲堂21 搜索引擎优化(seo)简介
腾讯大讲堂21 搜索引擎优化(seo)简介腾讯大讲堂21 搜索引擎优化(seo)简介
腾讯大讲堂21 搜索引擎优化(seo)简介
 
腾讯大讲堂24 qq show2.0重构历程
腾讯大讲堂24 qq show2.0重构历程腾讯大讲堂24 qq show2.0重构历程
腾讯大讲堂24 qq show2.0重构历程
 
腾讯大讲堂25 企业级搜索托管平台介绍
腾讯大讲堂25 企业级搜索托管平台介绍腾讯大讲堂25 企业级搜索托管平台介绍
腾讯大讲堂25 企业级搜索托管平台介绍
 
腾讯大讲堂26 带宽优化之道
腾讯大讲堂26 带宽优化之道腾讯大讲堂26 带宽优化之道
腾讯大讲堂26 带宽优化之道
 

Último

Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 

Último (20)

Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 

Art Of Distributed P0

  • 1. Part 0 – Answer important argument Art of Distributed ‘Why Distributed Computing’ Haytham ElFadeel - Hfadeel@acm.org Abstract Day after day distributed systems tend to appear in mainstream systems. In spite of ma- turity of distributed frameworks and tools, a lot of distributed technologies still ambi- guous for a lot of people, especially if we talks about complicated or new stuff. Lately I have been getting a lot of questions about distributed computing, especially dis- tributed computing models, and MapReduce, such as: What is MapReduce? Can Ma- pReduce fit in all situations? How can we compares it with other technologies such as Grid Computing? And what is the best solution to our situation? So I decided to write about distributed computing article into two parts. The first, about distributed compu- ting models and the differences between them. The second part, I will discuss reliability of the distrusted system and distributed storage systems. So let’s start… Introduction: After publish the part 1 of this article that entitled ‘Rethinking about the Distributed Computing’, I received a few message doubting about Distributed Computing. So I de- cide to go back and write Part 0 – Answer important argument ‘Why Distributed Com- puting’. Why Distributed Computing: A log of challenges require involving of huge amount of Data, Image if you want to crawling the whole Web, perform Monte Carlo simulation, or render movie animation, etc. Such challenges and many others cannot perform in one PC even. Usually we resort to Distributed Computing when we want deal with huge amount of data. So the question is what is the big data? Art of Distributed - Part 0: Why Distributed Computing. Haytham ElFadeel - Hfadeel@acm.org
  • 2. The Big Data - Experience with single machine: Today computers have several Gigabytes or a few Terabytes of storage system capacity. But what is ‘big data’ anyway? Gigabytes? Terabytes? Petabytes? A database on the order of 100 GB would not be considered trivially small even today, although hard drives capable of storing 10 times as much can be had for less than $100 at any computer store. The U.S. Census database included many different datasets of varying sizes, but let’s simplify a bit: 100 gigabytes is enough to store at least the basic demographic information: age, sex, income, ethnicity, language, religion, housing status, and location, packed in a 128-bit record—for every living human being on this planet. This would create a table of 6.75 billion rows and maybe 10 columns. Should that still be considered ‘big data’? It depends, on what you’re trying to do with it. More important- ly, any competent programmer could in a few hours write a simple, unoptimized appli- cation on a $500 desktop PC with minimal CPU and RAM that could crunch through that dataset and return answers to simple aggregation queries such as ‘what is the median age by sex for each country?’ with perfectly reasonable performance. To demonstrate this, I tried it with fake data, a file consisting of 6.75 billion 16-byte (i.e. 128-bit) records containing uniformly distributed random data. Since a seven-bit age field allows a maximum of 128 possible values, one bit for sex allows only two, and eight bits for country allows up to 256 option, we can calculate the median age by using a counting strategy: simply create 65,536 buckets-one for each combination of age, sex, and country-and count how many records fall into each. We find the median age by de- termining, for each sex and country group, the cumulative count over the 128 age buck- ets: the median is the bucket where the count reaches half of the total. [See figure 1]. In my tests, this algorithm was limited primarily by the speed of disk fetching: In aver- age 90-megabyte-per-second sustained read speed, shamefully underutilizing the CPU the whole time. Art of Distributed - Part 0: Why Distributed Computing. Haytham ElFadeel - Hfadeel@acm.org
  • 3. In fact, our table will fit in the memory of a single, $15,000 Dell server with 128-GB RAM. Running off in-memory data, my simple median-age-by-sex-and-country program completed in less than a minute. By such measures, I would hesitate to call this ‘big da- ta’. This application will need in average between 725-second, and 1000-second to per- form the above quire. But what if the data bigger than this, you want perform more complex quires, or what if this data stored into Database system which means every record will take only 128-bit??? Actually if you add more data or make the query more complex the time will likely to get exponential or at least multiply! So the question answer is: It’s depends on what you’re trying to do with the Data. Hard Limit: Hardware and software failures are important limit, there is no way to predict or protect your system from the failures. Even if you use a High-end Server or expensive hardware the failure is likely to occur. So in single machine you likely are going to lose your whole work! But if you distribute the work across many machines you will lose one part of the work and you can resume it into another machine. Actually the modern distributed strategies show its ability and reliable to handle the hardware and software failures, Systems and Computing Model such as: MapReduce, Dryad, Hadoop, Dynamo, Hadoop all designed to handle the unexpected failures. Of course, hardware—chiefly memory and CPU limitations—is often a major factor in software limits on dataset size. Many applications are designed to read entire datasets into memory and work with them there; a good example of this is the popular statistical computing environment R.7 Memory-bound applications naturally exhibit higher per- formance than disk-bound ones (at least insofar as the data-crunching they carry out advances beyond single-pass, purely sequential processing), but requiring all data to fit in memory means that if you have a dataset larger than your installed RAM, you’re out of luck. So in this case do you will wait until the next generation of the hardware come?! On most hardware platforms, there’s a much harder limit on memory expansion than disk expansion: the motherboard has only so many slots to fill. The problem often goes further than this, however. Like most other aspects of comput- er hardware, maximum memory capacities increase with time; 32 GB is no longer a rare configuration for a desktop workstation, and servers are frequently configured with far more than that. There is no guarantee, however, that a memory-bound application will be able to use all installed RAM. Even under modern 64-bit operating systems, many Art of Distributed - Part 0: Why Distributed Computing. Haytham ElFadeel - Hfadeel@acm.org
  • 4. applications today (e.g., R under Windows) have only 32-bit executables and are limited to 4-GB address spaces—this often translates into a 2- or 3-GB working set limitation. Finally, even where a 64-bit binary is available—removing the absolute address space limitation—all too often relics from the age of 32-bit code still pervade software, partic- ularly in the use of 32-bit integers to index array elements. Thus, for example, 64-bit versions of R (available for Linux and Mac) use signed 32-bit integers to represent lengths, limiting data frames to at most 231-1, or about 2 billion rows. Even on a 64-bit system with sufficient RAM to hold the data, therefore, a 6.75-billion-row dataset such as the earlier world census example ends up being too big for R to handle. Art of Distributed - Part 0: Why Distributed Computing. Haytham ElFadeel - Hfadeel@acm.org