Hadoop Introductory and Advanced Course 4 -- HDFS Principles and Operations
Copyright for this article is shared by the author and cnblogs. You are welcome to repost it, but unless the author agrees otherwise you must keep this notice and give a clear link to the original in a prominent place on the page. The author is Shishanyuan, blog address: http://www.rzrgm.cn/shishanyuan. This course series was written at the invitation of Shiyanlou, which deserves credit for offering a new way to learn: you can read the posts while running the hands-on labs. The course is available at https://www.shiyanlou.com/courses/237
[Note] The installation packages, test data, and code used in this series can be downloaded from Baidu Netdisk at http://pan.baidu.com/s/10PnDs (download the PDF file there).
1. Environment
The deployment node runs CentOS with the firewall and SELinux disabled. A user named shiyanlou was created, and an /app directory was created under the file system root to hold the Hadoop and other component packages. Because this directory holds the installed components, the shiyanlou user must have rwx permission on it (the usual approach is for root to create /app and then change its owner: chown -R shiyanlou:shiyanlou /app).
Hadoop environment:
- Virtual machine OS: CentOS 6.6 64-bit, single core, 1 GB RAM
- JDK: 1.7.0_55 64-bit
- Hadoop: 1.1.2
2. HDFS Principles
HDFS (Hadoop Distributed File System) is a distributed file system modeled on Google's GFS. It is highly fault-tolerant, provides high-throughput data access, and is well suited to applications with very large data sets, offering a reliable storage solution for massive amounts of data.
- High-throughput access: each HDFS block is replicated across different racks; when a client reads, HDFS serves the data from the closest, least-loaded replica. Because blocks have replicas on multiple racks, reads are not tied to a single copy, and clients can read from and write to many servers in the cluster in parallel, increasing the aggregate I/O bandwidth (a small inspection sketch follows this list).
- High fault tolerance: system failures are inevitable, so recovering data and handling faults after a failure is critical. HDFS protects data in several ways: blocks are replicated to servers in different physical locations, data is checksummed, and background scans continuously verify data consistency.
- Linear scalability: block metadata lives on the NameNode while the blocks themselves are spread across DataNodes, so the cluster can be expanded simply by adding DataNodes, without stopping the service and without manual intervention.
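To make the block and replica placement concrete, the following minimal sketch (the class name ListBlockLocations is chosen for illustration) uses the FileSystem API to ask the NameNode where each block of a given HDFS file is stored. It assumes the default Hadoop configuration is on the classpath and that the file passed as the first argument already exists in HDFS, for example the /class4/quangle.txt file created later in this course.

import java.net.URI;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        String uri = args[0];  // an HDFS path, e.g. /class4/quangle.txt
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        FileStatus status = fs.getFileStatus(new Path(uri));
        // Ask the NameNode which DataNodes hold each block of the file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
        }
    }
}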
2.1 HDFS Architecture
HDFS has a Master/Slave architecture, divided into three roles: NameNode, Secondary NameNode, and DataNode.
- NameNode: in Hadoop 1.X there is a single Master node; it manages the HDFS namespace and the block mapping information, applies the replication policy, and handles client requests.
- Secondary NameNode: assists the NameNode and offloads part of its work by periodically merging the fsimage and edits files and pushing the result back to the NameNode; in an emergency it can help recover the NameNode.
- DataNode: the Slave nodes; they store the actual data, serve block read and write requests, and report their stored blocks to the NameNode.
2.2 HDFS Read Operations
1. The client opens the file it wants to read by calling open() on a FileSystem object; for HDFS this object is an instance of DistributedFileSystem.
2. DistributedFileSystem calls the NameNode over RPC to determine the locations of the first blocks of the file. For each block the NameNode returns the addresses of all DataNodes holding a replica, sorted by their distance to the client in the cluster topology, closest first.
3. These two steps return an FSDataInputStream, which wraps a DFSInputStream that manages the DataNode and NameNode I/O. The client calls read() on this input stream.
4. The DFSInputStream, which has stored the DataNode addresses for the first blocks of the file, connects to the closest DataNode; repeated calls to read() stream the data from the DataNode back to the client.
5. When the end of a block is reached, DFSInputStream closes the connection to that DataNode and then finds the best DataNode for the next block. These operations are transparent to the client, which simply sees one continuous stream.
6. Once the client has finished reading, it calls close() on the FSDataInputStream.
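From the client's point of view, this read path is driven entirely by FileSystem.open() and the returned FSDataInputStream. The following minimal sketch (the class name SeekCat is chosen for illustration) assumes an existing HDFS text file is passed as the first argument; it streams the file to standard output and then uses seek() to jump back to the start and print it a second time, which works because FSDataInputStream supports random access within the file.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SeekCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];  // an HDFS path, e.g. /class4/quangle.txt
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        FSDataInputStream in = null;
        try {
            // open() asks the NameNode for the block locations and returns a stream
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
            // seek() repositions the stream, so the file can be read again
            in.seek(0);
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}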
2.3 HDFS Write Operations
1. The client creates a new file by calling create() on DistributedFileSystem.
2. DistributedFileSystem calls the NameNode over RPC to create a new file that has no blocks associated with it yet. Before creating it, the NameNode performs various checks, for example that the file does not already exist and that the client has permission to create it. If the checks pass, the NameNode records the new file; otherwise an IOException is thrown.
3. These two steps return an FSDataOutputStream which, as in the read case, wraps a DFSOutputStream that coordinates with the NameNode and the DataNodes. As the client writes data to the stream, DFSOutputStream splits it into small packets and appends them to an internal queue called the data queue.
4. The DataStreamer consumes the data queue. It asks the NameNode which DataNodes are best suited to store the new block; with a replication factor of 3, for example, it gets 3 DataNodes, which form a pipeline. The DataStreamer sends each packet to the first DataNode in the pipeline, which forwards it to the second DataNode, and so on.
5. DFSOutputStream also maintains an ack queue of packets waiting to be acknowledged; a packet is removed from the ack queue only when every DataNode in the pipeline has acknowledged receiving it.
6. When the client has finished writing data, it calls close() on the stream.
7. The DataStreamer flushes the remaining packets into the pipeline and waits for their acknowledgements; after receiving the last one, it notifies the NameNode that the file is complete.
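The write path is likewise driven by FileSystem.create() and the returned output stream. The following minimal sketch (the class name WriteSmallFile and the file contents are chosen for illustration) assumes the destination HDFS path is passed as the first argument; it requests a replication factor of 3 so that the pipeline described in step 4 would contain three DataNodes, although on a single-node lab cluster fewer replicas will actually be written.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class WriteSmallFile {
    public static void main(String[] args) throws Exception {
        String uri = args[0];  // destination HDFS path, e.g. /class4/hello.txt
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        FSDataOutputStream out = null;
        try {
            // create() asks the NameNode to record the new file; the second
            // argument requests a replication factor of 3 for its blocks
            out = fs.create(new Path(uri), (short) 3);
            out.write("hello hdfs\n".getBytes("UTF-8"));
        } finally {
            IOUtils.closeStream(out);
        }
    }
}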
2.4 Commonly Used HDFS Commands
- hadoop fs (file system shell)
hadoop fs -ls /                        # list the contents of a directory
hadoop fs -lsr                         # list a directory recursively
hadoop fs -mkdir /user/hadoop          # create a directory in HDFS
hadoop fs -put a.txt /user/hadoop/     # upload a local file to HDFS
hadoop fs -get /user/hadoop/a.txt /    # download an HDFS file to the local file system
hadoop fs -cp src dst                  # copy a file within HDFS
hadoop fs -mv src dst                  # move or rename a file within HDFS
hadoop fs -cat /user/hadoop/a.txt      # print the contents of a file
hadoop fs -rm /user/hadoop/a.txt       # delete a file
hadoop fs -rmr /user/hadoop/a.txt      # delete a file or directory recursively
hadoop fs -text /user/hadoop/a.txt     # print a file, decoding compressed or sequence files as text
hadoop fs -copyFromLocal localsrc dst  # similar to hadoop fs -put
hadoop fs -moveFromLocal localsrc dst  # upload a local file to HDFS and delete the local copy
- hadoop dfsadmin (administration commands)
hadoop dfsadmin -report                               # report basic file system information and statistics
hadoop dfsadmin -safemode enter | leave | get | wait  # enter, leave, query, or wait for safe mode
hadoop dfsadmin -setBalancerBandwidth 1000            # set the bandwidth (bytes per second) each DataNode may use for balancing
- hadoop fsck: check the health of HDFS files (missing, corrupt, or under-replicated blocks)
- start-balancer.sh: start the balancer to redistribute blocks evenly across DataNodes
The corresponding HDFS Java APIs are documented on the Apache Hadoop website.
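Most of the shell commands above have equivalents on the FileSystem Java API. The sketch below is a minimal, illustrative example (the class name FsShellEquivalents and the /user/hadoop paths are placeholders, not part of the course material); it creates a directory, uploads a local file, lists the directory, and deletes the file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsShellEquivalents {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // the default file system configured in core-site.xml (HDFS on a cluster)
        FileSystem fs = FileSystem.get(conf);

        // hadoop fs -mkdir /user/hadoop
        fs.mkdirs(new Path("/user/hadoop"));

        // hadoop fs -put a.txt /user/hadoop/
        fs.copyFromLocalFile(new Path("a.txt"), new Path("/user/hadoop/a.txt"));

        // hadoop fs -ls /user/hadoop
        for (FileStatus status : fs.listStatus(new Path("/user/hadoop"))) {
            System.out.println(status.getPath() + "\t" + status.getLen());
        }

        // hadoop fs -rm /user/hadoop/a.txt (second argument: recursive delete)
        fs.delete(new Path("/user/hadoop/a.txt"), false);
    }
}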
3. Test Case 1
3.1 Test Case 1 Content
Compile and run example 3-2 from Hadoop: The Definitive Guide on the Hadoop cluster to read the contents of an HDFS file.
3.2 Code to Run
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            // open the HDFS file and copy its bytes to standard output
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
3.3 Procedure
3.3.1 Create the Code Directories
Start Hadoop with the following commands (after startup, jps should list the NameNode, SecondaryNameNode, DataNode, JobTracker, and TaskTracker processes):
cd /app/hadoop-1.1.2/bin
./start-all.sh
In the /app/hadoop-1.1.2 directory, create the myclass and input directories with the following commands:
cd /app/hadoop-1.1.2
mkdir myclass
mkdir input
3.3.2 Create the Example File and Upload It to HDFS
Go to the /app/hadoop-1.1.2/input directory and create the quangle.txt file:
cd /app/hadoop-1.1.2/input
touch quangle.txt
vi quangle.txt
with the following content:
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
Create the /class4 directory in HDFS with the following commands:
hadoop fs -mkdir /class4
hadoop fs -ls /
(To use the hadoop command directly, add the Hadoop bin directory, /app/hadoop-1.1.2/bin, to the PATH environment variable.)
Upload the example file to the /class4 folder in HDFS:
cd /app/hadoop-1.1.2/input
hadoop fs -copyFromLocal quangle.txt /class4/quangle.txt
hadoop fs -ls /class4
3.3.3 Configure the Local Environment
Edit hadoop-env.sh in the /app/hadoop-1.1.2/conf directory as shown below:
cd /app/hadoop-1.1.2/conf
sudo vi hadoop-env.sh
Append /app/hadoop-1.1.2/myclass to the HADOOP_CLASSPATH variable; hadoop-env.sh is sourced by the hadoop launcher scripts, so the new value takes effect the next time the hadoop command runs:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/app/hadoop-1.1.2/myclass
3.3.4 Write the Code
Go to the /app/hadoop-1.1.2/myclass directory and create the FileSystemCat.java source file with the following commands:
cd /app/hadoop-1.1.2/myclass/
vi FileSystemCat.java
Enter the code listed in section 3.2.
3.3.5 Compile the Code
In the /app/hadoop-1.1.2/myclass directory, compile the code with the following command:
javac -classpath ../hadoop-core-1.1.2.jar FileSystemCat.java
ls
3.3.6 Read the HDFS File with the Compiled Code
Read the contents of /class4/quangle.txt in HDFS with the following command; the four lines of quangle.txt should be printed to the console:
hadoop FileSystemCat /class4/quangle.txt
4. Test Case 2
4.1 Test Case 2 Content
Generate a text file in the local file system (at least 120 bytes long), then write a program that reads the file and writes bytes 101-120 of its content to HDFS as a new file.
4.2 Code to Run
Note: remove or convert any non-ASCII comments before compiling, otherwise javac may report encoding errors!
import java.io.File;
import java.io.FileInputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class LocalFile2Hdfs {
    public static void main(String[] args) throws Exception {

        // source (local) and destination (HDFS) paths from the command line
        String local = args[0];
        String uri = args[1];

        FileInputStream in = null;
        OutputStream out = null;
        Configuration conf = new Configuration();
        try {
            // open the local source file
            in = new FileInputStream(new File(local));

            // open the destination file in HDFS
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            out = fs.create(new Path(uri), new Progressable() {
                @Override
                public void progress() {
                    System.out.println("*");
                }
            });

            // skip the first 100 bytes
            in.skip(100);
            byte[] buffer = new byte[20];

            // read the next 20 bytes (positions 101-120) into the buffer
            int bytesRead = in.read(buffer);
            if (bytesRead >= 0) {
                out.write(buffer, 0, bytesRead);
            }
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}
4.3 Procedure
4.3.1 Write the Code
Go to the /app/hadoop-1.1.2/myclass directory and create the LocalFile2Hdfs.java source file with the following command:
vi LocalFile2Hdfs.java
Enter the code listed in section 4.2.
4.3.2 Compile the Code
In the /app/hadoop-1.1.2/myclass directory, compile the code with the following command:
javac -classpath ../hadoop-core-1.1.2.jar LocalFile2Hdfs.java
4.3.3 Create the Test File
Go to the /app/hadoop-1.1.2/input directory and create the local2hdfs.txt file:
cd /app/hadoop-1.1.2/input/
vi local2hdfs.txt
with the following content:
Washington (CNN) -- Twitter is suing the U.S. government in an effort to loosen restrictions on what the social media giant can say publicly about the national security-related requests it receives for user data.
The company filed a lawsuit against the Justice Department on Monday in a federal court in northern California, arguing that its First Amendment rights are being violated by restrictions that forbid the disclosure of how many national security letters and Foreign Intelligence Surveillance Act court orders it receives -- even if that number is zero.
Twitter vice president Ben Lee wrote in a blog post that it's suing in an effort to publish the full version of a "transparency report" prepared this year that includes those details.
The San Francisco-based firm was unsatisfied with the Justice Department's move in January to allow technological firms to disclose the number of national security-related requests they receive in broad ranges.
4.3.4 Upload File Content to HDFS with the Compiled Code
Use the following commands to read bytes 101-120 of local2hdfs.txt and write them to HDFS as a new file:
cd /app/hadoop-1.1.2/input
hadoop LocalFile2Hdfs local2hdfs.txt /class4/local2hdfs_part.txt
4.3.5 Verify the Result
Check the contents of local2hdfs_part.txt with the following command; it should contain the 20 bytes starting at byte 101 of the source file:
hadoop fs -cat /class4/local2hdfs_part.txt
5. Test Case 3
5.1 Test Case 3 Content
The reverse of Test Case 2: store a text file (at least 120 bytes long) in HDFS, then write a program that reads the file and writes bytes 101-120 of its content to the local file system as a new file.
5.2 Program Code
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class Hdfs2LocalFile {
    public static void main(String[] args) throws Exception {

        // source (HDFS) and destination (local) paths from the command line
        String uri = args[0];
        String local = args[1];

        FSDataInputStream in = null;
        OutputStream out = null;
        Configuration conf = new Configuration();
        try {
            // open the source file in HDFS
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            in = fs.open(new Path(uri));
            // open the destination file in the local file system
            out = new FileOutputStream(local);

            // skip the first 100 bytes, then read bytes 101-120 into the buffer
            byte[] buffer = new byte[20];
            in.skip(100);
            int bytesRead = in.read(buffer);
            if (bytesRead >= 0) {
                out.write(buffer, 0, bytesRead);
            }
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}
5.3 Procedure
5.3.1 Write the Code
Go to the /app/hadoop-1.1.2/myclass directory and create the Hdfs2LocalFile.java source file with the following commands:
cd /app/hadoop-1.1.2/myclass/
vi Hdfs2LocalFile.java
Enter the code listed in section 5.2.
5.3.2 Compile the Code
In the /app/hadoop-1.1.2/myclass directory, compile the code with the following command:
javac -classpath ../hadoop-core-1.1.2.jar Hdfs2LocalFile.java
5.3.3 Create the Test File
Go to the /app/hadoop-1.1.2/input directory and create the hdfs2local.txt file:
cd /app/hadoop-1.1.2/input/
vi hdfs2local.txt
with the following content:
The San Francisco-based firm was unsatisfied with the Justice Department's move in January to allow technological firms to disclose the number of national security-related requests they receive in broad ranges.
"It's our belief that we are entitled under the First Amendment to respond to our users' concerns and to the statements of U.S. government officials by providing information about the scope of U.S. government surveillance -- including what types of legal process have not been received," Lee wrote. "We should be free to do this in a meaningful way, rather than in broad, inexact ranges."
From the /app/hadoop-1.1.2/input directory, upload the file to the /class4/ folder in HDFS:
hadoop fs -copyFromLocal hdfs2local.txt /class4/hdfs2local.txt
hadoop fs -ls /class4/
5.3.4 Write File Content from HDFS to the Local File System with the Compiled Code
Use the following command to read bytes 101-120 of hdfs2local.txt and write them to the local file system as a new file:
hadoop Hdfs2LocalFile /class4/hdfs2local.txt hdfs2local_part.txt
5.3.5 Verify the Result
Check the contents of hdfs2local_part.txt with the following command; it should contain the 20 bytes starting at byte 101 of the HDFS source file:
cat hdfs2local_part.txt
