快速目錄和文件遍歷
遍歷一個目錄或者磁盤中的所有內容,常用的算法有兩種:深度優先和廣度優先。具體實現的時候,每種算法都可以有多種實現,一般來說,有遞歸和非遞歸兩種。因為工作需要,所以bigtall實現了幾種算法的對比。
首先實現的是傳統的深度優先的遞歸遍歷算法,因為非遞歸算法和廣度優先比較雷同所以沒有實現。其次實現的是廣度優先的遞歸和非遞歸算法,其中非遞歸廣度算法采用一個先進先出的queue存儲目錄路徑結果。最后實現的是基于非遞歸廣度算法上實現的多線程搜索算法,因為現在已經是多核時代,不用多線程好像說不過去了。
測試代碼附在文后。bigtall用于測試的環境是vista sp2,文件系統是ntfs,cpu intel T2300 1.66G雙核,內存4G。測試所用的目錄為兩個,一個是含有2148個對象的d:\temp目錄,另外一個是整個d盤,含54萬個文件。
運行結果:
D:\work\delphi\findfirst>fastfind d:\temp 16 1
algorithm, files, total time(ms), repeat, average time(ms)
Deep walk, 2148, 46, 1, 46
Wide Recurse, 2148, 78, 1, 78
Wide Cycle, 2148, 16, 1, 16
Thread 2 walk, 2148, 31, 1, 31
Thread 4 walk, 2148, 61, 1, 61
Thread 6 walk, 2148, 93, 1, 93
Thread 8 walk, 2148, 124, 1, 124
Thread 10 walk, 2148, 156, 1, 156
Thread 12 walk, 2148, 186, 1, 186
Thread 14 walk, 2148, 218, 1, 218
Thread 16 walk, 2148, 249, 1, 249D:\work\delphi\findfirst>fastfind d:\temp 16 2
algorithm, files, total time(ms), repeat, average time(ms)
Deep walk, 2148, 93, 2, 46
Wide Recurse, 2148, 93, 2, 46
Wide Cycle, 2148, 63, 2, 31
Thread 2 walk, 2148, 61, 2, 30
Thread 4 walk, 2148, 124, 2, 62
Thread 6 walk, 2148, 186, 2, 93
Thread 8 walk, 2148, 250, 2, 125
Thread 10 walk, 2148, 311, 2, 155
Thread 12 walk, 2148, 373, 2, 186
Thread 14 walk, 2148, 437, 2, 218
Thread 16 walk, 2148, 514, 2, 257D:\work\delphi\findfirst>fastfind d:\temp 16 4
algorithm, files, total time(ms), repeat, average time(ms)
Deep walk, 2148, 171, 4, 42
Wide Recurse, 2148, 203, 4, 50
Wide Cycle, 2148, 109, 4, 27
Thread 2 walk, 2148, 139, 4, 34
Thread 4 walk, 2148, 249, 4, 62
Thread 6 walk, 2148, 373, 4, 93
Thread 8 walk, 2148, 500, 4, 125
Thread 10 walk, 2148, 623, 4, 155
Thread 12 walk, 2148, 747, 4, 186
Thread 14 walk, 2148, 874, 4, 218
Thread 16 walk, 2148, 997, 4, 249D:\work\delphi\findfirst>fastfind d:\temp 16 8
algorithm, files, total time(ms), repeat, average time(ms)
Deep walk, 2148, 264, 8, 33
Wide Recurse, 2148, 343, 8, 42
Wide Cycle, 2148, 219, 8, 27
Thread 2 walk, 2148, 248, 8, 31
Thread 4 walk, 2148, 499, 8, 62
Thread 6 walk, 2148, 747, 8, 93
Thread 8 walk, 2148, 999, 8, 124
Thread 10 walk, 2148, 1248, 8, 156
Thread 12 walk, 2148, 1496, 8, 187
Thread 14 walk, 2148, 1748, 8, 218
Thread 16 walk, 2148, 1995, 8, 249D:\work\delphi\findfirst>fastfind d:\ 16 1
algorithm, files, total time(ms), repeat, average time(ms)
Deep walk, 545519, 159447, 1, 159447
Wide Recurse, 545519, 15678, 1, 15678
Wide Cycle, 161792, 2042, 1, 2042
Thread 2 walk, 545519, 6552, 1, 6552
Thread 4 walk, 545519, 6771, 1, 6771
Thread 6 walk, 545519, 6941, 1, 6941
Thread 8 walk, 545519, 6753, 1, 6753
Thread 10 walk, 545519, 7066, 1, 7066
Thread 12 walk, 545519, 7036, 1, 7036
Thread 14 walk, 545519, 7691, 1, 7691
Thread 16 walk, 545519, 7065, 1, 7065D:\work\delphi\findfirst>fastfind d:\ 16 2
algorithm, files, total time(ms), repeat, average time(ms)
Deep walk, 545519, 21277, 2, 10638
Wide Recurse, 545519, 30201, 2, 15100
Wide Cycle, 161792, 3915, 2, 1957
Thread 2 walk, 545519, 13104, 2, 6552
Thread 4 walk, 545519, 13836, 2, 6918
Thread 6 walk, 545519, 13883, 2, 6941
Thread 8 walk, 545519, 14211, 2, 7105
Thread 10 walk, 545519, 13978, 2, 6989
Thread 12 walk, 545519, 14788, 2, 7394
Thread 14 walk, 545519, 15975, 2, 7987
Thread 16 walk, 545519, 16988, 2, 8494D:\work\delphi\findfirst>fastfind d:\ 16 4
algorithm, files, total time(ms), repeat, average time(ms)
Deep walk, 545519, 42010, 4, 10502
Wide Recurse, 545519, 64584, 4, 16146
Wide Cycle, 161792, 7814, 4, 1953
Thread 2 walk, 545519, 26317, 4, 6579
Thread 4 walk, 545519, 27205, 4, 6801
Thread 6 walk, 545519, 27487, 4, 6871
Thread 8 walk, 545519, 27877, 4, 6969
Thread 10 walk, 545519, 32994, 4, 8248
Thread 12 walk, 545519, 49841, 4, 12460
Thread 14 walk, 545519, 72072, 4, 18018
Thread 16 walk, 545519, 68047, 4, 17011D:\work\delphi\findfirst>fastfind d:\ 16 8
algorithm, files, total time(ms), repeat, average time(ms)
Deep walk, 545519, 106392, 8, 13299
Wide Recurse, 545519, 124331, 8, 15541
Wide Cycle, 161792, 15240, 8, 1905
Thread 2 walk, 545519, 53352, 8, 6669
Thread 4 walk, 545519, 64303, 8, 8037
Thread 6 walk, 545519, 132849, 8, 16606
Thread 8 walk, 545520, 156951, 8, 19618
Thread 10 walk, 545520, 67907, 8, 8488
Thread 12 walk, 545520, 59747, 8, 7468
Thread 14 walk, 545520, 62946, 8, 7868
Thread 16 walk, 545520, 102616, 8, 12827
結果分析:
- 深度優先的遍歷速度和廣度優先相比,占用資源最少,在早期的代碼中,廣度優先的速度要比深度優先的速度快,但是這里的結果卻正好相反,估計是TList存取速度的問題,下次有機會應該修改一下。
- 非遞歸的廣度優先速度非常的快,但是在之前沒有用自定義的TSimpleQuery之前,速度比廣度遞歸還要慢,說明系統的TQuery實現比較慢。而且本算法性能比任何其他算法都要高。
- 在多線程遍歷中,12個線程速度最快,但是在文件數目較少的情況下,線程越多遍歷速度越慢。
- 這里的算法都是通用的算法,如果針對ntfs文件系統,更快的遍歷算法是讀取windows內部的入口數據,但是它不算是一種性價比最高的算法。
由此我們得到結論:
- 單線程非遞歸的廣度優先遍歷算法速度最快。
要感謝 邢少網友的留言,因此我們要深入探討一下這個測試中表現出的問題:
1. 為什么遞歸不如非遞歸快
遞歸算法可以簡化算法的復雜度,但是會占用大量的棧(stack)空間,同時也會產生大量的函數調用代碼。因為程序中使用堆(heap)和棧(stack)的方法有本質的不同,可利用的最大空間也不一樣,分配棧空間的方法也比分配堆空間的方法耗費要大些。并且遞歸很容易引起“堆棧下溢”。這就是非遞歸要比遞歸快的原因。
2.為什么深度優先算法不如廣度優先算法快
windows搜索文件的方法必定是這樣:FindFirstFile,FindNextFile,FindClose的一個次序,這里不僅用到了用戶變量WIN32_FIND_DATA,也用到了handle。這就意味著操作系統需要為搜索分配相應的資源,如果是深度優先的搜索,遞歸過程會在調用findClose之前進行,換句話說,為某個目錄分配的搜索資源要在這個目錄所有子目錄搜索完畢之后才能釋放。而廣度優先的搜索則是在本次搜索的FindClose之后才開始深入搜索,其耗費的系統資源當然要比深度優先搜索要少得多了。因此,廣度優先的搜索比深度優先搜索要快也就顯而易見了。
3.為什么多線程不如單線程快
多線程一定要比單線程快,但是在操作IO的時候,大部分IO設備是不支持并行存取的。存取磁盤的時候,每次只允許處理一個請求。因此,用多線程來進行文件搜索,反而會造成大量的互斥操作,反而影響了速度。
附加代碼如下:
算法代碼
unit Traversal;
interface
uses Classes, Windows, SyncObjs, Generics.Collections;
type
TSimpleQueue = class
private
FList : array of string;
FCapacity, FRead, FWrite:Integer; // read 當前讀取,write當前可寫
function GetCount:Integer;
public
constructor Create();
destructor Destroy; override;
procedure Enqueue(const Value: string);
function Dequeue: string;
procedure Clear;
property Count: Integer read GetCount;
end;
TSimpleList = class
private
FList : array of string;
FCapacity, FWrite:Integer; // write當前可寫
function GetCount:Integer;
public
constructor Create();
destructor Destroy; override;
procedure Add(const Value: string);
procedure Clear;
property Count: Integer read GetCount;
public type
TSimpleListEnumerator = class
private
FOwner : TSimpleList;
FIndex :Integer;
public
function MoveNext:boolean;
function GetCurrent:string;
property Current:string read GetCurrent;
end;
function GetEnumerator():TSimpleListEnumerator;
end;
TDeepFirstTraversal = class
public
class function Walk(const path:string):Integer; static;
end;
TWideFirstTraversal = class
public
class function Walk(const path:string):Integer; static;
end;
TWideCycleTraversal = class
public
class function Walk(const path:string):Integer; static;
end;
TMultiThreadTraversal = class
public type
TWalkThread = class(TThread)
private
class var WaitCount:Integer; // 隊列空等待的線程數量
class var count:Integer; // 還在運行的線程數量
class var FileCount:Integer; // 返回文件個數
class var queue:TSimpleQueue;
class var critical : TCriticalSection; // 用于同步queue存取
protected
/// <summary>
/// 隊列空則等待,獲取成功則返回true,隊列全空則false
/// </summary>
function GetWait(out value:string):boolean;
function Put(const value:string):boolean;
procedure Execute; override;
end;
public
class function Walk(const path:string; threads:Integer):Integer; static;
end;
implementation
uses SysUtils;
{ TDeepFirstTraversal }
function IsValidDir(const ffd : WIN32_FIND_DATA):Boolean;
var
special:boolean;
begin
Result := ((ffd.dwFileAttributes and FILE_ATTRIBUTE_DIRECTORY) <> 0);
if Result then begin
special := (ffd.cFileName[0] = '.') and (
(ffd.cFileName[1] = #0) or (
(ffd.cFileName[1] = '.') and (ffd.cFileName[2] = #0)
)
);
Result := not special;
end;
end;
class function TDeepFirstTraversal.Walk(const path: string):Integer;
function DoWalk(const path:string):Integer;
var
ffd : WIN32_FIND_DATA;
handle : THandle;
pattern : string;
begin
Result := 0;
pattern := path + '\*.*';
handle := Windows.FindFirstFile(PChar(pattern), ffd);
if INVALID_HANDLE_VALUE <> handle then begin
repeat
Inc(Result);
// 判斷是否目錄
if IsValidDir(ffd) then begin
Result := Result + DoWalk(path + '\' + ffd.cFileName);
end;
until not Windows.FindNextFile(handle, ffd);
Windows.FindClose(handle);
end;
end;
begin
Result := DoWalk(ExcludeTrailingPathDelimiter(path));
end;
{ TWideFirstTraversal }
class function TWideFirstTraversal.Walk(const path: string):Integer;
function DoWalk(const path:string):Integer;
var
ffd : WIN32_FIND_DATA;
handle : THandle;
pattern : string;
dirs: TSimpleList;
d : string;
begin
Result := 0;
dirs := TSimpleList.Create;
pattern := path + '\*.*';
handle := Windows.FindFirstFile(PChar(pattern), ffd);
if INVALID_HANDLE_VALUE <> handle then begin
repeat
Inc(Result);
// 判斷是否目錄
if IsValidDir(ffd) then begin
dirs.Add(path + '\' + ffd.cFileName);
end;
until not Windows.FindNextFile(handle, ffd);
Windows.FindClose(handle);
end;
for d in dirs do
Result := Result + DoWalk(d);
dirs.Clear;
FreeAndNil(dirs);
end;
begin
Result := DoWalk(ExcludeTrailingPathDelimiter(path));
end;
{ TWideCycleTraversal }
class function TWideCycleTraversal.Walk(const path: string): Integer;
function DoWalk(const path:string; const queue : TSimpleQueue):Integer;
var
ffd : WIN32_FIND_DATA;
handle : THandle;
pattern : string;
d : string;
begin
Result := 0;
pattern := path + '\*.*';
handle := Windows.FindFirstFile(PChar(pattern), ffd);
if INVALID_HANDLE_VALUE <> handle then begin
repeat
Inc(Result);
// 判斷是否目錄
if IsValidDir(ffd) then begin
queue.Enqueue(path + '\' + ffd.cFileName);
end;
until not Windows.FindNextFile(handle, ffd);
Windows.FindClose(handle);
end;
end;
var
queue : TSimpleQueue;
begin
queue := TSimpleQueue.Create;
Result := DoWalk(ExcludeTrailingPathDelimiter(path), queue);
while queue.Count > 0 do
Result := Result + DoWalk(queue.Dequeue, queue);
queue.Clear;
FreeAndNil(queue);
end;
{ TMultiThreadTraversal }
class function TMultiThreadTraversal.Walk(const path: string; threads:Integer):Integer;
var
t1: array of TWalkThread;
I: Integer;
begin
Result := 0;
TMultiThreadTraversal.TWalkThread.WaitCount := threads;
TMultiThreadTraversal.TWalkThread.count := 0;
TMultiThreadTraversal.TWalkThread.queue := TSimpleQueue.Create();
TMultiThreadTraversal.TWalkThread.critical := TCriticalSection.Create();
try
TMultiThreadTraversal.TWalkThread.Count := 0;
TMultiThreadTraversal.TWalkThread.FileCount := 0;
SetLength(t1, threads);
for I := 0 to threads - 1 do begin
t1[I] := TWalkThread.Create(true);
t1[i].FreeOnTerminate := false;
end;
// 加入根目錄到堆棧
TMultiThreadTraversal.TWalkThread.queue.Enqueue(ExcludeTrailingPathDelimiter(path));
// 啟動所有線程
for I := 0 to threads - 1 do begin
t1[i].Resume;
Sleep(1);
end;
// 等待線程結束
for I := 0 to threads - 1 do begin
t1[i].WaitFor();
t1[i].Free;
end;
finally
TMultiThreadTraversal.TWalkThread.queue.Clear;
FreeAndNil(TMultiThreadTraversal.TWalkThread.queue);
FreeAndNil(TMultiThreadTraversal.TWalkThread.critical);
end;
Result := TMultiThreadTraversal.TWalkThread.FileCount;
end;
{ TMultiThreadTraversal.TWalkThread }
procedure TMultiThreadTraversal.TWalkThread.Execute;
function DoWalk(const path:string):Integer;
var
ffd : WIN32_FIND_DATA;
handle : THandle;
pattern : string;
d : string;
begin
Result := 0;
pattern := path + '\*.*';
handle := Windows.FindFirstFile(PChar(pattern), ffd);
if INVALID_HANDLE_VALUE <> handle then begin
repeat
Inc(Result);
// 判斷是否目錄
if IsValidDir(ffd) then begin
Put(path + '\' + ffd.cFileName);
end;
until not Windows.FindNextFile(handle, ffd);
Windows.FindClose(handle);
end;
end;
var
path:string;
i : Integer;
begin
while GetWait(path) do begin
// 增加線程運行計數
InterlockedExchangeAdd(TMultiThreadTraversal.TWalkThread.count, 1);
I := DoWalk(path);
InterlockedExchangeAdd(TMultiThreadTraversal.TWalkThread.FileCount, i);
// 運行結束則減少計數器
InterlockedExchangeAdd(TMultiThreadTraversal.TWalkThread.count, -1);
end;
end;
function TMultiThreadTraversal.TWalkThread.GetWait(out value: string): boolean;
begin
Result := true;
InterlockedExchangeAdd(TMultiThreadTraversal.TWalkThread.WaitCount, -1);
while true do begin
while TMultiThreadTraversal.TWalkThread.queue.Count = 0 do begin
Sleep(1);
if (TMultiThreadTraversal.TWalkThread.WaitCount = 0)
and (TMultiThreadTraversal.TWalkThread.Count = 0)
then begin
//InterlockedExchangeAdd(TMultiThreadTraversal.TWalkThread.WaitCount, 1);
Exit(False);
end;
end;
if TMultiThreadTraversal.TWalkThread.critical.TryEnter then begin
try
if TMultiThreadTraversal.TWalkThread.queue.Count <> 0 then begin
value := TMultiThreadTraversal.TWalkThread.queue.Dequeue;
Exit(true);
end;
finally
InterlockedExchangeAdd(TMultiThreadTraversal.TWalkThread.WaitCount, 1);
TMultiThreadTraversal.TWalkThread.critical.Leave;
end;
end;
end;
end;
function TMultiThreadTraversal.TWalkThread.Put(const value: string): boolean;
begin
Result := true;
TMultiThreadTraversal.TWalkThread.critical.Enter;
try
TMultiThreadTraversal.TWalkThread.queue.enqueue(value);
finally
TMultiThreadTraversal.TWalkThread.critical.Leave;
end;
end;
{ TSimpleQueue }
procedure TSimpleQueue.Clear;
begin
FRead := 0;
FWrite := 0;
end;
constructor TSimpleQueue.Create;
begin
FCapacity := 1024 * 16;
SetLength(FList, FCapacity);
FRead := 0;
FWrite := 0;
end;
function TSimpleQueue.Dequeue: string;
begin
if FRead = FWrite then
raise ERangeError.Create('no element');
Result := FList[FRead];
FRead := (FRead + 1) mod FCapacity;
end;
destructor TSimpleQueue.Destroy;
begin
inherited;
end;
procedure TSimpleQueue.Enqueue(const Value: string);
var
n : Integer;
begin
n := (FWrite + 1) mod FCapacity;
if n = FRead then
raise ERangeError.Create('full');
FList[FWrite] := Value;
FWrite := n;
end;
function TSimpleQueue.GetCount: Integer;
begin
Result := FWrite - FRead;
end;
{ TSimpleList }
procedure TSimpleList.Add(const Value: string);
begin
if FWrite = FCapacity then
raise ERangeError.Create('full');
FList[FWrite] := value;
Inc(FWrite);
end;
procedure TSimpleList.Clear;
begin
FWrite := 0;
end;
constructor TSimpleList.Create;
begin
FCapacity := 32 * 1024;
SetLength(FList, FCapacity);
FWrite := 0;
end;
destructor TSimpleList.Destroy;
begin
inherited;
end;
function TSimpleList.GetCount: Integer;
begin
Result := FWrite;
end;
function TSimpleList.GetEnumerator: TSimpleListEnumerator;
begin
Result := TSimpleListEnumerator.Create;
Result.FOwner := self;
Result.FIndex := -1;
end;
{ TSimpleList.TSimpleListEnumerator }
function TSimpleList.TSimpleListEnumerator.GetCurrent: string;
begin
if (FIndex <= -1) or (FIndex >= FOwner.FWrite) then
raise ERangeError.Create('position error');
Result := FOwner.FList[FIndex];
end;
function TSimpleList.TSimpleListEnumerator.MoveNext: boolean;
begin
Inc(FIndex);
Result := FIndex < FOwner.FWrite;
end;
end.

公眾號:老翅寒暑

浙公網安備 33010602011771號