<p>'LaHonda' below refers to the meeting of John, Gordon, and Stack at LaHonda Cafe on 16th St. on August 8th, 2006.</p>
<ul>
<li>Leave off 9.2 GZIP extra fields. It is a big section on implementing an option that has little to do with WARCing. AGREED at LaHonda.</li>
<li>But we need to mark gzipped files as being WARC, i.e. that there is one GZIP member per resource. This is useful so readers know how to invoke GZIP (whether it has to be done once to get at any record, or once per record). Suggest adding a GZIP extra field in the HEAD of the GZIP member that says 'WARC' (ARC has such a thing currently). NOT NECESSARY per LaHonda meeting.</li>
<li>IP-Address for a dns resource is the DNS server. Add a note to this effect in 8.2 DNS.</li>
<li>Section 6 is truncated: text is missing. What was intended here? SEE ISO DOC.</li>
<li>In-line ANVL definition (from Kunze). Related: can labels have CTLs such as CRLF (they shouldn't)? When it says 'control-chars', does this include UNICODE control characters (it should)? CHAR is described as ASCII/UTF-8, but they are not the same (should be UTF-8). ANVL OR NOT STILL UP IN AIR AFTER LaHonda. Postpone to 0.11 revision.</li>
<li>Fix examples. Use output of experimental ARC Writer.</li>
<li>Fix ambiguity in spec. pertaining to 'smallest possible anvl-fields' not cited by Mads Alhof Kristiansen in <a href="ftp://ftp.diku.dk/diku/semantics/papers/D-548.pdf">Digital Preservation using the WARC File Format</a>.</li>
</ul>
<h2>Open Issues</h2>
<h3>Drop response record type</h3>
<p><code>resource</code> is sufficient.
Let mimetype distinguish whether a capture has response headers or not (as per the comment at the end of <i>8.1 HTTP and HTTPS</i>, which allows that if there are no response headers, the resource record type and page mimetype are used rather than the response type plus a mimetype of message/http: the difference in record types is not needed to distinguish between the two types of capture).</p>
<p>Are there other capture methods that would require a response record and that don't have a mimetype that includes response headers and content? SMTP has a rich MIME set to describe responses. Its request is pretty much unrecordable. NNTP and FTP are similar. Because of the rich MIME, there is no need for a special response type here.</p>
<p>Related: do we need the <code>request</code> record? Does it only make sense for HTTP?</p>
<p>This proposal is contentious. Gordon drew a scenario where response would be needed to distinguish local from remote capture if an archiving institution purposefully archived without recording headers, or if the payload itself was an archived record. In opposition, it was suggested that should an institution choose to capture in this 'unusual' mode, crawl metadata could be consulted to resolve confusion about how the capture was done (to be further investigated; in general, the definition of record types still needs work).</p>
<h3>subject-uri</h3>
<p>The ISO revision suggests that the positional parameter <code>subject-uri</code> be renamed. Suggest <code>record-url</code>.</p>
<h3>Other issues</h3>
<ul>
<li>Should we allow freeform creation of custom Named Fields if they have a MIME-like 'X-' or some such prefix?</li>
<li>Nothing on header-line encoding (Section 11 says UTF-8). For completeness, should be US-ASCII or UTF-8, no control-chars (especially CR or LF), etc.</li>
<li><code>warcinfo</code>
<ul>
<li>What for a scheme?
Using UUID as per G's suggestion.</li>
<li>Also, how to populate a description of the crawl into warcinfo? A 'Documentation' <code>Named Field</code> with a list of URLs that can be assumed to exist somewhere in the current WARC set (we'd have to make the crawler go get them at the start of a crawl).</li>
<li>I don't want to repeat the crawl description for every WARC. How to have this warcinfo point at an original? <code>related-record-id</code> seems insufficient.</li>
<li>If the crawler config. changes, can I just write a warcinfo with the differences? How to express them? Or is this better as metadata about a warcinfo?</li>
<li>In the past we used to get the filename from this URL header field when we were unsure of the filename or it was unavailable (we're reading a stream). We won't be able to do that with a UUID for the URL. So, introducing a new (optional) warcinfo Named Field 'Filename' that will be used when warcinfo is put at the start of a file. Allow warcinfo to have a named parameter 'Filename'?</li>
</ul>
</li>
<li><code>revisit</code>
<ul>
<li>What to write? Use a description field or just expect this info to be present in the warcinfo? The example has a request header (inside XML). Better to use an associated <code>request</code> record for this kind of info?</li>
<li><code>Related-Record-ID</code> (RRID) of the original is likely an onerous requirement. Envisioning an implementation where we'd write <code>revisit</code> records, we'd write such a record where content was judged the same or where the date had not changed since the last fetch. If we're to write the RRID, then we'd have to maintain a table keyed by URL with a value of page hash or last-modified date plus the associated RRID (the actual RRID URL, not a hash).</li>
</ul>
</li>
<li>Should we allow a <code>Description</code> <code>Named Field</code>? E.g. I add an order file as a metadata record and associate it with a <code>warcinfo</code> record. The Description field could say "This is Heritrix Order file". Same for seeds.
Alternative is custom XML packaging (a Scheme could describe fields such as the 'order' file) or ANVL packaging using ANVL 'comments'.</li>
<li>Section 11: why was it we said we don't need a parameter or explicit subtype for the special gzip WARC format? I don't remember. A reader needs to know when it's reading a stream. A client would like to know so that it writes the stream to disk with the right suffix. Recap. (Perhaps it was looking at the MAGIC bytes: if it starts with GZIP MAGIC and includes extra fields that denote it WARC, that's sufficient?)</li>
<li>Section 7, on truncation: in 7.1, suggest values ('time', 'length') but allow free-form description? Leave off the 'superior method of indicating truncation' paragraph. This qualifier could be added to all sections of the doc: that a subsequent revision of any aspect of the doc. will be superior. Rather than <code>End-Length</code>, like MIME, the last record could have <code>Segment-Number-Total</code>, a count of all segments that make up the complete record.</li>
</ul>
<p>From LaHonda, discussion of the <code>revisit</code> type. The definition was tightened somewhat by saying revisit is used when you choose not to store the capture. It was thought possible that it NOT require a pointer back to an original. Suggested it might have a similarity judgment header, <code>similiarity-value</code>, with values between 0 and 1. It might also have <code>analysis-method</code> and <code>description</code>. Possible methods discussed included: URI same, length same, hash of content same, judgement based on content of HTTP HEAD request, etc. Possible payloads might be: nothing, a diff, the hash obtained, etc.</p>
<h2>Unimplemented</h2>
<ul>
<li>Record Segmentation (4.8 <code>continuation</code> record type and the 5.2 <code>Segment-*</code> Named Parameters). Future TODO.</li>
<li>4.7 <code>conversion</code> type. Future TODO.</li>
</ul>
<h2>TODOs</h2>
<ul>
<li>Unit tests using <code>multipart/*</code> (JavaMail) reading and writing records?
Try <code>record-id</code> as part boundary.</li>
<li>Performance: need to add record-based buffering. GZIP'd streams have some buffering because of the deflater but could probably do with more.</li>
</ul>
<HR>
Copyright © 2003-2007 Internet Archive. All Rights Reserved.
</BODY>
</HTML>