
Re: "Damaged tar archive" for perl-5.32.2 on ftp callsfromCPANmirrors

From: Deven T. Corzine
Date: July 25, 2019 06:22
Subject: Re: "Damaged tar archive" for perl-5.31.2 on ftp calls from CPAN mirrors
Message ID: CAFVdu0RKxQGfkGL9cP7Jp+N-GPDbqaQES1TKaJNvV1V+r_uw7A@mail.gmail.com
On Mon, Jul 22, 2019 at 9:58 AM James E Keenan <jkeenan@pobox.com> wrote:

> On 7/22/19 1:07 AM, Tony Cook wrote:
> > On Sun, Jul 21, 2019 at 09:49:04PM -0400, James E Keenan wrote:
> >>>
> >>> Otherwise, since you're using FTP, do you call binary() on the
> >>> Net::FTP object?
> >>>
> >>
> >> No, I don't.  But I've performed this call (inside Perl::Download::FTP)
> >> hundreds of times with no problem -- and now dozens of times from within
> >> this VM.
> >
> > Try it, I expect you've just been lucky in your choice of mirrors.
> >
> > If you're transferring binary files always use binary mode for FTP.
> >
> > You might want to check the checksums too.
> >
> > Tony
> >
>
> Tony, ask, sisyphus ... thanks as always for your contributions.
>
> [sisyphus is now in his 20th year of giving me helpful suggestions for
> Perl!]
>
> Here is what I have done to move things forward, specifically, to get
> the CPAN-River-3000 to run for perl-5.31.2.
>
> 1. On github, I have created a new branch for my CPAN distro
> Perl-Download-FTP:
>
> https://github.com/jkeenan/perl-download-ftp/tree/binary
>
> In this branch, following Tony's suggestion, I have added a call to the
> Net::FTP::binary() method before each call to Net::FTP::get().  No
> changes yet in test suite or documentation, but all tests PASS including
> live network tests.
>

Jim,

Tony's advice is exactly the right answer here.  Enabling binary mode is
more than a good suggestion -- it's an essential bug fix, with virtually
zero chance of breaking anything.  This change should fix the original
issue.  Please merge this branch and make a new release for it!

You said you've downloaded binary files via FTP hundreds of times without
any problem, so using binary mode might seem to be irrelevant or
unnecessary.  However, as Tony said, you've just been lucky so far.  Binary
mode really isn't optional here -- it's required to reliably transfer
binary data via FTP.
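
For concreteness, here is a minimal sketch of the fix pattern, using only
documented Net::FTP methods (the mirror hostname and paths below are
placeholders for illustration, not the actual Perl::Download::FTP
internals):

    use strict;
    use warnings;
    use Net::FTP;

    # Hypothetical CPAN mirror, for illustration only.
    my $ftp = Net::FTP->new('ftp.cpan.example')
        or die "Cannot connect: $@";
    $ftp->login('anonymous', '-anonymous@') or die $ftp->message;
    $ftp->cwd('/pub/CPAN/src/5.0')          or die $ftp->message;
    $ftp->binary                            or die $ftp->message;  # the fix
    $ftp->get('perl-5.31.2.tar.gz')         or die $ftp->message;
    $ftp->quit;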

TL;DR: The full detailed explanation below is VERY long -- feel free to
skip reading it!  The short summary is that FTP implementations on
identical platforms (e.g. Linux FTP client downloading from Linux FTP
server) might work perfectly every time, even when transferring binary data
in the default ASCII mode.  However, this is misleading because dissimilar
platforms (e.g. Windows FTP client downloading from Linux FTP server) will
always cause data corruption of binary data if sent in ASCII mode.  The
moral of the story is to ALWAYS use binary mode to transfer binary data via
FTP, whether or not it seems to be necessary.

Hopefully SOMEONE will find the detailed explanation below to be
interesting, or helpful in understanding this issue!

Okay, so this raises the obvious question: if binary mode is "required",
then why did it work hundreds of times without it in the first place?

The answer lies in the FTP protocol itself, which is defined by RFC 959:

     https://tools.ietf.org/html/rfc959 (FTP)

RFC 959 references the TELNET protocol, which is defined by RFC 854:

     https://tools.ietf.org/html/rfc854 (TELNET)

Irrelevant but interesting piece of trivia: the FTP control connection is
actually defined as a TELNET connection!  (Does everyone cheat and just use
a plain TCP stream instead?)

The FTP specification defines the default ASCII data type in section
3.1.1.1:

> 3.1.1.1.  ASCII TYPE
>
> This is the default type and must be accepted by all FTP implementations.
> It is intended primarily for the transfer of text files, except when both
> hosts would find the EBCDIC type more convenient.
>
> The sender converts the data from an internal character representation to
> the standard 8-bit NVT-ASCII representation (see the Telnet
> specification).  The receiver will convert the data from the standard form
> to his own internal form.
>
> In accordance with the NVT standard, the <CRLF> sequence should be used
> where necessary to denote the end of a line of text.  (See the discussion
> of file structure at the end of the Section on Data Representation and
> Storage.)
>
> Using the standard NVT-ASCII representation means that data must be
> interpreted as 8-bit bytes.
>
> The Format parameter for ASCII and EBCDIC types is discussed below.


This default data type should only be used for simple ASCII text files.
Binary data -- including text files containing non-ASCII characters (e.g.
UTF-8) -- should instead be sent using the "image" data type, which is
defined in section 3.1.1.3:

> 3.1.1.3.  IMAGE TYPE
>
> The data are sent as contiguous bits which, for transfer, are packed into
> the 8-bit transfer bytes.  The receiving site must store the data as
> contiguous bits.  The structure of the storage system might necessitate the
> padding of the file (or of each record, for a record-structured file) to
> some convenient boundary (byte, word or block).  This padding, which must
> be all zeros, may occur only at the end of the file (or at the end of each
> record) and there must be a way of identifying the padding bits so that
> they may be stripped off if the file is retrieved.  The padding
> transformation should be well publicized to enable a user to process a file
> at the storage site.
>
> Image type is intended for the efficient storage and retrieval of files
> and for the transfer of binary data.  It is recommended that this type be
> accepted by all FTP implementations.


This "image" data type is usually described as "binary mode", and that's
the easiest way to think of it.  Technically, the "binary" command in the
command-line "ftp" client actually sends "TYPE I" at the protocol level to
select the "image" data type, which is the appropriate data type for binary
data files.  (This is also what Net::FTP::binary() will do.)
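
At the protocol level, a binary download therefore begins with an exchange
like this on the control connection (a sketch only; data-connection setup
via PORT/PASV is omitted, and exact server wording varies by
implementation):

    TYPE I
    200 Switching to Binary mode.
    RETR perl-5.31.2.tar.gz
    150 Opening BINARY mode data connection for perl-5.31.2.tar.gz
    226 Transfer complete.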

To understand why transferring binary files in ASCII mode sometimes works
and sometimes fails, it helps to understand the "Network Virtual Terminal"
(NVT) and its default ASCII mode.  (This is what "NVT-ASCII" in the FTP
specification is referring to.)  The NVT is carefully defined in the TELNET
specification.  The following paragraphs (on pages 11-12 of RFC 854) are of
particular interest:

The sequence "CR LF", as defined, will cause the NVT to be positioned at
> the left margin of the next print line (as would, for example, the sequence
> "LF CR").  However, many systems and terminals do not treat CR and LF
> independently, and will have to go to some effort to simulate their effect.
>  (For example, some terminals do not have a CR independent of the LF, but
> on such terminals it may be possible to simulate a CR by backspacing.)
>  Therefore, the sequence "CR LF" must be treated as a single "new line"
> character and used whenever their combined action is intended; the sequence
> "CR NUL" must be used where a carriage return alone is actually desired;
> and the CR character must be avoided in other contexts.  This rule gives
> assurance to systems which must decide whether to perform a "new line"
> function or a multiple-backspace that the TELNET stream contains a
> character following a CR that will allow a rational decision.
>
> Note that "CR LF" or "CR NUL" is required in both directions (in the
> default ASCII mode), to preserve the symmetry of the NVT model.  Even
> though it may be known in some situations (e.g., with remote echo and
> suppress go ahead options in effect) that characters are not being sent to
> an actual printer, nonetheless, for the sake of consistency, the protocol
> requires that a NUL be inserted following a CR not followed by a LF in the
> data stream.  The converse of this is that a NUL received in the data
> stream after a CR (in the absence of options negotiations which explicitly
> specify otherwise) should be stripped out prior to applying the NVT to
> local character set mapping.


In the FTP protocol, the ASCII type requires 8-bit transfer bytes to be
used to represent 7-bit ASCII characters.  There's no guarantee that
arbitrary 8-bit data can be sent, since any byte with the high bit set
(128-255) is outside the ASCII range.  The FTP implementation could
potentially discard bytes with the high bit set, or strip the high bit and
send the other 7 bits of that byte.  These things wouldn't violate the FTP
protocol, since the ASCII data type only guarantees the ability to send
ASCII text.  In practice, it's common for 8-bit data to be sent and
received anyhow, ignoring the fact that ASCII is defined as a 7-bit
character set.  This is entirely implementation-specific, so there's no
guarantee that this will work with any particular FTP server or FTP
client.  Hypothetically, there could even be an FTP implementation that
converts "text" between Unicode (e.g. UTF-8 or UCS-2) and Latin-1 or
Windows-1252, to the best of its ability -- a translation which would badly
corrupt binary data files.
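
As a sketch of how little the ASCII type promises for high-bit bytes,
either of the following byte-level transformations (hypothetical here, not
attributed to any particular server) would be fatal to binary data:

    use strict;
    use warnings;

    my $data = "Hi\xc3\xa9";    # two ASCII bytes plus UTF-8 "é" (0xC3 0xA9)

    # One legal-but-destructive choice: strip the high bit of every byte.
    (my $stripped = $data) =~ s/(.)/chr(ord($1) & 0x7f)/ges;

    # Another: silently drop every byte outside the 7-bit ASCII range.
    (my $dropped = $data) =~ tr/\x00-\x7f//cd;

    printf "original %d bytes, stripped %d bytes, dropped %d bytes\n",
        length($data), length($stripped), length($dropped);   # 4, 4, 2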

In practice, the most frequent cause of corruption involves newlines.  For
example, if a Windows FTP client is downloading a file from a Linux FTP
server in ASCII mode, each implementation is required to translate its
local newlines to/from the standard "CR LF" 2-byte sequence on the wire.
If this is done with arbitrary binary data such as a compressed tar file or
zip file, every time a byte value of 10 (LF) occurs in that binary data,
the Linux FTP server MUST prefix this with byte 13 (CR) in order to
translate this to "CR LF" on the wire, because the local convention is to
represent a newline as "LF" alone.  Meanwhile, the Windows FTP client is
simultaneously required to translate that "CR LF" to the local newline
convention, which happens to be the same "CR LF" on Windows.  Neither FTP
implementation is allowed to skip these newline conversions; they are
required by the FTP protocol when using the default ASCII mode.
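
Here is a minimal sketch of the destructive half of that exchange
(simplified to the LF -> CR LF rewrite, which is the part that inserts
bytes):

    use strict;
    use warnings;

    # A few bytes of make-believe compressed data containing LF (0x0a).
    my $original = "\x1f\x8b\x08\x0a\x42\x0a\xff";

    # The Linux server's mandatory ASCII-mode rewrite: LF becomes "CR LF".
    (my $on_wire = $original) =~ s/\n/\r\n/g;

    # The Windows client keeps "CR LF" as-is (its local newline), so the
    # inserted CR bytes survive into the file it writes to disk.
    printf "sent %d bytes, stored %d bytes\n",
        length($original), length($on_wire);   # 7 vs. 9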

In the example above, the Windows FTP client would end up writing a data
file where an extra byte 13 (CR) has been silently inserted before every
byte 10 (LF) in the binary data.  This changes the size of the file and
also corrupts the compressed data stream, making it impossible to
uncompress it correctly.  Corruption like this is VERY common when binary
files are transferred in ASCII mode, which is why the standard practice is
to ALWAYS use binary mode (image type) to transfer binary data via FTP.
That is the correct data type for binary data, and the only one which will
reliably transfer binary data between various system types without data
corruption like this.  (The exact nature of the corruption will vary
depending on the local newline conventions at each end, but all of them
irreparably damage compressed data.)

However, as you experienced, sometimes ASCII mode transfers a binary file
just fine, with NO data corruption.  In general, this is just luck of the
draw -- the only way this can happen is for the FTP client and FTP server
to be using the same local newline conventions, and otherwise send 8-bit
clean data through the ASCII data transfer.  This is a best-case scenario,
and can mislead people into believing that it's fine to transfer binary
data over FTP in the default ASCII mode.

As a "lucky" best-case example, consider a Linux FTP client downloading a
binary data file from a Linux FTP server.  (Unix or BSD could easily stand
in for Linux on either side here.)  In this example, the FTP server is
still required to prefix each byte 10 (LF) in the original data file with
byte 13 (CR), sending "CR LF" on the wire, representing a newline.  In the
best-case scenario, the Linux FTP server would also follow each byte 13
(CR) in the original data file with an added byte 0 (NUL), translating to
"CR NUL" on the wire.  (This means that "CR LF" in the original data would
be sent on the wire as "CR NUL CR LF".)  The Linux FTP client would then
remove each byte 13 (CR) which is immediately followed by byte 10 (LF),
translating "CR LF" on the wire back to the "LF" local newline convention.
It would also discard every byte 0 (NUL) which is immediately preceded by
byte 13 (CR), translating "CR NUL" on the wire back to "CR" locally.  If
the rest of the 8-bit data is passed through cleanly, these reversible
translations actually allow the binary data to be successfully transferred
without corruption between these particular FTP implementations.
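
Here is a sketch of that symmetric translation, showing why the
Unix-to-Unix case happens to round-trip losslessly (assuming, as above, an
otherwise 8-bit-clean path):

    use strict;
    use warnings;

    # Unix-side NVT-ASCII encoding: bare CR -> "CR NUL", local LF -> "CR LF".
    sub nvt_encode {
        my ($data) = @_;
        $data =~ s/\r/\r\0/g;    # CR -> CR NUL
        $data =~ s/\n/\r\n/g;    # LF -> CR LF
        return $data;
    }

    # The matching Unix-side decode: "CR LF" -> LF, then "CR NUL" -> CR.
    sub nvt_decode {
        my ($wire) = @_;
        $wire =~ s/\r\n/\n/g;    # CR LF  -> LF
        $wire =~ s/\r\0/\r/g;    # CR NUL -> CR
        return $wire;
    }

    # Arbitrary binary data, including LF, "CR LF", bare CR and high bytes.
    my $binary = "\x00\x0a\x0d\x0a\x0d\xc8\xff";
    print nvt_decode(nvt_encode($binary)) eq $binary
        ? "round-trip OK\n" : "corrupted\n";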

Since Linux FTP client and Linux FTP server would be a common environment
for testing, this could easily make it appear that ASCII mode is perfectly
reliable for binary data transfers, causing an unpleasant surprise later
when someone tries a pairing of FTP implementations/platforms which causes
data corruption.

Today, it's a bit of a nuisance that ASCII is the default mode, since
straight ASCII text files are becoming less and less common (often replaced
with UTF-8 now) and compressed binary data has become extremely common
(e.g. compressed tar files, zip files, document formats with XML inside a
Zip container, etc.).  Using binary mode is more often what's needed these
days, but ASCII was a very sensible default back in the 1970s, when
various computers in a widely heterogeneous environment
needed to exchange text files cleanly.  Back then, it was much more common
to see strange byte/word sizes (e.g. 9-bit bytes or 36-bit words), weird
ASCII-incompatible character sets like EBCDIC (which FTP actually has a
special data type for), etc.  Converting everything to ASCII on the wire
with standardized newlines made sense in that environment.

The computing environment now (in 2019) is improving.  For example, 8-bit
bytes have been the norm for decades now, and UTF-8 is becoming
increasingly common as a default character set to use for interoperability,
exported data, portable data files, document formats, XML, etc.  On the
other hand, differences in newline conventions still remain a hassle to
this day, as in the example above.  Hopefully someday UTF-8 will be used
everywhere (as Perl does), including in databases, and Unix-style LF
newlines will become the norm.  Apple finally moved from the
historical CR-only newlines which MacOS had always used, to Unix-style LF
newlines as the default in OS X.  (Surely this is no coincidence, since OS
X is built on top of Darwin, a BSD Unix variant developed by Apple.)

If Windows were to make the same transition to Unix-style LF newlines,
incompatibilities related to newline differences would tend to disappear.
If Windows and Java were to switch from UTF-16 (originally UCS-2) to UTF-8 as the default
Unicode encoding, that would also make life a lot better all around.
Unfortunately, these changes don't seem likely any time in the foreseeable
future.  C'est la vie!

Sorry this email grew to book length!  Hopefully someone out there finds
these details illuminating...

Deven
