I believe that this issue has been causing occasional failures already: apparently the exact content of a YouTube page, in particular the configuration lines, is never the same from one access to another; and I can only surmise that it differs further depending on whether or not it's VLC accessing it. The main configuration line used by the YouTube lua script, containing ytplayer.config, is routinely over 100,000 characters long, and it's not even the longest line of the HTML page.
This example video has been failing for me most of the time, fairly reproducibly (though not always): https://www.youtube.com/watch?v=QK8mJJJvaes Debugging indicates that on reaching the very long config line, vlc.readline() returns nil, as if at EOF. An even more striking example is https://www.youtube.com/get_video_info?video_id=QK8mJJJvaes&el=detailpage, a normally supported API endpoint that returns all the configuration on a single line, now over 400,000 characters long: there vlc.readline() returns nil straight away.
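To make the failure mode concrete, here is a simplified sketch of the kind of line loop a lua playlist parser runs (illustrative only, not the actual youtube.lua code):

```lua
-- Simplified sketch of a lua playlist parser loop (illustrative only,
-- not the actual youtube.lua code): scan the page line by line
-- looking for the ytplayer.config object.
function parse()
    while true do
        local line = vlc.readline()
        if not line then
            -- nil normally means EOF, but as described above it also
            -- shows up mid-stream on the over-long config line
            break
        end
        if string.match( line, "ytplayer%.config" ) then
            -- the stream URLs would be extracted from this line here
        end
    end
    return {}
end
```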
My testing leads me to believe that these inputs hit the hardcoded STREAM_LINE_MAX limit of 204,800 bytes used in stream_ReadLine() in src/input/stream.c. This limit looks like a safety check; however, what is its actual technical purpose, if any?
Furthermore, even if you mistakenly call stream_ReadLine() on a random binary stream, the odds of not hitting a 0x0a line feed somewhere within 204,800 bytes are vanishingly small (on uniformly random bytes, roughly (255/256)^204800, i.e. essentially zero); so reaching the limit is rather a clue that what is being parsed really is text, with a deliberately very long line, in which case the expected behavior would definitely be to go on and parse it.
Obvious fixes would be to bump the limit, or simply do away with it if it serves no real purpose.
The point is to prevent exhausting memory on an invalid stream. Removing the limit would not only be a security issue in itself, it would also expose a ton of potential integer overflow errors at call sites.
It's not really a "text" file if you have 100,000 character long "lines" in it anyway.
Problem is real. I just don't think it is in the core. vlc_stream_ReadLine() is meant to read, well, a line in a line-based file format. This is useful for most playlist and some subtitle formats in particular.
XML and HTML are hardly line-based: a newline only has meaning in preformatted text.
And JavaScript/JSON is also mostly not line-based. The only line-based aspect of it that I know of is C++-style comments.
XML and HTML are hardly line-based: a newline only has meaning in preformatted text. And JavaScript/JSON is also mostly not line-based.
Good point. Fair enough, I had a go at rethinking how lua playlist scripts could be organized around a different paradigm than the admittedly, occasionally awkward line-based approach.
However, vlc.readline() is the only useful API available to lua playlist scripts for fetching the web page data. The XML API allows plugging in a VLC stream object, but it's not currently possible to access the lua script's underlying stream object for that purpose. And even if it were, the problematic case we have here with the very long lines is usually recovering the JSON contents of some HTML <script> tag: if I understand the XML parser code correctly, that involves a big strdup(), followed by pushing the result as a lua string. So we're still hitting the same issue of reading it all, up to the end, into one very long string.
We would need a JSON parser that accepts I/O callbacks using the VLC stream, and then the ability to somehow chain the VLC stream from the XML parser's <script> tag to the JSON parser.
But keep in mind that it's not actually JSON that we have to deal with there, it's JavaScript and often escaped JSON or other data, as in the YouTube example. So we'd also need to insert into the stream processing a custom unescaping filter suited to the current needs.
And even when we do have the corresponding unescaping facilities... The YouTube API returns application/x-www-form-urlencoded data, but its first-level fields are few, and each is a very long segment of URL-encoded data. And our URL-decoding utilities again take in whole strings, not stream readers.
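(For illustration, the usual whole-string idiom in lua looks something like the sketch below; it is not necessarily VLC's own helper, and it inherently needs the complete field in memory before anything can be decoded.)

```lua
-- Sketch of the usual whole-string URL-decoding idiom in lua
-- (illustrative, not necessarily VLC's own helper): it operates on a
-- complete string, so the very long field must be fully in memory first.
function url_decode( s )
    return ( s:gsub( "+", " " )
              :gsub( "%%(%x%x)", function( h )
                   return string.char( tonumber( h, 16 ) )
               end ) )
end
```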
So we can try, but it's intractable. It's not realistically cost-effective to do away with the reading of very long text blobs into single continuous memory strings of arbitrary length. What are the other solutions?
The point is to prevent exhausting memory on an invalid stream. Removing the limit would not only be a security issue in itself, it would also expose a ton of potential integer overflow errors at call sites.
vlc_stream_ReadLine() returns a string pointer. So where would the integer overflow happen, in strlen() or something similar? Anyway, the current 204,800 limit doesn't line up with any overflow concern. And memory exhaustion must be a far more pervasive concern elsewhere in a media player that accepts network streams of several Mbps and passes them to 4K video ring buffers. Bumping the line limit from 200 kB to 1 MB, 4 MB or even 16 MB should still stay far below the point of causing any real issue of the kind you mention. So maybe we could do that, and still keep a limit that protects against those issues. Also, we're talking about web assets that get minified and optimized, notably for performance on mobile, and whose weight doesn't exceed a few MB: so we have a good expectation of what would indeed be enough to solve the problem.
So I think the best solutions are:
Bumping up the current limit in the stream code. Even though it's outside the scope of the line-based file formats that you assert for this function, it causes no issue or downside, doesn't add complexity, and expands support.
Copy-pasting the vlc_stream_ReadLine() code into the lua module, and enabling long line support there, since that implementation will definitely include parsing HTML, JSON and web API data within its scope. This still seems absurd to me though.
Leaving YouTube playback broken and getting more and more broken. Users seem to find this feature pretty popular though.
Hardcoded arbitrary limits are supposed to be poor design and something to avoid. I'm surprised to see you take this side; I had actually imagined that, if not me, it would have been just like you to point out that the limit could even simply be done away with.
Supplementing the vlc_stream_ReadLine() API with a way to pass a parameter controlling or overriding the limit, rather than keeping it hardcoded.
Bumping up the limit in the stream code, and adding another limit, similar to the current one, in a code path specific to playlist and subtitle parsers that expect short lines.
A possible workaround for users would be to keep retrying the YouTube URL until it happens to work.
First, for large buffers, chunking is generally preferable. And if it's really not feasible, you should use normal raw byte arrays, rather than pretend you're reading a line.
It makes zero sense for a line to be 4 MiB long, and the line reading code is actually very inefficient (and that's kind of intrinsic and unfixable).
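As a rough illustration, sub-line chunking on the lua side could look something like this (a sketch only; it assumes the vlc.read( n ) playlist-script primitive, and the chunk size, overlap and pattern are arbitrary examples):

```lua
-- Rough sketch of sub-line chunked scanning on the lua side
-- (illustrative only; chunk size, overlap and pattern are arbitrary):
-- read fixed-size chunks and keep a small overlap so a marker that
-- straddles a chunk boundary is not missed.
local CHUNK = 64 * 1024
local tail = ""
while true do
    local chunk = vlc.read( CHUNK )
    if not chunk or chunk == "" then
        break
    end
    local window = tail .. chunk
    if string.find( window, "ytplayer%.config" ) then
        -- marker found; the relevant data would then be extracted
        -- from around this position, still without holding the whole line
        break
    end
    -- keep the last few hundred bytes as overlap for the next iteration
    tail = string.sub( window, -256 )
end
```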
If you have 200 KB of data which does not contain a single \n, it's not random data. It's really text that should be handled as such. IMO there shouldn't even be a limit.
What file format reasonably gets 200,000-character-long lines and is actually a line-based text format? How will you prevent DoS attacks if you don't process such a file format in sub-line chunks?
XML and JSON are not line-based, for that matter. Line feeds are just whitespace there. Point being, not all text-based formats are line-based.
Especially when looking for a size-bounded pattern, you don't need to read an unbounded amount of data. The goal is not even to use the full line, so readline was just a convenient way to slice the incoming data.
How will you prevent DoS attacks if you don't process such a file format in sub-line chunks?
How much memory does a 4K video ring buffer take? But VLC won't be able to withstand loading more than 200 kB of text into memory at a time?... Can I complain about DoS attacks too when someone sends me a picture or video whose resolution is so unexpectedly high that it crashes my system?
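(For scale: a single uncompressed 3840×2160 frame in 8-bit 4:2:0 already weighs about 3840 × 2160 × 1.5 ≈ 12 MB, roughly sixty times the 204,800-byte line limit.)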
Also, how will you prevent DoS attacks when the prescribed solution has been to use a non-line API to read an even bigger single non-chunk of several MB, just to make sure the read wasn't cut sub-line? And how will you prevent DoS attacks when the user just stitches the chunks back together into one big string to pass to a JSON parser, a JavaScript interpreter, or even just a regexp match?
A fixed size that can be allocated in constant time. And how often do you decode 4K video on the CPU anyway?
If you want to allocate a 16 MiB fixed-size buffer for parsing, that's fine with me. The prefetch module already does that. The problems are the variable size, the peeking and the iterative behaviour of reading a line.
With hindsight and experience, here's my current take on this. The issue was actually compounded by bugs and limitations in the stream_ReadLine() API: if it failed to get a whole line, not only would it return nothing, but the data it had searched would actually get consumed and lost. With the current implementation in the lua scripts, the maximum amount of data is still returned even when the line is incomplete, which gives you a fairly good chance of finding what you were looking for among the returned data.
So I'm not convinced that adding a size limit parameter to the API makes a lot of sense semantically: if you have assumptions or expectations about maximum line length, just read that much data already and find the EOL yourself. If we're talking about extending the API, I think it would be more useful to accept a match token, e.g. </script>, instead of the linefeed that is hardcoded now, so as to ask the API to return data until that token is met.
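In lua terms, that amounts to something like the following (a sketch only; it assumes the vlc.read( n ) playlist-script primitive, and the 4 MB cap and the </script> token are arbitrary examples):

```lua
-- Sketch of "read a bounded amount and find the delimiter yourself"
-- (illustrative; the 4 MB cap and the </script> token are arbitrary examples).
local MAX = 4 * 1024 * 1024
local data = vlc.read( MAX ) or ""
-- look for the end of line ourselves, or for any other token such as </script>
local pos = string.find( data, "\n", 1, true )
         or string.find( data, "</script>", 1, true )
local blob = pos and string.sub( data, 1, pos - 1 ) or data
```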
It makes zero sense for a line to be 4 MiB long, and the line reading code is actually very inefficient (and that's kind of intrinsic and unfixable).
Well, then an API to return long complete text blobs for further use might make sense. Otherwise, what's being done now, line reading in lua through cross-language APIs, can only be even less efficient than doing it in the C core. Also, IIRC I refactored that subsystem from quadratic to linear complexity; and again, it is more likely that core code will be properly optimized than it is that user code will be written correctly.
For the record, this was discussed at a previous workshop, and IIRC that was already the conclusion of the meeting.
What was the conclusion?
@robUx4 Do you actually have a problem, apart from an outstanding and maybe obsolete patch? Lua scripts that were affected by this were fixed; they read up to 1 MB or 4 MB, and that's been just fine so far.
just read that much data already and find the EOL yourself
In this year's GSoC, I also wanted to add stream parsers in C, in particular for regexes, parenthesis and brace scopes, JSON and XML, but we didn't get enough time for that with the student. The goal would have been to write the parsing rules declaratively instead of writing the parser itself, easing the maintenance of the playlist scripts and improving their speed.
@robUx4 Do you actually have a problem, apart from an outstanding and maybe obsolete patch?
No, I just saw this patch in Android and wondered if it could be dropped. It mentions this issue as the cause for the patch. If Lua was the only issue then we can probably remove it. cc @tguillem
Hey, I remember that patch from when I submitted it, and it was promptly rejected! Why was it applied in the first place after being rejected for mainline?... I think it should be dropped, yes.
The Android, iOS and libvlcsharp ports often use patches not found in master because they solve their particular use case well enough (or allow them to build live555). I agree it's wrong, but with no better solution in sight it's better than nothing.
Just as we should try to upstream all the patches we have in contribs, the ports should make sure that their particular fix, or some other solution to the actual problem, eventually lands in master. In this particular case I think the patch can be dropped.