I believe that this issue has been causing occasional failures already: apparently the exact content of a YouTube page, in particular the configuration lines, is never the same from one access to another; and I can only surmise that it differs further depending on whether or not it's VLC accessing it. The main configuration line used by the YouTube lua script, containing ytplayer.config, is routinely over 100,000 characters long, and it's not even the longest line of the HTML page.
This example video has been failing for me most of the time, fairly reproducibly (though not always): https://www.youtube.com/watch?v=QK8mJJJvaes Debugging indicates that on reaching the very long config line, vlc.readline() returns nil, as if at EOF. An even more striking example is https://www.youtube.com/get_video_info?video_id=QK8mJJJvaes&el=detailpage, a normally supported API endpoint that returns all the configuration on a single line, now over 400,000 characters long: there vlc.readline() returns nil straight away.
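To make the failure mode concrete, here is a simplified sketch of the kind of line loop a lua playlist parser runs (illustrative only, not the actual youtube.lua code):

```lua
-- Simplified sketch of a lua playlist parser loop (illustrative only,
-- not the actual youtube.lua code): scan the page line by line
-- looking for the ytplayer.config object.
function parse()
    while true do
        local line = vlc.readline()
        if not line then
            -- nil normally means EOF, but as described above it also
            -- shows up mid-stream on the over-long config line
            break
        end
        if string.match( line, "ytplayer%.config" ) then
            -- the stream URLs would be extracted from this line here
        end
    end
    return {}
end
```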
My testing leads me to believe that these inputs hit the hardcoded STREAM_LINE_MAX limit of 204,800 bytes used in stream_ReadLine() in src/input/stream.c. This limit looks like a safety check; however, what is its actual technical purpose, if any?
Furthermore, even if you mistakenly call stream_ReadLine() on a random binary stream, the odds of not hitting a 0x0a line feed somewhere within 204,800 bytes are vanishingly small (on uniformly random bytes, roughly (255/256)^204800, i.e. essentially zero); so reaching the limit is rather a clue that what is being parsed really is text, with a deliberately very long line, in which case the expected behavior would definitely be to go on and parse it.
Obvious fixes would be to bump the limit, or simply do away with it if it serves no real purpose.
The point is to prevent exhausting memory on an invalid stream. Removing the limit would not only be a security issue in itself, it would also expose a ton of potential integer overflow errors at call sites.
It's not really a "text" file if you have 100,000 character long "lines" in it anyway.
Problem is real. I just don't think it is in the core. vlc_stream_ReadLine() is meant to read, well, a line in a line-based file format. This is useful for most playlist and some subtitle formats in particular.
XML and HTML are hardly line-based: a newline only has meaning in preformatted text.
And JavaScript/JSON is also mostly not line-based. The only line-based aspect of it that I know of is C++-style comments.
XML and HTML are hardly line-based: a newline only has meaning in preformatted text. And JavaScript/JSON is also mostly not line-based.
Good point. Fair enough, I had a go at rethinking how lua playlist scripts could be organized around a different paradigm than the admittedly, occasionally awkward line-based approach.
However, vlc.readline() is the only useful API available to lua playlist scripts for fetching the web page data. The XML API allows plugging in a VLC stream object, but it's not currently possible to access the lua script's underlying stream object for that purpose. And even if it were, the problematic case we have here with the very long lines is usually recovering the JSON contents of some HTML <script> tag: if I understand the XML parser code correctly, that involves a big strdup(), followed by pushing the result as a lua string. So we're still hitting the same issue of reading it all, up to the end, into one very long string.
We would need a JSON parser that accepts I/O callbacks using the VLC stream, and then the ability to somehow chain the VLC stream from the XML parser's <script> tag to the JSON parser.
But keep in mind that it's not actually JSON that we have to deal with there, it's JavaScript and often escaped JSON or other data, as in the YouTube example. So we'd also need to insert into the stream processing a custom unescaping filter suited to the current needs.
And even when we do have the corresponding unescaping facilities... The YouTube API returns application/x-www-form-urlencoded data, but its first-level fields are few, and each is a very long segment of URL-encoded data. And our URL-decoding utilities again take in whole strings, not stream readers.
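(For illustration, the usual whole-string idiom in lua looks something like the sketch below; it is not necessarily VLC's own helper, and it inherently needs the complete field in memory before anything can be decoded.)

```lua
-- Sketch of the usual whole-string URL-decoding idiom in lua
-- (illustrative, not necessarily VLC's own helper): it operates on a
-- complete string, so the very long field must be fully in memory first.
function url_decode( s )
    return ( s:gsub( "+", " " )
              :gsub( "%%(%x%x)", function( h )
                   return string.char( tonumber( h, 16 ) )
               end ) )
end
```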
So we can try, but it's intractable. It's not realistically cost-effective to do away with the reading of very long text blobs into single continuous memory strings of arbitrary length. What are the other solutions?
The point is to prevent exhausting memory on an invalid stream. Removing the limit would not only be a security issue in itself, it would also expose a ton of potential integer overflow errors at call sites.
vlc_stream_ReadLine() returns a string pointer. So where would the integer overflow happen, in strlen() or something similar? Anyway, the current 204,800 limit doesn't line up with any overflow concern. And memory exhaustion must be a far more pervasive concern elsewhere in a media player that accepts network streams of several Mbps and passes them to 4K video ring buffers. Bumping the line limit from 200 kB to 1 MB, 4 MB or even 16 MB should still stay far below the point of causing any real issue of the kind you mention. So maybe we could do that, and still keep a limit that protects against those issues. Also, we're talking about web assets that get minified and optimized, notably for performance on mobile, and whose weight doesn't exceed a few MB: so we have a good expectation of what would indeed be enough to solve the problem.
So I think the best solutions are:
Bumping up the current limit in the stream code. Even though it's outside the scope of the line-based file formats that you assert for this function, it causes no issue or downside, doesn't add complexity, and expands support.
Copy-pasting the vlc_stream_ReadLine() code into the lua module, and enabling long line support there, since that implementation will definitely include parsing HTML, JSON and web API data within its scope. This still seems absurd to me though.
Leaving YouTube playback broken and getting more and more broken. Users seem to find this feature pretty popular though.
Hardcoded arbitrary limits are supposed to be poor design and something to avoid. I'm surprised to see you take this side; I had actually imagined that, if not me, it would have been just like you to point out that the limit could even simply be done away with.
Supplementing the vlc_stream_ReadLine() API with a way to pass a parameter controlling or overriding the limit, rather than keeping it hardcoded.
Bumping up the limit in the stream code, and adding another limit, similar to the current one, in a code path specific to playlist and subtitle parsers that expect short lines.
A possible workaround for users would be to keep retrying the YouTube URL until it happens to work.
First, for large buffers, chunking is generally preferable. And if it's really not feasible, you should use normal raw byte arrays, rather than pretend you're reading a line.
It makes zero sense for a line to be 4 MiB long, and the line reading code is actually very inefficient (and that's kind of intrinsic and unfixable).
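As a rough illustration, sub-line chunking on the lua side could look something like this (a sketch only; it assumes the vlc.read( n ) playlist-script primitive, and the chunk size, overlap and pattern are arbitrary examples):

```lua
-- Rough sketch of sub-line chunked scanning on the lua side
-- (illustrative only; chunk size, overlap and pattern are arbitrary):
-- read fixed-size chunks and keep a small overlap so a marker that
-- straddles a chunk boundary is not missed.
local CHUNK = 64 * 1024
local tail = ""
while true do
    local chunk = vlc.read( CHUNK )
    if not chunk or chunk == "" then
        break
    end
    local window = tail .. chunk
    if string.find( window, "ytplayer%.config" ) then
        -- marker found; the relevant data would then be extracted
        -- from around this position, still without holding the whole line
        break
    end
    -- keep the last few hundred bytes as overlap for the next iteration
    tail = string.sub( window, -256 )
end
```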
If you have 200 KB of data which does not contain a single \n, it's not random data. It's really text that should be handled as such. IMO there shouldn't even be a limit.
What file format reasonably gets 200,000-character-long lines and is actually a line-based text format? How will you prevent DoS attacks if you don't process such a file format in sub-line chunks?
XML and JSON are not line-based, for that matter. Line feeds are just whitespace there. Point being, not all text-based formats are line-based.
Especially when looking for a size-bounded pattern, you don't need to read an unbounded amount of data. The goal is not even to use the full line, so readline was just a convenient way to slice the incoming data.
How will you prevent DoS attacks if you don't process such a file format in sub-line chunks?
How much memory does a 4K video ring buffer take? But VLC won't be able to withstand loading more than 200 kB of text into memory at a time?... Can I complain about DoS attacks too when someone sends me a picture or video whose resolution is so unexpectedly high that it crashes my system?
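(For scale: a single uncompressed 3840×2160 frame in 8-bit 4:2:0 already weighs about 3840 × 2160 × 1.5 ≈ 12 MB, roughly sixty times the 204,800-byte line limit.)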
Also, how will you prevent DoS attacks when the prescribed solution has been to use a non-line API to read an even bigger single non-chunk of several MB, just to make sure the read wasn't cut sub-line? And how will you prevent DoS attacks when the user just stitches the chunks back together into one big string to pass to a JSON parser, a JavaScript interpreter, or even just a regexp match?
A fixed size that can be allocated in constant time. And how often do you decode 4K video on the CPU anyway?
If you want to allocate a 16 MiB fixed-size buffer for parsing, that's fine with me. The prefetch module already does that. The problems are the variable size, the peeking and the iterative behaviour of reading a line.
With hindsight and experience, here's my current take on this. The issue was actually compounded by bugs and limitations in the stream_ReadLine() API: if it failed to get a whole line, not only would it return nothing, but the data it had searched would actually get consumed and lost. With the current implementation in the lua scripts, the maximum amount of data is still returned even when the line is incomplete, which gives you a fairly good chance of finding what you were looking for among the returned data.
So I'm not convinced that adding a size limit parameter to the API makes a lot of sense semantically: if you have assumptions or expectations about maximum line length, just read that much data already and find the EOL yourself. If we're talking about extending the API, I think it would be more useful to accept a match token, e.g. </script>, instead of the linefeed that is hardcoded now, so as to ask the API to return data until that token is met.
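In lua terms, that amounts to something like the following (a sketch only; it assumes the vlc.read( n ) playlist-script primitive, and the 4 MB cap and the </script> token are arbitrary examples):

```lua
-- Sketch of "read a bounded amount and find the delimiter yourself"
-- (illustrative; the 4 MB cap and the </script> token are arbitrary examples).
local MAX = 4 * 1024 * 1024
local data = vlc.read( MAX ) or ""
-- look for the end of line ourselves, or for any other token such as </script>
local pos = string.find( data, "\n", 1, true )
         or string.find( data, "</script>", 1, true )
local blob = pos and string.sub( data, 1, pos - 1 ) or data
```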
It makes zero sense for a line to be 4 MiB long, and the line reading code is actually very inefficient (and that's kind of intrinsic and unfixable).
Well, then an API to return long complete text blobs for further use might make sense. Otherwise, what's being done now, line reading in lua through cross-language APIs, can only be even less efficient than doing it in the C core. Also, IIRC I refactored that subsystem from quadratic to linear complexity; and again, it is more likely that core code will be properly optimized than it is that user code will be written correctly.
For the record, this was discussed at a previous workshop, and IIRC that was already the conclusion of the meeting.
What was the conclusion?
@robUx4 Do you actually have a problem, apart from an outstanding and maybe obsolete patch? Lua scripts that were affected by this were fixed; they read up to 1 MB or 4 MB, and that's been just fine so far.
just read that much data already and find the EOL yourself
In this year's GSoC, I also wanted to add stream parsers in C, in particular for regexes, parenthesis and brace scopes, JSON and XML, but we didn't get enough time for that with the student. The goal would have been to write the parsing rules declaratively instead of writing the parser itself, easing the maintenance of the playlist scripts and improving their speed.
@robUx4 Do you actually have a problem, apart from an outstanding and maybe obsolete patch?
No, I just saw this patch in Android and wondered if it could be dropped. It mentions this issue as the cause for the patch. If Lua was the only issue then we can probably remove it. cc @tguillem
Hey, I remember that patch from when I submitted it, and it was promptly rejected! Why was it applied in the first place after being rejected for mainline?... I think it should be dropped, yes.
The Android, iOS and libvlcsharp ports often use patches not found in master because they solve their particular use case well enough (or allow them to build live555). I agree it's wrong, but with no better solution in sight it's better than nothing.
Just as we should try to upstream all the patches we have in contribs, the ports should make sure that their particular fix, or some other solution to the actual problem, eventually lands in master. In this particular case I think the patch can be dropped.