PageHTMLHandler should access Parsoid directly
Closed, Resolved (Public)

Description

PageHTMLHandler currently makes an HTTP request to RESTbase to fetch Parsoid HTML. It should instead call Parsoid directly. The critical aspect to consider is caching. The REST endpoint handled by PageHTMLHandler is not yet used for any public functionality, so cache capacity is not yet a problem. However, the intent is to move us towards eventually using the core REST API as a backend for VisualEditor. For that to be feasible, we need a caching strategy.

Scope of this task: Make PageHTMLHandler use ParserCache and call Parsoid directly. Wrap Parsoid output in a ParserOutput object. Note that this may add a lot of data to ParserCache if something starts to use the PageHTMLHandler endpoint heavily.
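
To illustrate the scope above, here is a minimal sketch in Python (not MediaWiki's PHP, and not the actual patch) of a handler that checks a cache first and only invokes the parser on a miss. `ParserOutputSketch`, `ParserCacheSketch`, `get_page_html` and the `parse` callback are all hypothetical names standing in for ParserOutput, ParserCache, PageHTMLHandler and a direct Parsoid call; the real classes have different interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class ParserOutputSketch:
    """Hypothetical stand-in for a ParserOutput wrapping Parsoid HTML."""
    html: str
    revision_id: int


@dataclass
class ParserCacheSketch:
    """Hypothetical in-memory stand-in for ParserCache."""
    _store: Dict[Tuple[int, int], ParserOutputSketch] = field(default_factory=dict)

    def get(self, page_id: int, revision_id: int) -> Optional[ParserOutputSketch]:
        return self._store.get((page_id, revision_id))

    def save(self, page_id: int, output: ParserOutputSketch) -> None:
        self._store[(page_id, output.revision_id)] = output


def get_page_html(
    page_id: int,
    revision_id: int,
    cache: ParserCacheSketch,
    parse: Callable[[int, int], str],
) -> ParserOutputSketch:
    """Return cached output if present; otherwise parse directly and cache it."""
    cached = cache.get(page_id, revision_id)
    if cached is not None:
        return cached
    # Cache miss: call the parser in-process (assumed here to be a plain
    # function) instead of making an HTTP request to another service.
    output = ParserOutputSketch(html=parse(page_id, revision_id), revision_id=revision_id)
    cache.save(page_id, output)
    return output
```

Every distinct page and revision requested adds one entry to the cache, which is why heavy use of the endpoint could add a lot of data.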

Later tasks to consider:

  • Pre-generating Parsoid output after edits and populating the relevant key in ParserCache.
  • Putting more information into the ParserOutput generated from Parsoid output, eventually reaching parity with the old parser's output, so Parsoid can drive secondary data updates (in particular, LinksUpdate).
  • Creating an API in core that can be used by VisualEditor, with full parity to RESTbase. As a first step, it can simply be a proxy to RESTbase for functionality still missing in core. RESTbase can then be made internal.
  • Slowly shift cache capacity from RESTbase into ParserCache: Add a mode to RESTbase that will return HTML only if it's already cached. In core, when content is not in ParserCache, ask RESTbase for that content before trying to regenerate. If it's in RESTbase, fetch it from there, store it in ParserCache, and delete from RESTbase.
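
The cache migration described in the last item can be sketched as follows. This is a hedged illustration in Python, not an implementation: `RestbaseSketch`, its `get_if_cached` and `delete` methods, and the `regenerate` callback are hypothetical, and ParserCache is modelled as a plain dictionary.

```python
from typing import Callable, Dict, Optional


class RestbaseSketch:
    """Hypothetical stand-in for RESTbase with a 'return only if cached' mode."""

    def __init__(self, stored: Dict[str, str]) -> None:
        self.stored = dict(stored)

    def get_if_cached(self, key: str) -> Optional[str]:
        # Never triggers a parse; returns None when nothing is stored.
        return self.stored.get(key)

    def delete(self, key: str) -> None:
        self.stored.pop(key, None)


def fetch_html(
    key: str,
    parser_cache: Dict[str, str],
    restbase: RestbaseSketch,
    regenerate: Callable[[str], str],
) -> str:
    """Look in ParserCache, then RESTbase, and only then regenerate."""
    html = parser_cache.get(key)
    if html is not None:
        return html

    html = restbase.get_if_cached(key)
    if html is not None:
        # Move the entry: store it in ParserCache, then delete it from RESTbase.
        parser_cache[key] = html
        restbase.delete(key)
        return html

    # Neither cache has the content, so regenerate and store it.
    html = regenerate(key)
    parser_cache[key] = html
    return html
```

Under these assumptions, each entry leaves RESTbase the first time it is requested through core, so capacity shifts gradually without a bulk copy.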

Event Timeline

Change 634021 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] REST API: inject TitleFactory

https://gerrit.wikimedia.org/r/634021

Change 634021 merged by jenkins-bot:
[mediawiki/core@master] REST API: inject TitleFactory

https://gerrit.wikimedia.org/r/634021

Change 633780 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/core@master] Use parsoid directly in /page/html handler

https://gerrit.wikimedia.org/r/633780

Change 633780 merged by jenkins-bot:
[mediawiki/core@master] Use parsoid directly in /page/html handler

https://gerrit.wikimedia.org/r/633780

Change 635086 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] Enable parsoid on api_appserver

https://gerrit.wikimedia.org/r/635086

Change 635095 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] Enable Parsoid REST API when loading it

https://gerrit.wikimedia.org/r/635095

Change 635096 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/parsoid@master] Add setting to enable/disable REST API.

https://gerrit.wikimedia.org/r/635096

curl https://test.wikipedia.org/w/rest.php/v1/page/Main_Page/html

<blablabla>

<!-- Saved in parser cache with key testwiki:parsoid:idhash:11791-0!canonical and timestamp 20201202140952 and revision id 448939. Serialized with JSON.
 -->

Boom!

Pchelolo claimed this task.

Rolled out on all wikis. Success.