  • "Ray" is male
  • "Ray" started this thread

Posts: 1,083

Date of registration: May 23rd 2011

Language Team: Global

Focus Group: LTI Administration Group

Location: Michigan, US

Thanks: 41429 / 6380

1

Friday, July 19th 2013, 3:00am

Fully searchable and indexed database of all materials across all languages

Search field should be able to handle terms & properly parse “phrases”
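As a rough illustration of the terms-versus-phrases requirement, here is a minimal sketch of splitting a search string into bare terms and quoted phrases. The function name `parse_query` is hypothetical, and accepting both straight and curly quotes is an assumption, not something decided in this thread.

```python
import re

# Matches a phrase in straight quotes, a phrase in curly quotes, or a bare term.
_TOKEN = re.compile(r'"([^"]+)"|“([^”]+)”|(\S+)')


def parse_query(query):
    """Split a search string into (terms, phrases)."""
    terms, phrases = [], []
    for straight, curly, bare in _TOKEN.findall(query):
        phrase = straight or curly
        if phrase:
            phrases.append(phrase.strip())
        elif bare:
            terms.append(bare)
    return terms, phrases


if __name__ == "__main__":
    print(parse_query('Fresco "resource based economy" Venus'))
```

An unbalanced quote is simply treated as part of a bare term here; a real search box might want to warn the user instead.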

Input accepts SRT, text, html, and perhaps other formats that can be parsed & represented internally
Unicode/UTF-8 conversion upon upload, as needed
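A minimal sketch of what the upload step might look like for SRT input: decode the bytes to Unicode, then turn the file into a list of timed cues. The names `to_text` and `parse_srt`, the cue dictionary layout, and the cp1252 fallback are all assumptions for illustration only; a real importer would need proper encoding detection per language.

```python
import re


def to_text(raw):
    """Decode uploaded bytes: try UTF-8 (tolerating a BOM), then fall back.

    The cp1252 fallback is an illustrative assumption, not a recommendation.
    """
    try:
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        return raw.decode("cp1252", errors="replace")


# An SRT timing line looks like: 00:00:01,000 --> 00:00:03,500
_TIMING = re.compile(
    r"(\d{2}:\d{2}:\d{2}[,.]\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2}[,.]\d{3})"
)


def parse_srt(text):
    """Return a list of cues, each a dict with "start", "end" and "text"."""
    cues = []
    # Cues are separated by blank lines.
    blocks = re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip())
    for block in blocks:
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = _TIMING.search(line)
            if match:
                cues.append({
                    "start": match.group(1),
                    "end": match.group(2),
                    # Everything after the timing line is the subtitle text.
                    "text": " ".join(l.strip() for l in lines[i + 1:]),
                })
                break
    return cues
```

Keeping the start time on every cue is what later makes it possible to link a search hit back to a moment in the video.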

Output provides Title of material, linked to specific passage or time-stamped video, as well as overall (non-located) material access.
Output ties into the Glossary described below, so that terms & phrases found in the Glossary are underscored with a dashed line and produce their Glossary definition in a tooltip when hovered over with the cursor or otherwise selected (e.g. touchpad, touchscreen, clicked, etc.). Tooltip should disappear when no longer hovered or if any other area of the screen is selected/activated.
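To make the Glossary tie-in a little more concrete, here is a sketch that wraps known terms in a `<span>` whose `title` attribute holds the definition, so the browser shows it as a tooltip on hover. The function name `annotate`, the `glossary-term` class and the sample entry are hypothetical; the dashed underline would come from a stylesheet rule such as `border-bottom: 1px dashed` on that class. Touch and click behaviour would need script on top of this.

```python
import html
import re


def annotate(text, glossary):
    """Return HTML for `text` with glossary terms wrapped in tooltip spans.

    `glossary` maps term -> definition. Longer terms are tried first so a
    phrase wins over a shorter term it contains. Matching is case-insensitive.
    """
    if not glossary:
        return html.escape(text)
    lookup = {term.lower(): definition for term, definition in glossary.items()}
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
    )
    out, pos = [], 0
    for match in pattern.finditer(text):
        # Escape the plain text between matches so it is safe to embed.
        out.append(html.escape(text[pos:match.start()]))
        definition = lookup[match.group(1).lower()]
        out.append(
            '<span class="glossary-term" title="%s">%s</span>'
            % (html.escape(definition, quote=True), html.escape(match.group(1)))
        )
        pos = match.end()
    out.append(html.escape(text[pos:]))
    return "".join(out)


if __name__ == "__main__":
    # Placeholder entry, not taken from the real LTI Glossary.
    print(annotate("An RBE is possible", {"RBE": "example definition"}))
```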

Note: Output should not include the term's or phrase's actual Glossary listing. The LTI Glossary is a reference guide for translators, proofreaders & supporters developed from RBE materials, rather than a separate RBE material itself.
Could/should be expanded to provide catalog lookup by subject (e.g. Transportation, Education, Bio-engineering, Cybernetics, Artificial Intelligence, Biology, Brain, etc.) to facilitate further learning of a given topic. This could be linked to the member portal’s Weblinks feature or an advanced development of that.
Perhaps even displaying related materials by order of their comprehensiveness and/or best order of study for learning (i.e. start with this, then move onto that, etc.)
Signature from »Ray« Earth For Sale:
Slightly Used; inquire within


  • "Ray" is male
  • "Ray" started this thread

2

Tuesday, April 1st 2014, 5:27am

Re: Fully searchable, Index database of all materials across all languages

As the overall RBE community has grown, it has become increasingly apparent that the needs & integrations for this project have changed: rather than being a separate app, it should be integrated within our custom PMS (currently in alpha). The PMS is designed to track the movement of materials through transcription, proofreading, translation & final review before release (i.e. who is doing what, and how far along each project is across all languages).

The PMS is also undergoing a major expansion of scope, as we are now looking to have it take on the handling of automation integration across as many of the global RBE resources as we can fit into it. I will soon create a new thread describing this much-needed integration and pulling together all related development projects & support info toward that end goal.
Signature from »Ray« Earth For Sale:
Slightly Used; inquire within


Posts: 14

Date of registration: Nov 22nd 2014

Language Team: Global

Focus Group: LTI Development Group

Location: UK

Thanks: 149 / 4

3

Saturday, November 22nd 2014, 6:48pm

I've been investigating the use of Elasticsearch for this purpose. I took it on a while ago but my time has been dominated by other stuff so not much happened other than proving the concept.

Anyway, a progress update. I've finally managed to write something (in Perl, because that's what I'm used to lately as a result of my day job). The script reads the tables of official videos from http://wiki.linguisticteam.org/w/Video_Repository and imports their metadata and their English subtitles (focused on English for the time being) into a virtual machine (Ubuntu, 1 GB RAM, 1 CPU) running Elasticsearch (currently on my laptop only). There is still work to do to clean up some of the scraped content, amongst other bigger todos, but I wanted to check the search functionality on a database containing more than 2 videos and their subtitles (which is all the proof of concept consisted of!), so it'll do for now. The subtitles are stored as attachments. A preliminary search on keywords such as RBE, creativity, behaviour, humanity, Venus, Fresco and Zeitgeist (using a JSON-aware front end for Elasticsearch - http://sense.qbox.io/gist) yields promising results from the title, description and file contents (along with the timestamp at which the word is found).

An index (equivalent to a database) containing data for 158 videos is taking up 15.3MB of disk space. This hasn't been tested for performance or optimised but "does the job" for a prototype!


Running Work List
2014-11-22: Targets for next time:
  • *DONE* Tidy up import script and put on GitHub
  • *DONE* Document how to set up prototype
  • How to write phrase queries
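For reference, a phrase search in the Elasticsearch query DSL is normally written with `match_phrase`, and a `highlight` section asks for the matching fragments back. The sketch below only builds the JSON body that would be sent to the `_search` endpoint; the field name `subtitles` and the helper name `phrase_query` are assumptions, since the prototype stores subtitles as attachments and its real field names are not given here.

```python
import json


def phrase_query(phrase, field="subtitles", size=10):
    """Build a search body matching `phrase` as an exact word sequence.

    `field` is a hypothetical field name; substitute the one the index uses.
    """
    return {
        "size": size,
        "query": {"match_phrase": {field: phrase}},
        # Ask Elasticsearch to return the matching fragments of that field.
        "highlight": {"fields": {field: {}}},
    }


if __name__ == "__main__":
    print(json.dumps(phrase_query("resource based economy"), indent=2))
```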

2014-11-30: Targets for next time:
  • *DONE* How to write phrase queries (carried over from last time due to other tasks taking longer than expected)
  • *DONE* Transfer repository to linguisticteam organisation (permissions required)
  • *DONE* Check dupes aren't added
  • *DONE* Where no original link exists, fall back to the English language link or, as a last resort, the first language in the list.
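The link fallback rule in the last item can be sketched as follows. This is not the actual import script (which is in Perl); the function name `pick_link` and the shape of the input, an ordered list of (language, url) pairs, are assumptions made for the example.

```python
def pick_link(original, links):
    """Choose which video link to store.

    original: URL of the original video, or None/"" if the table has none.
    links:    ordered list of (language, url) pairs as listed for the video.
    """
    if original:
        return original
    # First fallback: the English language link.
    for language, url in links:
        if language == "English" and url:
            return url
    # Last resort: the first language in the list that has a link.
    for _language, url in links:
        if url:
            return url
    return None
```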

2014-12-07: Targets for next time:
  • *DONE* Fix the warning "Parsing of undecoded UTF-8 will give garbage when decoding entities at ...perl/vendor/lib/HTML/PullParser.pm..."
  • *DONE* Prototype how to generate time coded link to video at search term(s)
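One way the time-coded link could work, sketched in Python rather than the Perl of the actual script: convert the SRT timestamp of the matching subtitle to whole seconds and append it to the video URL. The YouTube-style `t=` parameter is an assumption about where the videos are hosted, and `VIDEO_ID` and both function names are placeholders.

```python
import re


def srt_to_seconds(stamp):
    """Convert an SRT timestamp such as 00:01:23,450 to whole seconds."""
    match = re.fullmatch(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})", stamp)
    if match is None:
        raise ValueError("not an SRT timestamp: %r" % stamp)
    hours, minutes, seconds, _millis = (int(g) for g in match.groups())
    # Milliseconds are dropped so playback starts just before the phrase.
    return hours * 3600 + minutes * 60 + seconds


def timecoded_link(video_id, stamp):
    """Build a link that starts playback at the given subtitle time."""
    return "https://www.youtube.com/watch?v=%s&t=%ds" % (
        video_id, srt_to_seconds(stamp))


if __name__ == "__main__":
    print(timecoded_link("VIDEO_ID", "00:01:23,450"))
```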

2014-12-28: Targets for next time:
  • *DONE* Tidy up search script and put on GitHub
  • *DONE* Learn how to use GitHub enough for tracking code changes
  • *DONE* More testing, checking, tidying up

2017-05-28:


Next steps (on hold for now)
  • Fix "Parsing of undecoded UTF-8 will give garbage when decoding entities..." warning
  • Try importing subtitles for a different language
  • Generate time coded link to video at search term(s)
  • Show results from terms entered in a search text box
  • Add support for multiple languages (will need to be broken up language by language)
  • Evaluate other subtitle formats (currently using SubRip srt with extended highlighting results to get full time codes)

This post has been edited 3 times, last edit by "jyomaj" (Jun 3rd 2017, 7:22pm)



4

Saturday, June 3rd 2017, 7:23pm

Recap on requirements from Ray ~06/11/2016

The intention is to provide the entire world with a single place to look up a variety of things, and get the results back in a variety of formats. For example:

In what materials can I find the following keywords? Give them back to me with the timestamps or page numbers included.

How far along is the Arabic translation of [any project]? And tell me how I can join the effort.

How many people are currently part of the Serbian Team? And tell me how to join.

Where can I find the official public distribution of [any project]? And tell me if there is one specific to my language.

Does the [any language] Team have an active Twitter (or Facebook, etc.) account? Take me there so I can subscribe to it.

etc., etc., etc.

Consider that every team will have its own distribution and announcement channels (YouTube, Facebook page, Twitter account, etc.) along with their own set of team resources (team glossary, progress report spreadsheet, etc.), so these kinds of questions will be drawing on our record of their locations, as well as how the team relates to each individual project (tying into a master PMS that tracks the progress of each project across all teams).
Many of these resources already exist to some degree for some of the teams, but the system will obviously require consistency across the teams for such an approach to work smoothly.
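To show what a consistent record of team resources might look like, here is a very small sketch of a per-team registry and a lookup for questions like "does this team have a Twitter account?". Everything in it is hypothetical: the `Team` fields, the function name `find_channel`, and the example.org URLs are placeholders, not real team data.

```python
from dataclasses import dataclass, field


@dataclass
class Team:
    language: str
    join_url: str
    # Channel name (lowercase) -> URL, e.g. {"twitter": "..."}.
    channels: dict = field(default_factory=dict)


def find_channel(teams, language, channel):
    """Return the URL of a team's channel, or None if nothing is on record."""
    team = teams.get(language)
    if team is None:
        return None
    return team.channels.get(channel.lower())


if __name__ == "__main__":
    # Placeholder data only.
    teams = {
        "Serbian": Team(
            language="Serbian",
            join_url="https://example.org/join/serbian",
            channels={"twitter": "https://example.org/serbian-twitter"},
        ),
    }
    print(find_channel(teams, "Serbian", "Twitter"))
```

The same record could carry the team glossary and progress report locations, which is where the need for consistency across teams shows up.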

Current Status

Search prototype has been upgraded to use Ubuntu 17.04 and Elasticsearch 5.4.

An index (equivalent to a database) containing data for 304 videos takes up 40.3 MB. This hasn't been tested for performance or optimised but it does the job for a prototype.
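For a rough sense of scale, the two reported index sizes (158 videos in 15.3 MB earlier in this thread, 304 videos in 40.3 MB now) work out to roughly 100 KB and 135 KB of index per video. The sketch below does that arithmetic, assuming 1 MB = 1024 KB. The two figures come from different Elasticsearch versions and setups, so they are not a like-for-like comparison.

```python
def kb_per_video(index_mb, videos):
    """Average index size per video in KB, assuming 1 MB = 1024 KB."""
    return index_mb * 1024 / videos


if __name__ == "__main__":
    for label, index_mb, videos in [
        ("158 videos, 15.3 MB", 15.3, 158),
        ("304 videos, 40.3 MB", 40.3, 304),
    ]:
        print("%s: %.1f KB per video" % (label, kb_per_video(index_mb, videos)))
```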


© Linguistic Team International 2018
Context In Motion