2020_CSE_03 BLOGS

Welcome to our blog series, where we share our journey in building a Chrome browser extension that can read, compose, and organize a user's Gmail inbox, all controlled through voice. We will update our findings and progress every two weeks through this series.

So what exactly are we working on?

The plan is simple: over the course of the next few months, we aim to develop a Chrome extension that acts as a voice assistant for your Gmail account. Our main goal is to make it so user-friendly that even visually impaired users can access their Gmail accounts without any hassle.

Why?

As we all witness the importance of technology and its services during the pandemic, we realized that many of the services available today rely mainly on an email client to communicate important information, so many visually impaired users face a challenging time accessing these technologies in their day-to-day lives. In an attempt to make the internet a more accessible place, we came up with EVA, an Email Voice Assistant (hence the cute abbreviation), which we hope can contribute to solving this accessibility issue.


How are we planning on doing this?

The extension will use speech-to-text, text-to-speech, and voice recognition modules to process the user's voice input.


Speech to Text Module:

           Speech-to-text conversion is the process of converting spoken words into textual form. It relies on an acoustic model and a language model, which together help synthesize the vocal commands and extract textual information.

Text to Speech Module:

            Text-to-speech is a speech synthesis technique used to create a spoken version of the text in a computer document. It can enable the reading of on-screen information for visually challenged people.
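As a minimal sketch, the browser's built-in SpeechSynthesis API can do this reading step. The `chunkText` helper below is our own assumption, added because some TTS engines struggle with very long utterances; the `speechSynthesis` and `SpeechSynthesisUtterance` calls are the real Web Speech API:

```javascript
// Split long text into sentence-sized chunks, since some TTS engines
// truncate or fail on very long single utterances (our own workaround).
function chunkText(text, maxLen = 200) {
  const sentences = text.match(/[^.!?]+[.!?]*/g) || [];
  const chunks = [];
  let current = "";
  for (const s of sentences) {
    if ((current + s).length > maxLen && current) {
      chunks.push(current.trim());
      current = "";
    }
    current += s;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}

// Browser-only: queue each chunk; the synthesizer speaks them in order.
function speak(text) {
  for (const part of chunkText(text)) {
    const u = new SpeechSynthesisUtterance(part);
    u.lang = "en-US";
    window.speechSynthesis.speak(u);
  }
}

// Usage (in a browser): speak("You have three unread emails.");
```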

Voice Recognition Modules:

            Voice recognition is a technique that facilitates a natural and convenient human-machine interface. It extracts and analyses a human's voice, delivered to the computer through the microphone.
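In Chrome, both capturing and recognizing speech come through the Web Speech API. Here is a rough sketch: the `webkitSpeechRecognition` constructor is what Chrome actually exposes, while `parseCommand` and its intent names ("open", "read", "compose") are our own illustrative assumptions about EVA's commands:

```javascript
// Tiny intent parser: maps a recognized transcript to a command object.
// The command vocabulary here is a placeholder, not EVA's final grammar.
function parseCommand(transcript) {
  const text = transcript.trim().toLowerCase();
  if (text.startsWith("open")) return { intent: "open", target: text.slice(4).trim() };
  if (text.startsWith("read")) return { intent: "read" };
  if (text.startsWith("compose")) return { intent: "compose" };
  return { intent: "unknown" };
}

// Browser-only wiring, guarded so the file also loads outside Chrome.
if (typeof window !== "undefined" && "webkitSpeechRecognition" in window) {
  const recognition = new webkitSpeechRecognition();
  recognition.continuous = false;     // stop after one utterance
  recognition.interimResults = false; // deliver only final results
  recognition.onresult = (event) => {
    const transcript = event.results[0][0].transcript;
    console.log(parseCommand(transcript));
  };
  recognition.start();
}
```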

But how are we going to make it an interactive assistant? Keep following this series and find out with us.

Let’s get some terms laid out before going ahead:

Extensions: Extensions are small software programs that customize the browsing experience.

They enable users to tailor Chrome functionality and behavior to individual needs or preferences. Some of the most popular extensions are uBlock Origin, Facebook Container, Honey, and HTTPS Everywhere.

Chrome APIs: The Chrome web browser provides APIs for developers to build extensions which can, for example, run a timer (like a Pomodoro clock), block ads, etc.
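For a concrete picture, this is a minimal Manifest V3 sketch of the kind of manifest.json an extension like EVA might declare; the name, file names, and permission list are placeholder assumptions, not our actual manifest:

```json
{
  "manifest_version": 3,
  "name": "EVA (sketch)",
  "version": "0.1",
  "permissions": ["storage", "tabs"],
  "background": { "service_worker": "background.js" },
  "action": { "default_popup": "popup.html" }
}
```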

Speech Synthesis: Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech computer or speech synthesizer, and can be implemented in software or hardware products.

……

First week:

     We created basic Chrome extensions and enhanced our knowledge of JavaScript. We referred to various YouTube videos and a few Udemy courses to build a basic extension. You can refer to this video to get the gist of it.

Second week:

We tried pair programming as follows:

   Vishal and Supreetha: We created a speech-to-text extension which takes speech input from the user, stores it in Chrome storage, and retrieves it. Along the way, we learnt about the intricacies of Chrome extensions, such as message passing, background.js, content.js, and Chrome storage, and learnt a bit of web scraping to scrape titles and paragraphs from a webpage.
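The message-passing and storage flow described above can be sketched roughly as follows. The `chrome.*` calls are the real extension APIs; `routeMessage` and the `"TRANSCRIPT"` message type are our own illustrative assumptions, factored out so the handler logic can be exercised outside the browser:

```javascript
// Pure routing helper: decide what to do with an incoming message.
function routeMessage(msg, store) {
  if (msg && msg.type === "TRANSCRIPT") {
    store.lastTranscript = msg.text;
    return { ok: true };
  }
  return { ok: false };
}

// Extension wiring; chrome.* only exists inside the browser, so guard it.
if (typeof chrome !== "undefined" && chrome.runtime) {
  // background.js: receive transcripts from content.js and persist them.
  chrome.runtime.onMessage.addListener((msg, sender, sendResponse) => {
    const result = routeMessage(msg, {});
    if (result.ok) {
      chrome.storage.local.set({ lastTranscript: msg.text }, () =>
        sendResponse(result)
      );
      return true; // keep the message channel open for the async response
    }
  });

  // content.js side would send the recognized text like this:
  // chrome.runtime.sendMessage({ type: "TRANSCRIPT", text: "open inbox" });
}
```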

To understand the working of extensions (specifically on Chrome), it's always better to refer to the official documentation here.

Shreyas and Vybhavi: Read through the working of the Gmail APIs, the relation of changes in the domain name of email addresses, and differences in the speech synthesis implementation in Chrome on Windows and Linux.

Third week: (Till 26/04/2021)

Vishal and Supreetha: Opening a tab through voice. Initially we tried to use Chrome storage to store the user's voice input and display it in popup.html; later we figured out that this can be done through message passing.
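Opening a tab from a recognized phrase can be sketched like this; `chrome.tabs.create` is the real extension API, while `urlForPhrase` and its matching rule are our own illustrative assumptions:

```javascript
// Map a recognized phrase to a URL to open; null means no match.
// The phrase-to-URL rule here is a placeholder for a fuller grammar.
function urlForPhrase(phrase) {
  if (/\b(gmail|mail)\b/i.test(phrase)) return "https://mail.google.com/";
  return null;
}

// Browser-only: open the tab (e.g. from the speech recognition handler).
if (typeof chrome !== "undefined" && chrome.tabs) {
  const url = urlForPhrase("open gmail");
  if (url) chrome.tabs.create({ url });
}
```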

Shreyas and Vybhavi: Researched OCR and screen readers. While looking for a proper text-to-speech implementation example to refer to, we accidentally found an implementation difference in the Chrome browser's TTS API between Linux and Windows. Ideally, when you enter text into the speech synthesizer, it should speak the contents of the text box. But if you enter an HTML tag, Chrome on Windows identifies the tag syntax and omits it from the output speech, while on Linux the tags are not identified, so the output speech includes them.

You can test it out on this site: while entering text input, include an HTML tag like "<br>" and observe the difference between Windows and Linux.

So far, we can open the Gmail tab through voice.

 

Fourth week: (2/05/2021)

 We started using the Gmail APIs to access a user's inbox data. With this, we could display the data in a simple HTML file, and we used SpeechSynthesis to read out the "From", "Subject", "Time", and textual body content of the first email in the user's inbox.

We referred to Gmail's official quickstart guide and this website to retrieve data. 
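Reading the first message roughly follows the quickstart's pattern. This sketch assumes the `gapi` client is already loaded and authorized as in Gmail's quickstart; `headerValue` is our own small helper for pulling a named header out of the message payload:

```javascript
// Find a header (e.g. "From", "Subject") in a Gmail message payload.
// Gmail returns headers as an array of { name, value } objects.
function headerValue(headers, name) {
  const h = (headers || []).find(
    (x) => x.name.toLowerCase() === name.toLowerCase()
  );
  return h ? h.value : "";
}

// Browser-only: list the newest message, fetch it, and speak its headers.
// Assumes gapi.client.gmail is initialized and the user is signed in.
function readFirstEmail() {
  gapi.client.gmail.users.messages
    .list({ userId: "me", maxResults: 1 })
    .then((res) => {
      const id = res.result.messages[0].id;
      return gapi.client.gmail.users.messages.get({ userId: "me", id });
    })
    .then((res) => {
      const headers = res.result.payload.headers;
      const summary =
        "From " + headerValue(headers, "From") +
        ". Subject: " + headerValue(headers, "Subject");
      window.speechSynthesis.speak(new SpeechSynthesisUtterance(summary));
    });
}
```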

Some of the challenges we are currently facing:

  • Earlier, the speech synthesis modules were working as required, but now there seems to be a problem with their functioning: they do not work as they used to, and they stop working if we reload a tab.
  • In the quickstart guide in Gmail's official documentation, a simple HTTP server is used for the APIs to function; we are not quite sure what the reason behind this is.
  • Understanding web scraping through JS has been quite a pain in the development environment because of its complex nature.
Spoilers for next week: We have been looking into artyom.js for better speech recognition and TTS handling; hopefully it has a better working mechanism.


If you liked our journey so far, do stick around, and comment below if you have any suggestions. Catch you in the next blog update.

Comments

  1. Try to be gender-neutral as much as possible (instead of "his/her", maybe you can use "user", etc.).
    EVA can also be used by lazy users (in addition to visually challenged people) who would like to avoid using the keyboard, e.g. while driving in a car, etc.

    Lay out the plans for future and what one can expect.

  2. Don't edit the existing blog for subsequent blogs. The same link should work, but it should display the blog contents date/month-wise.

  3. The web server (python -m SimpleHTTPServer 8000) is required to load the web page index.html which contains the sample JavaScript. The JavaScript contains some onload functions which are executed when the page is loaded. So, if your JavaScript is part of some webpage which you are accessing from any web server, then you don't need the local web server.

