EDGE AI POD
Discover the cutting-edge world of energy-efficient machine learning, edge AI, hardware accelerators, software algorithms, and real-world use cases with this podcast feed covering all things from the world's largest EDGE AI community.
These are shows like EDGE AI Talks and EDGE AI Blueprints, as well as EDGE AI FOUNDATION event talks on a range of research, product, and business topics.
Join us to stay informed and inspired!
From Fragments to Foundation: The Sound of Progress in Edge Audio AI
What if your printer didn’t just spit out pages, but actually understood them? We walk through a hands-on look at multimodal AI on the edge—how visual-language models read layouts, extract tables, translate content, and reformat documents right where data lives, without shipping sensitive files to the cloud. It’s a practical tour from passive peripherals to active intelligence, with real workflows and measurable speedups.
We share the architecture behind on-device document intelligence: pre-processing that stabilizes inputs, VLMs that localize and reason over text and images, and post-processing that converts outputs into CSVs, charts, and accessibility-friendly layouts. You’ll hear how Qwen 2.5-VL handles complex visual inputs while maintaining strong language performance, and how a Flux-based diffusion setup enables creative generation and targeted edits—from updating dates in greeting cards to changing borders and colors by prompt. Along the way, we unpack quantization with GGUF to run 7B-class models in tight memory, diffusion sampler and scheduler tuning for latency, and NVIDIA-optimized libraries to squeeze more from modest GPUs.
Beyond demos, we dig into business and engineering realities: fine-tuning with enterprise data to reduce hallucinations, building guardrails and fallback paths for reliability, and segmenting large documents to manage VRAM. We also discuss why a companion device—AI PC or smartphone—can orchestrate heavy lifting until printer SOCs catch up, keeping data private and workflows responsive. If you care about document AI, privacy by design, or accessibility features like dynamic type and contrast, this conversation makes the path concrete and actionable.
Enjoy the deep dive? Subscribe, share with a colleague who lives in PDFs, and leave a review with the one edge use case you want us to test next.
Learn more about the EDGE AI FOUNDATION - edgeaifoundation.org
SPEAKER_00: I think you are going to speak about industry applications and deployments of GenAI at the edge. Whenever you feel ready to share your presentation, please do so; the floor is yours.
SPEAKER_01: Thank you so much, Danilo. Is my presentation visible?
SPEAKER_00: Yes, it is.
SPEAKER_01: Okay, thank you so much. My name is Anirban, and I am representing Wipro Limited today, to talk about some of the real-world deployments and industry applications that we are seeing. This is the agenda for my presentation: some background on where we come from, and what GenAI relevance we are seeing.
Slide-Sharing Hiccups Resolved
SPEAKER_02: Sorry, Anirban. I think it's not your presentation you're showing, because we just see your title slide right now. I don't know if you want to try again; I think it's the one you uploaded previously, but you're not sharing it yourself.
SPEAKER_01: Let me share that. Is it visible now?
SPEAKER_02: Let me see. No, I think it's still the other one. Or maybe now; let's see, is that the one? Then you just need to go full screen.
SPEAKER_01: Yeah, the table of contents.
SPEAKER_02: Yeah, the table of contents. Now you just need to go full screen, because we still have all your slides on the left. Then I'll let you take it over.
SPEAKER_01: I've gone to full screen.
SPEAKER_02: No, it's not; we are still seeing all your slides on the left. Are you maybe using different screens? Try going to slideshow mode.
SPEAKER_01: Okay, let me share again.
SPEAKER_02: No, it's not visible on the screen. If it doesn't work, we just have to go the other way. Try slideshow so that it shows your whole screen. Danilo, you can also see that it's not in full-screen mode, right?
SPEAKER_00: I see it too; no, it's not in full screen.
SPEAKER_02: Okay.
SPEAKER_00: Any technical difficulty, Anirban?
SPEAKER_02: If not, we'll have to go the other way. It's not really what we want, but it's okay.
SPEAKER_01: I'm sharing in full screen, but I'm not sure.
SPEAKER_00: Maybe you have multiple screens.
SPEAKER_01: You can share the presentation that I uploaded.
SPEAKER_02: Yeah, I can share that one, but that's the same.
SPEAKER_01: Actually, I don't want that. Without the video it's fine; I think we can continue.
SPEAKER_02: Okay, then we go with this one. Can you move forward from this slide?
SPEAKER_01: Yeah, you can go ahead to the next slide.
SPEAKER_02: Okay, I'll just follow you then.
Why Multimodal AI at the Edge
Printers as Smart Endpoints
Models for Document Understanding
System Pipeline on Devices
Two Flagship Models Used
Use Cases: Tables to Charts
Generating and Editing Images
Auto Classification & Accessibility
Deployment Optimizations
Remaining Challenges
Conclusions & Future Direction
SPEAKER_01: Yeah, so this is the agenda, and then some background on our team. For those who are not aware, Wipro is a leading global information technology, consulting, and software development company. We are present in more than 167 countries, and we focus on different areas including AI, cloud computing, and data analytics. I come from a background of embedded development and have been working on AI for the last five years. I would also like to acknowledge the bigger team behind this. We can go to the next slide, thank you.

To talk about the use cases: what we see are different trends in AI. There are advancements in generative AI models, with a lot of multimodal AI models coming in, and these models are driving new use cases as well as enabling improvements in older implementation methods. A lot of multimodal AI implementation happens in the cloud, but there are certain concerns with that, mainly around data privacy, where people may not be ready to share all their information with the cloud, as well as the need for a high-bandwidth network. So in some use cases it is beneficial to have the inferencing done in an offline mode, without dedicated network connectivity.

Looking at embedded devices, we saw that printers and scanners are one category where there may be many applications of multimodal AI. The printers and scanners we see today are mostly passive: they don't process the content themselves, they struggle with inconsistent formats and layouts, and these formats are all unstructured. A lot of documents come in multilingual formats, especially in an enterprise context. And with traditional rule-based approaches, and even traditional deep learning techniques, only very limited reformatting support is available. With the coming of multimodal AI, these passive endpoints can become more intelligent: they can help turn unstructured formats into more structured ones, they have the power to translate content from one language to another, and they have the special ability to understand both images and text. We can go to the next slide.

We took one use case, mostly on the document understanding side. A lot of multimodal AI models are available, even in open source. For example, we have Qwen 2.5-VL, a very popular model capable of handling text, images, and other modalities. We have specialized models like LayoutLMv3 from Microsoft, which is trained for document understanding, and other popular models like Donut, and Microsoft's TATR for table understanding. Those are the multimodal series of models; then there is another series, the diffusion models, where we have a model from Flux that is very creative in generating new content. We can go to the next slide.

This is our solution approach for how we can bring these multimodal AI models onto printers and scanners.
We have input in image or PDF formats, and we may have a user input in the form of a prompt or voice directions. Once this is acquired, we do some conversion, for example PDF to image, followed by a number of pre-processing tasks such as resolution conversion, page segmentation, noise reduction, and so on. Then we can pretty much use the visual language models directly. For optimized inference we can apply quantization, and without much supervision the VLMs are capable of interpreting the content, identifying key elements, and restructuring the layouts. They are also capable of generating output in the format we desire, and then we have some post-processing activities where we clean up and prepare the final format. The interesting part is that even without much fine-tuning of the base foundation models, we can get quite decent results.

Primarily, we have used two models in our implementation. One is the Qwen 2.5-VL model, which excels at handling complex visual inputs, including images of different sizes, while still maintaining good linguistic performance. We tried the 3-billion-parameter and 7-billion-parameter variants, and the configuration we used had 16 GB of GPU RAM. We have also used the Flux diffusion model, which has a hybrid architecture of multimodal and parallel diffusion transformer blocks and a size of around 12 billion parameters; it needs around 24 GB of VRAM. We can go to the next slide.

Here we have primarily two use cases. The first is extracting tabular data and reformatting it into graphs. We can import a PDF document which has tables and other textual content. The VLMs are quite skilled at extracting the table information, and once we have the tabular data, we can convert it into a different representation, say a pie chart or a bar chart, and then overwrite the tabular data with that representation. A lot of these creative things can be done with documents. The next use case is image generation and modification. We were thinking about a greeting-cards use case: with a textual prompt we can create a greeting card from the model, and then do incremental modification steps by prompting the model. For example, here the greeting card says "Happy New Year 2023", and with a proper text prompt we can modify the year to 2025. Some of these modifications are done very simply. In a third step, we wanted to change the outer border of the image, and we were able to do it with a simple prompt to the model. You can go to the next slide. This is another use case: the text correction which I already discussed. Coming to other use cases, what we have seen in a printer and scanner context is that we can automatically understand scanned documents and classify them into specific categories. This is real intelligence: in an enterprise setting it really helps, and it can also trigger a notification that a given document is intended for a particular group of people.
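To make the table-extraction path concrete, here is a minimal sketch, assuming the Hugging Face transformers integration of Qwen2.5-VL; the model ID, prompt, and file name are illustrative, not the exact setup used in the talk:

```python
# Minimal sketch: table extraction with Qwen2.5-VL via Hugging Face transformers.
# Assumes a transformers version with Qwen2.5-VL support (>= 4.49); all names
# here (model ID, image file, prompt) are illustrative.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# One pre-processed page image (PDF pages would be rasterized upstream).
page = Image.open("scanned_page.png")

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Extract the table on this page and output it as CSV only."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[page], return_tensors="pt").to(model.device)

# The VLM both localizes the table and emits structured text; post-processing
# downstream would validate the CSV and render it as a pie or bar chart.
out = model.generate(**inputs, max_new_tokens=512)
csv_text = processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(csv_text)
```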
Then we have the multilingual content handling ability: these models can easily translate between different languages, which is very helpful in typical enterprises. Another use case is adapting the layout for visually impaired users by increasing the font size and contrast and simplifying the design. In a scanner use case, we may have defects in the text, and those gaps can be automatically filled in by the model, with error correction done automatically. We have seen other use cases too: when printing from a web page, a lot of advertisements and other content comes in, and all of that can be automatically detected by the AI model and removed. We can go to the next slide.

Unfortunately, I think this video will not run, but basically we were showing the table extraction. This is the text correction part of it. You can go to the next slide. This is about changing the borders, and this is about changing the color of the bird: with just a simple prompt, the model is able to update the bird's color. You can go to the next slide.

These are some of the deployment aspects we considered. The aim is to deploy on the edge, on the device, as we discussed, but these models still need considerable resources to run, so first of all we took a number of optimization steps. One step was downscaling the image: during pre-processing we reduced the resolution significantly, which improves processing efficiency with some loss of accuracy. These VLMs are also very good at object localization and grounding, so tasks such as accurately locating objects within images using bounding boxes and point coordinates can be handled by the VLMs themselves; as application developers, we don't need to set up a complicated pipeline for this, we use the power built into the VLMs. In terms of model quantization, we quantized the 7-billion-parameter model using a 4-bit quantization technique, and we used the GGUF format for the model. GGUF is a more efficient and flexible way of storing and using models for inference, and it is designed to run well on consumer-grade computer hardware. For the diffusion model, we did a lot of hyperparameter optimization. There are parameters of the diffusion model, such as the sampler and scheduler, that can be tweaked for better performance by enabling finer control of the sampling process, of how the image gets generated; by fine-tuning the sampling algorithms and parameters, we could strike a balance between accuracy and performance. And last but not least, we utilized NVIDIA-optimized libraries to optimize performance on the GPU. Overall, the deployment configuration was based on an x86 processor and an NVIDIA T4 GPU with around 16 GB of memory. All the steps mentioned helped in reducing both the inference time and the memory footprint for deployment on typical printer or companion-computer SoCs.
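The 4-bit GGUF flow described above can be sketched as follows, assuming a llama.cpp toolchain and its Python bindings; the file names, the Q4_K_M scheme, and the prompt are illustrative assumptions, and vision input would additionally need a multimodal-capable build, which is omitted here:

```python
# Minimal sketch of a 4-bit GGUF deployment path, assuming llama-cpp-python.
# Conversion and quantization happen offline with llama.cpp tools, e.g.:
#   python convert_hf_to_gguf.py <hf-model-dir> --outfile model-f16.gguf
#   ./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
# File names and the Q4_K_M scheme are illustrative, not the talk's exact setup.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2.5-vl-7b-q4_k_m.gguf",  # hypothetical quantized checkpoint
    n_ctx=4096,        # context window; large documents need segmentation anyway
    n_gpu_layers=-1,   # offload all layers that fit onto the GPU
)

# Text-side inference only; image input requires a build with multimodal support.
resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize this extracted table: ..."}],
    max_tokens=256,
)
print(resp["choices"][0]["message"]["content"])
```

The 4-bit weights cut the 7B model's memory footprint to a few gigabytes, which is what makes a 16 GB T4-class GPU (or a companion-device SoC) feasible.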
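For the diffusion side, here is a minimal sketch of the sampler-step and latency tuning, assuming the diffusers FluxPipeline; the model ID, step count, and offloading choice are illustrative rather than the exact settings from the talk:

```python
# Minimal sketch of diffusion latency tuning with the diffusers FluxPipeline.
# Model ID, step count, and seed are illustrative assumptions.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # helps fit the ~12B model within ~24 GB VRAM

# Fewer sampling steps trade some fidelity for latency; guidance_scale controls
# prompt adherence. These are the kinds of knobs the speaker describes tuning.
image = pipe(
    "A greeting card that says 'Happy New Year 2025', festive border",
    num_inference_steps=20,  # lowered from the typical 28-50 for speed
    guidance_scale=3.5,
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("greeting_card.png")
```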
Our point of view is that it may still take time to deploy these kinds of models directly on the printer, but we also have companion devices to the printer, like the AI PC or the smartphone, which are themselves quite capable. So some of this can be orchestrated between both, and some things may also be done in the cloud. We can go to the next slide.

These are some of the challenges that still remain. For large-document handling, we need to do a lot of segmentation and batch processing to manage the memory load. Then, although we did not really fine-tune these base models, for optimum accuracy we would need to fine-tune per use case, and that will help us maintain performance. When it comes to embedded deployment, we have to keep in mind the thermal and power constraints, using efficient scheduling and hardware acceleration to minimize power consumption. And last but not least, the compute requirements are still high, but as SoCs develop and we get better optimization methods, I think deployment on lower-cost target devices will become a reality. We can go to the next slide.

These are some conclusions and future directions. Multimodal AI models are a transformative advancement, especially for document reformatting tasks in printers. By deploying such models directly on the device or on a companion device, we can offer customers much better and smarter printing and scanning solutions. Multiple innovative use cases can be accomplished with these models, and this approach sets the stage for a new era of intelligent edge printing, where both content understanding and reformatting happen seamlessly at or near the point of output. With this, I think we come to the end of the presentation, and I would be happy to take some questions.
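As one illustration of the large-document handling strategy mentioned above (segmenting pages and batching them to bound memory), here is a minimal sketch, assuming pypdfium2 for rasterization; the batch size and render scale are illustrative tuning knobs:

```python
# Minimal sketch: rasterize a PDF page by page and process pages in small
# batches so peak VRAM scales with batch_size, not with document length.
# Assumes pypdfium2; batch_size and scale are illustrative knobs.
import pypdfium2 as pdfium

def iter_page_batches(pdf_path: str, batch_size: int = 2, scale: float = 2.0):
    """Yield lists of PIL images, at most batch_size pages at a time."""
    pdf = pdfium.PdfDocument(pdf_path)
    batch = []
    for page in pdf:
        batch.append(page.render(scale=scale).to_pil())  # rasterize one page
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # trailing partial batch

# Each batch is fed to the VLM separately and the per-page outputs merged
# afterwards, keeping memory bounded regardless of document size.
for images in iter_page_batches("large_report.pdf"):
    ...  # run the VLM pipeline on `images`
```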
Q&A: Demos and Development Speed
SPEAKER_00: Thanks, Anirban. Your presentation was smooth, though we missed your videos; maybe you could find a way to share links to them later on so that we can enjoy your demonstrations. Can you quickly comment on what the demos were about?
SPEAKER_01: Sorry, can you repeat your question?
SPEAKER_00: What were the demos, the videos, about?
SPEAKER_01: Correct, the demos were about some very simple examples; for instance, from a document we can do a table extraction. In earlier times we would have had to use a very specific template-based approach for that extraction, or train a deep learning model using a lot of data. But with these visual language models, the advantage is that much of the knowledge for these kinds of tasks is already built into the foundation models. They understand both the image context and the prompt the user gives, so they can relate the two. When I say, "extract this table and output it in CSV format," the model itself has the capability to do it. This really shortens the development time for application developers, and I think that is one of the key areas to look forward to.
SPEAKER_00: Thanks, Anirban.
Q&A: Business Viability & Guardrails
SPEAKER_00: A curiosity from a business perspective as well: how do you think about using a foundation model in a solution that you sell on the market? After all, you adopt the foundation model, but someone else prepared it and trained it with lots of data, and you are trying to exploit it, for example, for the purposes you spoke about. How do you manage a solution which depends on a foundation model? Is it something you consider viable and reliable, or do you see any concerns about that?
SPEAKER_01: Yeah, so I think people do have concerns about foundation models, about what kinds of datasets they have been trained on, and so on. I think we should look at it use case by use case. The power of AI comes from the data, and just using the foundation models may not be appropriate for a commercial solution; we really need to fine-tune with whatever contextual data belongs to the enterprise. Ultimately the model may still hallucinate at the end of the day, so we need to build in a lot of guardrails, so that if it does, there is an exit path and it falls back to something more logical. A lot of the work with AI will be in how you write the prompts, how you experiment, how you observe what the model is doing, and how you build checks on top of it. It will still be a kind of fuzzy system, meaning you can never guarantee it will repeat the same response every time, but it still gives a benefit in certain use cases. That is the balance we need to draw: where it can be applied and where it cannot.
AI for Accessibility & Impact
SPEAKER_00: And I like that you mentioned this technology can help, for example, visually impaired people. You made me think that this great technology can not only help create new services, products, and applications, but also serve the needs of people with disabilities, who can really enjoy the capability of GenAI in providing machine interaction at the next level. So thanks for underlining this possibility as well.