Computer Vision

Image recognition, video understanding, multimodal AI

223 articles across 73 editions

Articles

  1. Meta's AI smart glasses and data privacy concerns -- 2026-03-03
  2. [Editorial] AI Search Index -- 2026-03-03
  3. Qwen 3.5 vs Gemini 3 Pro on Screenshot-to-Code: Is the gap finally gone? -- 2026-02-19
  4. [Editorial] The evolution of vision models from CNNs to transformers -- 2026-02-19
  5. Show HN: Offline tiles and routing and geocoding in one Docker Compose stack -- 2026-01-05
  6. ostris/Z-Image-De-Turbo -- 2025-12-12
  7. mistralai/Ministral-3-8B-Instruct-2512 -- 2025-12-12
  8. Making Glasses That Detect Smartglasses -- 2025-12-11
  9. ByteDance-Seed/Depth-Anything-3 -- 2025-12-10
  10. SARLO-80: Worldwide Slant SAR Language Optic Dataset at 80 cm Resolution -- 2025-12-10
  11. Optical Context Compression Is Just (Bad) Autoencoding -- 2025-12-10
  12. seominseok0429/Upsample-Anything-A-Simple-and-Hard-to-Beat-Baseline-for-Feature-Upsampling -- 2025-12-09
  13. Vision-and-Language Navigation with Analogical Textual Descriptions in LLMs -- 2025-12-09
  14. lrzjason/QwenEdit-Anything2Real_Alpha -- 2025-12-08
  15. How Big is Your Video Again? Square vs Rectangular Pixels -- 2025-12-08
  16. ByteDance/BindWeave -- 2025-12-08
  17. 3D Gaussian and Diffusion-Based Gaze Redirection -- 2025-12-04
  18. apple/starflow -- 2025-12-04
  19. princepainter/ComfyUI-PainterLongVideo -- 2025-12-04
  20. OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing -- 2025-12-03
  21. Introducing GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization | "GeoVista is a new 7B open-source agentic model that achieves SOTA performance in geolocalization by integrating visual tools and web search into an RL loop." -- 2025-11-28
  22. facebook/sam-3d-body-dinov3 -- 2025-11-28
  23. Diffusers welcomes FLUX-2 -- 2025-11-26
  24. marinero4972/Open-o3-Video -- 2025-11-17
  25. nanonets/Nanonets-OCR2-3B -- 2025-11-17
  26. nvidia/ChronoEdit-14B-Diffusers-Upscaler-Lora -- 2025-11-17
  27. We ran over 600 image generations to compare AI image models -- 2025-11-13
  28. dx8152/Qwen-Image-Edit-2509-Relight -- 2025-11-13
  29. Last week in Multimodal AI - Local Edition -- 2025-11-12
  30. DeepSeek-OCR GGUF model runs great locally - simple and fast -- 2025-11-12
  31. Qwen3-VL works really good with Zoom-in Tool -- 2025-11-12
  32. lightonai/LightOnOCR-1B-1025 -- 2025-11-12
  33. Qwen/Qwen3-VL-2B-Thinking -- 2025-11-12
  34. Why does Image Recognition work in llama-server but not through Open WebUI? -- 2025-11-06
  35. Has anyone tested ollama on Whisplay HAT with Raspberry pi zero 2W? -- 2025-11-06
  36. allenai/olmOCR-2-7B-1025 -- 2025-11-06
  37. is there simple way like .bat to compress to q4-q8 like Unsloth, Qwen3-VL-30B-A3B-Thinking-abliterated model -- 2025-11-06
  38. meituan-longcat/LongCat-Video -- 2025-11-05
  39. allenai/olmOCR-2-7B-1025-FP8 -- 2025-11-05
  40. Worse Embedding Performance with Qwen 3 VL than with Qwen 2.5 VL? -- 2025-11-05
  41. Retrospective Sparse Attention for Efficient Long-Context Generation -- 2025-11-05
  42. deepseek-ai/DeepSeek-OCR -- 2025-11-04
  43. LiquidAI/LFM2-VL-3B -- 2025-11-04
  44. Qwen/Qwen3-VL-235B-A22B-Thinking -- 2025-11-04
  45. KTransformers Open Source New Era: Local Fine-tuning of Kimi K2 and DeepSeek V3 -- 2025-11-04
  46. F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data -- 2025-11-04
  47. [Editorial] https://blog.peerllm.com/2025/11/02/announcing-v0.7.6.html -- 2025-11-04
  48. Myths Programmers Believe about CPU Caches -- 2025-11-04
  49. Pi Zero Powers A Little Indoor Rover -- 2025-11-04
  50. [Editorial] https://commsrisk.com/sms-blaster-and-imsi-catcher-news-from-lebanon-cambodia-switzerland-and-the-philippines/ -- 2025-11-03
  51. An Obscure Military Program Helps Local Cops Buy Armored Cars and Spyware -- 2025-11-03
  52. DeepSeek may have found a new way to improve AI’s ability to remember -- 2025-11-02
  53. Qwen/Qwen3-VL-8B-Thinking -- 2025-11-02
  54. nvidia/omnivinci -- 2025-11-02
  55. OpenImagingLab/FlashVSR -- 2025-11-02
  56. Build Your Own Force-Feedback Joystick -- 2025-11-02
  57. DarkBitx/ICRev -- 2025-11-01
  58. dd1100/DiscordRAT -- 2025-11-01
  59. Police used Flock cameras to accuse a woman of theft, she had to prove innocence -- 2025-11-01
  60. ZOZO's Contact Solver for physics-based simulations -- 2025-11-01
  61. valiantcat/Qwen-Image-Edit-MeiTu -- 2025-11-01
  62. ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing -- 2025-11-01
  63. DeepSeek just released a bombshell AI model (DeepSeek AI) so profound it may be as important as the initial release of ChatGPT-3.5/4 ------ Robots can see-------- And nobody is talking about it -- And it's Open Source - If you take this new OCR Compression + Graphicacy = Dual-Graphicacy 2.5x improve -- 2025-10-27
  64. Pico Banana: Large-Scale Dataset for Image Editing by Apple -- 2025-10-27
  65. What's the best embedding model for document images? -- 2025-10-26
  66. AlphaXiv, Compare the Deepseek-OCR and Mistral-OCR OCR models -- 2025-10-26
  67. Open-Bee/Bee-8B-RL -- 2025-10-26
  68. datalab-to/chandra -- 2025-10-26
  69. Unlock the power of images with AI Sheets -- 2025-10-26
  70. dvlab-research/DreamOmni2 -- 2025-10-25
  71. bytetriper/RAE -- 2025-10-25
  72. tencent/POINTS-Reader -- 2025-10-25
  73. Stitch: Training-Free Position Control in Multimodal Diffusion Transformers -- 2025-10-25
  74. mit-han-lab/streaming-vlm -- 2025-10-22
  75. Doby-Xu/WithAnyone -- 2025-10-22
  76. lightx2v/Wan2.2-I2V-A14B-Moe-Distill-Lightx2v -- 2025-10-22
  77. tencent-ailab/SongPrep -- 2025-10-20
  78. opendatalab/MinerU2.5-2509-1.2B -- 2025-10-20
  79. QuantStack/Qwen-Image-Edit-2509-GGUF -- 2025-10-20
  80. LM Studio and VL models -- 2025-10-19
  81. Qwen/Qwen-Image-Edit-2509 -- 2025-10-19
  82. Alpha-VLLM/Lumina-DiMOO -- 2025-10-19
  83. Paper2Video — turn a research paper into a full presentation video (slides, speech, talking head) -- 2025-10-15
  84. Practical OCR with Nanonets OCR2‑3B -- 2025-10-15
  85. neuphonic/neutts-air -- 2025-10-15
  86. Qwen/Qwen3-VL-235B-A22B-Instruct -- 2025-10-15
  87. XiaomiMiMo/MiMo-Audio-Eval -- 2025-10-15
  88. Very interesting! OmniInsert — mask-free video insertion of any reference -- 2025-10-14
  89. facebookresearch/DepthLM_Official -- 2025-10-14
  90. NVlabs/rcm -- 2025-10-11
  91. Divining Air Quality With A Cheap Computer Vision Device -- 2025-10-09
  92. pixai-labs/pixai-tagger-v0.9 -- 2025-10-09
  93. TianDongL/Diffusion_pipe_in_ComfyUI -- 2025-10-09
  94. Project running VLMs on a Pi 5 and NV Jetson Orin Nano -- 2025-10-05
  95. Demo: I made an open-source version of Imagine by Claude (released yesterday) -- 2025-10-05
  96. nunchaku-tech/nunchaku-qwen-image-edit-2509 -- 2025-10-05
  97. cvlab-kaist/VIRAL -- 2025-10-05
  98. deepseek-ai/DeepSeek-V3.2-Exp -- 2025-10-03
  99. moondream/moondream3-preview -- 2025-10-03
  100. Unsupervised Hallucination Detection by Inspecting Reasoning Processes -- 2025-10-03
  101. Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training -- 2025-10-03
  102. Will fine-tuning LLaMA 3.2 11B Instruct on text-only data degrade its vision capabilities? -- 2025-10-03
  103. jmanhype/vggt-mps -- 2025-10-02
  104. openbmb/VoxCPM-0.5B -- 2025-10-02
  105. Comfy-Org/Qwen-Image-Edit_ComfyUI -- 2025-10-02
  106. SOTA OCR on-device with Core ML and dots.ocr -- 2025-10-02
  107. Stress-Testing RAG in Production: Retrieval Quality, Drift, and Hidden Costs -- 2025-09-30
  108. Tencent-Hunyuan/SRPO -- 2025-09-30
  109. lodestones/Chroma1-Base -- 2025-09-30
  110. MV-RAG: Retrieval Augmented Multiview Diffusion -- 2025-09-30
  111. Phantom-video/HuMo -- 2025-09-27
  112. Build Your Own 6K Camera -- 2025-09-27
  113. Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer -- 2025-09-27
  114. OPPOer/Qwen-Image-Pruning -- 2025-09-27
  115. From Research to Reality: Feasibility of Gradient Inversion Attacks in Federated Learning -- 2025-09-17
  116. Renting GPUs is hilariously cheap -- 2025-09-09
  117. Tencent-Hunyuan/HunyuanWorld-Voyager -- 2025-09-09
  118. Shipping textures as PNGs is suboptimal -- 2025-09-09
  119. inclusionAI/UI-Venus -- 2025-09-07
  120. WeChatCV/Stand-In_Preprocessor_ComfyUI -- 2025-09-07
  121. Robotic Canoe Puts Robot Arms to Work -- 2025-09-07
  122. LiquidAI/LFM2-VL-1.6B -- 2025-09-06
  123. A Training-Free, Task-Agnostic Framework for Enhancing MLLM Performance on High-Resolution Images -- 2025-09-06
  124. TencentARC/GenCompositor -- 2025-09-06
  125. Doubling Your Data in Minutes: Ultra-fast Tabular Data Generation via LLM-Induced Dependency Graphs -- 2025-08-30
  126. Temporal Point-Supervised Signal Reconstruction: A Human-Annotation-Free Framework for Weak Moving Target Detection -- 2025-08-30
  127. PurinNyova/Image-Detection-Bypass-Utility -- 2025-08-25
  128. An Alternative to Text-to-SQL -- 2025-08-25
  129. Best model for transcribing videos? -- 2025-08-25
  130. Meta released DINO-V3: SOTA for any Vision task -- 2025-08-21
  131. DatasetAgent: A Novel Multi-Agent System for Auto-Constructing Datasets from Real-World Images -- 2025-08-21
  132. WeChatCV/Stand-In -- 2025-08-21
  133. We built a 12B model that beats Claude 4 Sonnet at video captioning while costing 17x less - fully open source -- 2025-08-15
  134. Francis-Rings/StableAvatar -- 2025-08-15
  135. nvidia/canary-qwen-2.5b -- 2025-08-15
  136. OmniSVG/OmniSVG -- 2025-08-15
  137. Phi-Ground Tech Report: Advancing Perception in GUI Grounding -- 2025-08-15
  138. [UPDATE] DocStrange - Structured data extraction from images/pdfs/docs -- 2025-08-15
  139. NuMarkdown-8B-Thinking - first reasoning OCR VLM -- 2025-08-11
  140. Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling -- 2025-08-11
  141. Vision Language Model Alignment in TRL ⚡️ -- 2025-08-11
  142. How the best image generation models work from the inside? -- 2025-08-08
  143. AIDC-AI/Ovis-U1-3B -- 2025-08-08
  144. n0xa/SecKC-MHN-Globe -- 2025-08-08
  145. LLM-Based Identification of Infostealer Infection Vectors from Screenshots: The Case of Aurora -- 2025-08-08
  146. Reason ex Machina: Jailbreaking LLMs by Squeezing Their Brains | xayan.nu -- 2025-08-08
  147. Closing the Modality Gap for Mixed Modality Search -- 2025-08-07
  148. [Editorial] Mach-O binary analysis, with a focus on malware analysis and reverse engineering. -- 2025-07-31
  149. petqoo/ROGO -- 2025-07-31
  150. wyhlovecpp/GPT-Image-Edit -- 2025-07-31
  151. Show HN: MoebiusXBIN – ASCII and text-mode art editor with custom font support -- 2025-07-31
  152. boson-ai/higgs-audio -- 2025-07-24
  153. nvidia/canary-qwen-2.5b -- 2025-07-24
  154. TimeScope: How Long Can Your Video Large Multimodal Model Go? -- 2025-07-24
  155. ICML 2025 Outstanding Paper Awards -- 2025-07-24
  156. THUDM/GLM-4.1V-Thinking -- 2025-07-23
  157. FunAudioLLM/ThinkSound -- 2025-07-23
  158. RaphaelLiu/PusaV1 -- 2025-07-23
  159. merve/smol-vision -- 2025-07-23
  160. ChenDarYen/ComfyUI-NAG -- 2025-07-20
  161. Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation -- 2025-07-20
  162. quasiblob/ComfyUI-EsesImageEffectBloom -- 2025-07-18
  163. HiDream-ai/HiDream-E1-1 -- 2025-07-18
  164. Autoregressive Image Generation with Linear Complexity: A Spatial-Aware Decay Perspective -- 2025-07-18
  165. runjiali-rl/vmem -- 2025-07-17
  166. GeoArrow and GeoParquet, and the Future of Geospatial Data Analysis -- 2025-07-15
  167. TextPixs: Glyph-Conditioned Diffusion with Character-Aware Attention and OCR-Guided Supervision -- 2025-07-15
  168. Need advice on how to improve Handwritten Text Recognition of names using Vision models (for academic research purposes) -- 2025-07-14
  169. A fast 3D collision detection algorithm -- 2025-07-11
  170. 1000 African Voices: Advancing inclusive multi-speaker multi-accent speech synthesis -- 2025-07-06
  171. Hack Swaps Keys for Gang Signs, Everyone Gets In -- 2025-07-06
  172. Marker Gene Method: Identifying Stable Solutions in a Dynamic Environment -- 2025-07-06
  173. bytedance/ATI -- 2025-07-01
  174. AIDC-AI/Ovis-U1-3B -- 2025-07-01
  175. google/gemma-3n-E4B-it -- 2025-07-01
  176. baidu/ERNIE-4.5-21B-A3B-PT -- 2025-07-01
  177. bullerwins/FLUX.1-Kontext-dev-GGUF -- 2025-06-28
  178. google/gemma-3n-E2B-it -- 2025-06-28
  179. unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF -- 2025-06-26
  180. jinaai/jina-embeddings-v4 -- 2025-06-26
  181. Intelligent-Internet/II-Medical-8B-1706 -- 2025-06-26
  182. Tencent-Hunyuan/HunyuanPortrait -- 2025-06-24
  183. (0,2) hybrid models -- 2025-06-24
  184. 0-th Order Pseudo-differential Operator on the Circle -- 2025-06-24
  185. OmniGen2/OmniGen2 -- 2025-06-24
  186. gdhe17/Self-Forcing -- 2025-06-24
  187. lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill -- 2025-06-20
  188. Kijai/WanVideo_comfy -- 2025-06-20
  189. tencent/Hunyuan3D-2.1 -- 2025-06-20
  190. llama-server is cooking! gemma3 27b, 100K context, vision on one 24GB GPU. -- 2025-06-18
  191. I built a lightweight, private, MCP server to share context between AI tools -- 2025-06-18
  192. KVzip: Query-agnostic KV Cache Eviction — 3~4× memory reduction and 2× lower decoding latency -- 2025-06-18
  193. Ollama now supports streaming responses with tool calling -- 2025-06-18
  194. Chainlit or Open webui for production? -- 2025-06-18
  195. Ollama not releasing VRAM after running a model -- 2025-06-18
  196. Help Shape the Future of AI in India - Survey on Local vs Cloud LLM Usage (Developers/Students/AI Enthusiasts) -- 2025-06-18
  197. llmcontext: Attach your whole project in large context chats -- 2025-06-18
  198. Semantic search engine for ArXiv, biorxiv and medrxiv -- 2025-06-18
  199. Oodle 2.9.14 and Intel 13th/14th gen CPUs -- 2025-06-18
  200. A multi-turn tool-calling base model for RL agent training -- 2025-06-18
  201. GUI RAG that can do an unlimited number of documents, or at least many -- 2025-06-18
  202. showlab/OmniConsistency -- 2025-06-15
  203. echo840/MonkeyOCR -- 2025-06-15
  204. inclusionAI/Ming-Lite-Omni -- 2025-06-15
  205. New method for creating large 3D models of urban areas is faster and cheaper -- 2025-06-15
  206. showlab/D-AR -- 2025-06-14
  207. graphdeco-inria/on-the-fly-nvs -- 2025-06-14
  208. YOLO-World: Real-Time Open-Vocabulary Object Detection -- 2025-06-13
  209. Optical trapping with optical magnetic field and photonic Hall effect forces -- 2025-06-13
  210. (0,2) Mirror Symmetry on homogeneous Hopf surfaces -- 2025-06-13
  211. 0/1 Deep Neural Networks via Block Coordinate Descent -- 2025-06-10
  212. Hcompany/Holo1-3B -- 2025-06-04
  213. black-forest-labs/FLUX.1-schnell -- 2025-06-04
  214. PlayHT/PlayDiffusion -- 2025-06-04
  215. AMAP-ML/UniVG-R1 -- 2025-06-02
  216. showlab/OmniConsistency -- 2025-06-02
  217. tencent/HunyuanVideo-Avatar -- 2025-06-02
  218. Datadog/Toto-Open-Base-1.0 -- 2025-06-01
  219. 0.08 fF, 0.72 nA dark current, 91% Quantum Efficiency, 38 Gb/s Nano-photodetector on a 45 nm CMOS Silicon-Photonic Platform -- 2025-05-30
  220. 1000 FPS HDR Video With a Spike-RGB Hybrid Camera -- 2025-05-30
  221. 100,000 frames-per-second compressive imaging with a conventional rolling-shutter camera by random point-spread-function engineering -- 2025-05-29
  222. 1,000-Fold Enhancement of Light-Induced Magnetism in Plasmonic Au Nanoparticles -- 2025-05-29
  223. Model-Based Machine Learning (2023) -- 2025-05-28
  224. 0-MMS: Zero-Shot Multi-Motion Segmentation With A Monocular Event Camera -- 2025-05-28
  225. 0.8% Nyquist computational ghost imaging via non-experimental deep learning -- 2025-05-28