Kaia Gao

Qianwen (Kaia) Gao

Data Scientist

ML/NLP · Causal Inference · Growth Analysis

UC Berkeley grad student designing experimentation frameworks and predictive models. Passionate about turning large-scale data into product decisions that drive user engagement and growth.

About Me

From Communication to Computation

I started in Communication at Zhejiang University, fascinated by how information shapes behavior. That curiosity led me to data: first analyzing user behavior, then optimizing product experience, and now researching how AI systems process information.

Today, I'm a Data Science grad student at UC Berkeley, designing experiments to understand why things happen, not just what happened. I'm proficient in causal inference, machine learning, and NLP, and I apply them to build predictive models, run A/B tests, and turn large-scale data into actionable insights.

I believe the best data scientists are storytellers who let the data speak.

User Behavior Analysis

Audience segmentation, retention modeling, and behavioral insights that drove revenue impact.

Causal Inference & Experimentation

A/B experiments, factorial design, multivariate regression, and statistical inference to inform decisions.

AI & LLM Research

RAG systems, content freshness benchmarks, and multi-agent automation for GenAI applications.

Skills & Technologies

A comprehensive toolkit for data science and growth analytics

Programming

🐍 Python (pandas, numpy, scikit-learn)
🗄️ SQL
📊 R
💻 HTML, CSS, JavaScript
⚡ Next.js

Databases & Backend

🔌 Supabase
🐘 PostgreSQL
📦 MySQL
🔗 RESTful APIs

Statistics

🧪 A/B Testing
📐 Causal Inference
📈 Regression Analysis
🎲 Bayesian Methods

Machine Learning & Frameworks

🤖 Predictive Modeling
🔥 PyTorch
🧠 TensorFlow

Data Processing & Visualization

📊 Tableau
🎨 Matplotlib
📈 Seaborn
📉 Plotly
🖥️ Streamlit

Tools & Workflow

🔧 Git
🐙 GitHub
📓 Jupyter
☁️ Google Colab
📋 Excel

Languages

🌐 English (Professional)
🇨🇳 Mandarin (Native)

Featured Projects

Data Science in action

๐Ÿ‘—

Consumer Sentiment & Brand Insights from Amazon Fashion Reviews

Course Project | Oct 2025 – Nov 2025

Analyzed 2.5M Amazon Fashion reviews to extract customer sentiment and brand perception using NLP techniques (VADER, BERT embeddings, topic modeling). Built regression and clustering models to identify key drivers of satisfaction and differentiate brand positioning. Visualized sentiment and keyword trends across categories through an interactive Streamlit dashboard, providing actionable insights for marketing and product strategy.

Python · VADER · BERT · Streamlit · Scikit-learn
🎓

Predictive Absenteeism & Early-Warning Signal Analysis

Capstone Project | ONGB & Wizearly | Feb 2026 – May 2026

Collaborated with Oakland Natives Give Back (ONGB) and Wizearly to analyze chronic absenteeism trends by synthesizing national datasets (NCES, Census) with thousands of granular OUSD student records. Engineered a Predictive Feature Library using the IPIR framework to identify behavioral and academic risk signals, validating national patterns against local data. Developed a dual-scale landscape report and interactive dashboard to provide data-driven intervention strategies for school district leadership.

Python · SQL · Pandas · Scikit-Learn · Tableau · Statistical Modeling
📈

Finfluencers' Impact on Trading Behavior

Course Project | Nov 2025 – Dec 2025

Investigated the causal impact of "finfluencer" (financial influencer) sentiment on stock trading liquidity using a balanced panel dataset of five major tech stocks (AAPL, AMZN, FB, NVDA, TSLA) from 2020 to 2022. Constructed a Panel OLS regression model with Entity Fixed Effects and clustered standard errors to control for unobserved heterogeneity and serial correlation. Identified that market volatility (VIX) and negative retail sentiment ("fear") are the primary drivers of trading volume, with the final model explaining 41% of day-to-day variance in trading activity.

Panel OLS Regression · Fixed Effects Modeling · Hypothesis Testing · Econometrics · Statistical Analysis
🏡

California Housing Market Affordability Analysis

Course Project | Nov 2025 – Dec 2025

Investigated the "Gravity of Affordability" in California housing markets by synthesizing construction permit data (HUD), sales volume (Redfin), and demographic trends (NIH) from 1980–2022. Calculated Price-to-Income Ratios (PIR) to quantify affordability gaps across key counties like San Francisco and Riverside, revealing a decoupling of local incomes from housing costs. Visualized supply inelasticity and migration pressures using R (ggplot2) to demonstrate how low affordability drives population shifts despite stagnant construction responsiveness.

R · ggplot2 · dplyr · Hex · Data Visualization
🧠

FreshRAG: Causal Benchmark for RAG Freshness & Hallucination

Research Project (In Progress) | 2026

Designed FreshRAG, a large-scale causal benchmark (50K+ QA pairs) to measure how content freshness reduces hallucination in retrieval-augmented generation (RAG) systems. Built a temporal-gradient dataset from multi-year knowledge snapshots and constructed controlled retrieval scenarios to isolate mechanisms including knowledge conflict resolution, temporal grounding, and parametric override. Implemented counterfactual evaluation protocols and mechanism-level effect decomposition, enabling regression-based and experimental estimation of freshness treatment effects across models and domains.

Python · RAG · Causal Inference · Experimental Design · NLP · LLM Evaluation
🎭

Consentful Civic Lens – Event Organizer

CalHacks Project | Oct 2025

Built a full-stack web app with Next.js, Supabase, and PostgreSQL for event consent management and storytelling. Integrated Claude API and LiveKit for AI-generated highlight summaries, and developed a recommendation system to personalize future event suggestions based on user interests and location.

Next.js · Supabase · PostgreSQL · Claude API · LiveKit

Experience & Leadership

Research, analytics, and leadership across tech and creative teams

AI Research Intern

Wrodium

Berkeley, CA · Dec 2025 – Present

  • Causal Benchmark Development – Leading development of a research framework to quantify how content freshness reduces LLM hallucination through three causal mechanisms (Knowledge Conflict, Temporal Grounding, Parametric Override).
  • Temporal QA Dataset Construction – Built a QA dataset using Myers diff for factual change detection; designed factorial experiments with logistic regression decomposition to isolate mechanism effects across 6 domains and multiple LLMs.
  • Content Pipeline Automation – Engineered a multi-agent workflow using Make.com and LLM APIs to automate technical blog generation on Generative Engine Optimization (GEO), synthesizing retrieval-augmented generation (RAG) research into educational content.

Strategy & Data Analyst Intern

APPA Health

Berkeley, CA · Sept – Dec 2025

  • Market Opportunity Analysis – Analyzed the educational funding landscape to identify and evaluate a pipeline of potential funding opportunities supporting youth wellness.
  • Impact Measurement & Reporting – Established a KPI framework to measure SEL program effectiveness. Analyzed pre- and post-program survey data to quantify impact on student engagement, providing key insights for program iteration and reporting to funding partners.

Marketing Analytics Intern

RedNote

Shanghai, China · Aug 2024 – Jan 2025

  • Audience Segmentation – Queried and analyzed behavioral and demographic user data using SQL in Hive on a large-scale data warehouse to create 35 pet industry audience segments, contributing to ¥1.83M (~$250K) in ad revenue and improved ad targeting accuracy within the first month.
  • KPI Automation – Developed and automated marketing KPI dashboards using Python, SQL, and RedBI (BI tool comparable to Power BI) to track campaign performance, user engagement, and retention metrics. Presented findings and strategic recommendations to over 740 clients and internal stakeholders.
  • Marketing Strategy – Designed and analyzed A/B tests to optimize ad targeting strategies and creatives. Integrated CRM data to conduct deep-dive analyses on marketing performance, providing insights that improved marketing efficiency and ROI.

Product & User Analytics Intern

Didi

Hangzhou, China · Mar – Jun 2024

  • Pricing Analytics – Conducted multivariate regression and causal inference analyses on supply-demand patterns and user price elasticity to inform dynamic pricing strategies, leading to a 2% revenue lift.
  • User Research – Designed and distributed user surveys to identify pain points in the "hourly driver" service; combined findings with SQL-based behavioral analysis to uncover actionable product insights, driving a 3% reduction in complaints and measurable improvement in driver-passenger experience.

Content Operation Intern

Huace Film & TV

Hangzhou, China · Jun – Sept 2023

  • Content Engagement Analysis – Queried and analyzed 10,000+ follower records using SQL and Python to identify audience attributes and content preferences; created user clusters that informed strategy adjustments, boosting page views by 15.1%.
  • A/B Testing – Conducted A/B tests to refine video strategy; produced and distributed 300+ YouTube clips, leveraging insights to drive engagement from 620K+ global followers.

President

ZJU Lingyun Musical Club

Hangzhou, China · Sept 2021 – May 2024

  • Managed club operations across 8 departments with 150+ members; led the annual musical theatre production, drawing 6,000+ audience members.
  • Produced an original musical commemorating the 40th anniversary of Chu Kochen Honors College, overseeing recruitment, script development, budgeting, and cross-team coordination.

Education

Master of Computational Social Science

University of California, Berkeley

Berkeley, CA · Jun 2025 – Present

  • GPA: 3.87/4.00
  • Relevant Coursework: Advanced Computing, Machine Learning, Advanced Applied Statistics, Data Visualization, Deep Learning for Visual Data (DeCal)

Bachelor of Arts, Communication

Zhejiang University (ZJU)

Hangzhou, China · Sept 2021 – Jun 2025

  • GPA: 3.95/4.00
  • Relevant Coursework: Big Data Analytics, Advanced Mathematics, Probability and Mathematical Statistics, Python Programming, Introduction to Research Methodology in Social Sciences

Resume

View or download my resume for reference

Qianwen (Kaia) Gao

Data Scientist · Berkeley, CA

Get In Touch

Always happy to chat about data science, career opportunities, or the latest industry trends!

Contact Information