Qianwen (Kaia) Gao
Data Scientist
ML/NLP · Causal Inference · Growth Analysis
UC Berkeley grad student designing experimentation frameworks and predictive models. Passionate about turning large-scale data into product decisions that drive user engagement and growth.
About Me
From Communication to Computation
I started in Communication at Zhejiang University, fascinated by how information shapes behavior. That curiosity led me to data: first analyzing user behavior, then optimizing product experience, and now researching how AI systems process information.
Today, I'm a Data Science grad student at UC Berkeley, designing experiments to understand why things happen, not just what happened. I'm proficient in causal inference, machine learning, and NLP, building predictive models, running A/B tests, and turning large-scale data into actionable insights.
I believe the best data scientists are storytellers who let the data speak.
User Behavior Analysis
Audience segmentation, retention modeling, and behavioral insights that drove revenue impact.
Causal Inference & Experimentation
A/B experiments, factorial design, multivariate regression, and statistical inference to inform decisions.
AI & LLM Research
RAG systems, content freshness benchmarks, and multi-agent automation for GenAI applications.
Skills & Technologies
A comprehensive toolkit for data science and growth analytics
Programming
Databases & Backend
Statistics
Machine Learning & Frameworks
Data Processing & Visualization
Tools & Workflow
Languages
Featured Projects
Data Science in action
Consumer Sentiment & Brand Insights from Amazon Fashion Reviews
Course Project | Oct 2025 – Nov 2025
Analyzed 2.5M Amazon Fashion reviews to extract customer sentiment and brand perception using NLP techniques (VADER, BERT embeddings, topic modeling). Built regression and clustering models to identify key drivers of satisfaction and differentiate brand positioning. Visualized sentiment and keyword trends across categories through an interactive Streamlit dashboard, providing actionable insights for marketing and product strategy.
Predictive Absenteeism & Early-Warning Signal Analysis
Capstone Project | ONGB & Wizearly | Feb 2026 – May 2026
Collaborated with Oakland Natives Give Back (ONGB) and Wizearly to analyze chronic absenteeism trends by synthesizing national datasets (NCES, Census) with thousands of granular OUSD student records. Engineered a Predictive Feature Library using the IPIR framework to identify behavioral and academic risk signals, validating national patterns against local data. Developed a dual-scale landscape report and interactive dashboard to provide data-driven intervention strategies for school district leadership.
Finfluencers' Impact on Trading Behavior
Course Project | Nov 2025 – Dec 2025
Investigated the causal impact of "finfluencer" (financial influencer) sentiment on stock trading liquidity using a balanced panel dataset of five major tech stocks (AAPL, AMZN, FB, NVDA, TSLA) from 2020 to 2022. Constructed a Panel OLS regression model with Entity Fixed Effects and clustered standard errors to control for unobserved heterogeneity and serial correlation. Identified that market volatility (VIX) and negative retail sentiment ("fear") are the primary drivers of trading volume, with the final model explaining 41% of day-to-day variance in trading activity.
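For readers unfamiliar with entity fixed effects, the core idea can be sketched as a "within" transformation: subtract each stock's own mean from its observations, then fit a slope on what remains. This is a minimal, hypothetical illustration on synthetic numbers, not the project's actual model or data; the function name `within_estimator` and the sample rows are assumptions, it uses a single regressor, and it does not compute the clustered standard errors mentioned above.

```python
from collections import defaultdict


def within_estimator(rows):
    """Slope of y on x with entity fixed effects (single regressor).

    rows: iterable of (entity, x, y) tuples.
    x and y are demeaned within each entity, then a pooled OLS slope
    is computed on the demeaned values.
    """
    groups = defaultdict(list)
    for entity, x, y in rows:
        groups[entity].append((x, y))

    numerator = 0.0
    denominator = 0.0
    for observations in groups.values():
        x_mean = sum(x for x, _ in observations) / len(observations)
        y_mean = sum(y for _, y in observations) / len(observations)
        for x, y in observations:
            numerator += (x - x_mean) * (y - y_mean)
            denominator += (x - x_mean) ** 2

    if denominator == 0:
        raise ValueError("x has no within-entity variation")
    return numerator / denominator


# Synthetic panel (illustrative only): y = entity_intercept + 2 * x.
# S1 has intercept 10 and S2 has intercept 50; demeaning removes both.
synthetic_rows = [
    ("S1", 1.0, 12.0), ("S1", 2.0, 14.0), ("S1", 3.0, 16.0),
    ("S2", 1.0, 52.0), ("S2", 4.0, 58.0), ("S2", 7.0, 64.0),
]
slope = within_estimator(synthetic_rows)
print(f"Estimated within slope: {slope:.2f}")
```

Because the entity-specific intercepts drop out after demeaning, the estimated slope reflects only variation within each stock over time, which is the sense in which fixed effects control for unobserved, time-invariant heterogeneity.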
California Housing Market Affordability Analysis
Course Project | Nov 2025 – Dec 2025
Investigated the "Gravity of Affordability" in California housing markets by synthesizing construction permit data (HUD), sales volume (Redfin), and demographic trends (NIH) from 1980–2022. Calculated Price-to-Income Ratios (PIR) to quantify affordability gaps across key counties like San Francisco and Riverside, revealing a decoupling of local incomes from housing costs. Visualized supply inelasticity and migration pressures using R (ggplot2) to demonstrate how low affordability drives population shifts despite stagnant construction responsiveness.
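As a rough illustration of the Price-to-Income Ratio mentioned above, the sketch below assumes the common definition of PIR as median home price divided by median annual household income. The function name and the county figures are placeholders invented for the example; they are not values from the project, which used R rather than Python.

```python
def price_to_income_ratio(median_home_price: float,
                          median_household_income: float) -> float:
    """Return PIR: median home price divided by median annual household income."""
    if median_household_income <= 0:
        raise ValueError("median_household_income must be positive")
    return median_home_price / median_household_income


# Placeholder inputs for illustration only (price, income), not project data.
example_counties = {
    "County A": (1_200_000, 120_000),
    "County B": (500_000, 80_000),
}

pir_by_county = {
    name: price_to_income_ratio(price, income)
    for name, (price, income) in example_counties.items()
}

for name, pir in pir_by_county.items():
    print(f"{name}: PIR = {pir:.2f}")
```

A higher ratio means a typical household needs more years of income to cover a typical home, so comparing ratios across counties or years gives a simple view of affordability gaps.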
FreshRAG: Causal Benchmark for RAG Freshness & Hallucination
Research Project (In Progress) | 2026
Designed FreshRAG, a large-scale causal benchmark (50K+ QA pairs) to measure how content freshness reduces hallucination in retrieval-augmented generation (RAG) systems. Built a temporal-gradient dataset from multi-year knowledge snapshots and constructed controlled retrieval scenarios to isolate mechanisms including knowledge conflict resolution, temporal grounding, and parametric override. Implemented counterfactual evaluation protocols and mechanism-level effect decomposition, enabling regression-based and experimental estimation of freshness treatment effects across models and domains.
Consentful Civic Lens: Event Organizer
CalHacks Project | Oct 2025
Built a full-stack web app with Next.js, Supabase, and PostgreSQL for event consent management and storytelling. Integrated Claude API and LiveKit for AI-generated highlight summaries, and developed a recommendation system to personalize future event suggestions based on user interests and location.
Experience & Leadership
Research, analytics, and leadership across tech and creative teams
AI Research Intern
Wrodium
Berkeley, CA · Dec 2025 – Present
- Causal Benchmark Development: Leading development of a research framework to quantify how content freshness reduces LLM hallucination through three causal mechanisms (Knowledge Conflict, Temporal Grounding, Parametric Override).
- Temporal QA Dataset Construction: Built a QA dataset using Myers diff for factual change detection; designed factorial experiments with logistic regression decomposition to isolate mechanism effects across 6 domains and multiple LLMs.
- Content Pipeline Automation: Engineered a multi-agent workflow using Make.com and LLM APIs to automate technical blog generation on Generative Engine Optimization (GEO), synthesizing retrieval-augmented generation (RAG) research into educational content.
Strategy & Data Analyst Intern
APPA Health
Berkeley, CA · Sept – Dec 2025
- Market Opportunity Analysis: Analyzed the educational funding landscape to identify and evaluate a pipeline of potential funding opportunities supporting youth wellness.
- Impact Measurement & Reporting: Established a KPI framework to measure SEL program effectiveness. Analyzed pre- and post-program survey data to quantify impact on student engagement, providing key insights for program iteration and reporting to funding partners.
Marketing Analytics Intern
RedNote
Shanghai, China · Aug 2024 – Jan 2025
- Audience Segmentation: Queried and analyzed behavioral and demographic user data using SQL in Hive on a large-scale data warehouse to create 35 pet industry audience segments, contributing to ¥1.83M (~$250K) in ad revenue and improved ad targeting accuracy within the first month.
- KPI Automation: Developed and automated marketing KPI dashboards using Python, SQL, and RedBI (BI tool comparable to Power BI) to track campaign performance, user engagement, and retention metrics. Presented findings and strategic recommendations to over 740 clients and internal stakeholders.
- Marketing Strategy: Designed and analyzed A/B tests to optimize ad targeting strategies and creatives. Integrated CRM data to conduct deep-dive analyses on marketing performance, providing insights that improved marketing efficiency and ROI.
Product & User Analytics Intern
Didi
Hangzhou, China · Mar – Jun 2024
- Pricing Analytics: Conducted multivariate regression and causal inference analyses on supply-demand patterns and user price elasticity to inform dynamic pricing strategies, leading to a 2% revenue lift.
- User Research: Designed and distributed user surveys to identify pain points in the "hourly driver" service; combined findings with SQL-based behavioral analysis to uncover actionable product insights, driving a 3% reduction in complaints and measurable improvement in driver-passenger experience.
Content Operation Intern
Huace Film & TV
Hangzhou, China · Jun – Sept 2023
- Content Engagement Analysis: Queried and analyzed 10,000+ follower records using SQL and Python to identify audience attributes and content preferences; created user clusters that informed strategy adjustments, boosting page views by 15.1%.
- A/B Testing: Conducted A/B tests to refine video strategy; produced and distributed 300+ YouTube clips, leveraging insights to drive engagement from 620K+ global followers.
President
ZJU Lingyun Musical Club
Hangzhou, China · Sept 2021 – May 2024
- Managed club operations across 8 departments with 150+ members; led the annual musical theatre production, drawing 6,000+ audience members.
- Produced an original musical commemorating the 40th anniversary of Chu Kochen Honors College, overseeing recruitment, script development, budgeting, and cross-team coordination.
Education
Master of Computational Social Science
University of California, Berkeley
Berkeley, CA · Jun 2025 – Present
- GPA: 3.87/4.00
- Relevant Coursework: Advanced Computing, Machine Learning, Advanced Applied Statistics, Data Visualization, Deep Learning for Visual Data (DeCal)
Bachelor of Arts, Communication
Zhejiang University (ZJU)
Hangzhou, China · Sept 2021 – Jun 2025
- GPA: 3.95/4.00
- Relevant Coursework: Big Data Analytics, Advanced Mathematics, Probability and Mathematical Statistics, Python Programming, Introduction to Research Methodology in Social Sciences
Resume
View or download my resume for reference
Qianwen (Kaia) Gao
Data Scientist · Berkeley, CA
Get In Touch
Always happy to chat about data science, career opportunities, or the latest industry trends!
Contact Information
Phone
+1 (510) 542-6385
GitHub
github.com/kaiagaoo
Location
Berkeley, CA