Python API Reference
Complete API reference for the Veloxx Python bindings.
Installation
pip install veloxx
Quick Start
import veloxx as vx
# Load data
df = vx.read_csv("data.csv")
# Basic operations
ages = df.get_column("age")
filtered = df.filter([i for i, a in enumerate(ages.to_list()) if a is not None and a > 25])
grouped = df.group_by(["department"]).mean()
Core Classes
PyDataFrame
The main data structure for working with tabular data in Python.
Constructors
Creates a new DataFrame from a dictionary of column names to PySeries.
Parameters:
columns: dict - Dictionary mapping column names to PySeries objects
Example:
import veloxx as vx
df = vx.PyDataFrame({
    "name": vx.PySeries("name", ["Alice", "Bob", "Charlie"]),
    "age": vx.PySeries("age", [25, 30, 35]),
    "salary": vx.PySeries("salary", [50000.0, 75000.0, 60000.0])
})
Class Methods
Loads a DataFrame from a CSV file with automatic type inference.
Parameters:
path: str - Path to the CSV file
Example:
df = vx.PyDataFrame.from_csv("data/employees.csv")
print(f"Loaded {df.row_count()} rows")
Loads a DataFrame from a JSON file.
Parameters:
path: str - Path to the JSON file
Example:
df = vx.PyDataFrame.from_json("data/users.json")
Properties
Returns the number of rows in the DataFrame.
Example:
print(f"DataFrame has {df.row_count()} rows")
Returns the number of columns in the DataFrame.
Example:
print(f"DataFrame has {df.column_count()} columns")
Returns a list of column names.
Example:
names = df.column_names()
for name in names:
    print(f"Column: {name}")
Data Access
Gets a column by name.
Parameters:
name: str - Name of the column to retrieve
Example:
age_column = df.get_column("age")
if age_column:
    print(f"Age column has {age_column.len()} values")
Gets a column using bracket notation (syntactic sugar).
Example:
# These are equivalent
age1 = df.get_column("age")
age2 = df["age"]
Data Manipulation
Filters rows by index positions.
Parameters:
row_indices: List[int] - List of row indices to keep
Example:
# Filter rows where age > 25
age_series = df.get_column("age")
indices = [i for i, age in enumerate(age_series.to_list()) if age is not None and age > 25]
filtered_df = df.filter(indices)
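Since filtering by value always reduces to "build a list of index positions, then call filter", the pattern is worth factoring out. The helper below is a hypothetical plain-Python utility (not part of the Veloxx API) that skips nulls explicitly, so a valid value of 0 is not dropped by accident:

```python
def matching_indices(values, predicate):
    """Return the positions whose non-null values satisfy the predicate."""
    return [i for i, v in enumerate(values) if v is not None and predicate(v)]

ages = [25, 30, None, 35, 22]
keep = matching_indices(ages, lambda age: age > 25)
print(keep)  # [1, 3]
```

The resulting list can be passed straight to `df.filter(keep)`.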
Selects specific columns from the DataFrame.
Parameters:
names: List[str] - Names of columns to select
Example:
selected = df.select_columns(["name", "age"])
Removes specified columns from the DataFrame.
Parameters:
names: List[str] - Names of columns to drop
Example:
without_id = df.drop_columns(["id"])
Renames a column in the DataFrame.
Parameters:
old_name: str - Current name of the column
new_name: str - New name for the column
Example:
renamed = df.rename_column("age", "years")
Adds a new column or replaces an existing one using an expression.
Parameters:
name: str - Name of the new column
expr: PyExpr - Expression to compute the column values
Example:
# Add a column with salary + 1000 bonus
expr = vx.PyExpr.add(
    vx.PyExpr.column("salary"),
    vx.PyExpr.literal(1000.0)
)
with_bonus = df.with_column("salary_with_bonus", expr)
Grouping and Aggregation
Groups the DataFrame by specified columns.
Parameters:
by_columns: List[str] - Columns to group by
Example:
grouped = df.group_by(["department"])
result = grouped.mean()
Generates descriptive statistics for numeric columns.
Example:
stats = df.describe()
print(stats)
Statistical Methods
Calculates the Pearson correlation between two numeric columns.
Parameters:
col1_name: str - Name of the first column
col2_name: str - Name of the second column
Example:
corr = df.correlation("age", "salary")
print(f"Age-Salary correlation: {corr:.3f}")
Calculates the covariance between two numeric columns.
Parameters:
col1_name: str - Name of the first column
col2_name: str - Name of the second column
Example:
cov = df.covariance("age", "salary")
print(f"Age-Salary covariance: {cov:.2f}")
Joining
Joins this DataFrame with another DataFrame.
Parameters:
other: PyDataFrame - DataFrame to join with
on_column: str - Column name to join on
join_type: PyJoinType - Type of join (Inner, Left, Right)
Example:
joined = df1.join(df2, "user_id", vx.PyJoinType.Inner)
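To make the join semantics concrete, here is a plain-Python sketch of what an inner join on a key column produces; this is an illustration of the semantics, not how Veloxx implements joins internally:

```python
left = [
    {"user_id": 1, "name": "Alice"},
    {"user_id": 2, "name": "Bob"},
]
right = [
    {"user_id": 1, "city": "Oslo"},
    {"user_id": 3, "city": "Lima"},
]

# An inner join keeps only rows whose key appears on both sides
inner = [{**l, **r} for l in left for r in right if l["user_id"] == r["user_id"]]
print(inner)  # [{'user_id': 1, 'name': 'Alice', 'city': 'Oslo'}]
```

A left join would additionally keep Bob's row (with the right-side columns null); a right join would keep the `user_id` 3 row instead.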
Sorting and Ordering
Sorts the DataFrame by specified columns.
Parameters:
by_columns: List[str] - Columns to sort by
ascending: bool - Sort order (default: True)
Example:
sorted_df = df.sort(["age", "name"], ascending=True)
Data Cleaning
Removes rows containing any null values.
Example:
clean_df = df.drop_nulls()
Fills null values with a specified value.
Parameters:
value: Any - Value to use for filling nulls
Example:
filled = df.fill_nulls(0) # Fill with 0
filled_str = df.fill_nulls("Unknown") # Fill with string
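Conceptually, `fill_nulls` replaces each null with the given value, column by column. A plain-Python sketch of that behavior for one column:

```python
column = ["Alice", None, "Charlie", None]

# Replace every null (None) with the fill value
filled = ["Unknown" if v is None else v for v in column]
print(filled)  # ['Alice', 'Unknown', 'Charlie', 'Unknown']
```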
I/O Operations
Writes the DataFrame to a CSV file.
Parameters:
path: str - Output file path
Example:
df.to_csv("output/results.csv")
Concatenation
Appends another DataFrame vertically.
Parameters:
other: PyDataFrame - DataFrame to append
Example:
combined = df1.append(df2)
PyGroupedDataFrame
Represents a grouped DataFrame for aggregation operations.
Aggregation Methods
Calculates the sum for each group.
Example:
grouped = df.group_by(["department"])
sums = grouped.sum()
Calculates the mean for each group.
Example:
averages = grouped.mean()
Counts values for each group.
Example:
counts = grouped.count()
Finds the minimum value for each group.
Example:
minimums = grouped.min()
Finds the maximum value for each group.
Example:
maximums = grouped.max()
Performs custom aggregations.
Parameters:
aggregations: List[Tuple[str, str]] - List of (column, aggregation_function) tuples
Example:
result = grouped.agg([
    ("salary", "mean"),
    ("age", "count"),
    ("experience", "max")
])
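For intuition, a group-by with aggregations is equivalent to bucketing rows by key and reducing each bucket. The following stdlib-only sketch mirrors `group_by(["department"]).agg([("salary", "mean"), ("age", "count")])` on hand-built rows (illustrative only, not the Veloxx implementation):

```python
from collections import defaultdict
from statistics import mean

rows = [
    {"department": "eng", "salary": 100.0, "age": 30},
    {"department": "eng", "salary": 80.0, "age": 40},
    {"department": "ops", "salary": 60.0, "age": 50},
]

# Bucket rows by the group key
groups = defaultdict(list)
for row in rows:
    groups[row["department"]].append(row)

# Reduce each bucket with the requested aggregations
result = {
    dept: {
        "salary_mean": mean(r["salary"] for r in members),
        "age_count": len(members),
    }
    for dept, members in groups.items()
}
print(result)
```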
PySeries
Represents a single column of data.
Constructors
Creates a new Series with automatic type inference.
Parameters:
name: str - Name of the series
data: List[Any] - List of values (supports None for nulls)
Example:
# Integer series
ages = vx.PySeries("age", [25, 30, None, 35])
# String series
names = vx.PySeries("name", ["Alice", "Bob", None, "Charlie"])
# Float series
salaries = vx.PySeries("salary", [50000.0, 75000.0, 60000.0])
# Boolean series
active = vx.PySeries("is_active", [True, False, True])
Properties
Returns the name of the Series.
Example:
print(f"Series name: {series.name()}")
Returns the length of the Series.
Example:
print(f"Series has {series.len()} values")
Checks if the Series is empty.
Example:
if series.is_empty():
    print("Series is empty")
Returns the data type of the Series.
Example:
dtype = series.data_type()
print(f"Series type: {dtype}")
Data Access
Gets the value at a specific index.
Parameters:
index: int - Index of the value to retrieve
Example:
first_value = series.get_value(0)
print(f"First value: {first_value}")
Converts the Series to a Python list.
Example:
values = series.to_list()
for value in values:
    if value is not None:
        print(value)
Statistical Methods
Calculates the sum of numeric values.
Example:
total = series.sum()
print(f"Sum: {total}")
Calculates the mean of numeric values.
Example:
average = series.mean()
print(f"Average: {average}")
Calculates the median of numeric values.
Example:
median = series.median()
print(f"Median: {median}")
Finds the minimum value.
Example:
minimum = series.min()
print(f"Minimum: {minimum}")
Finds the maximum value.
Example:
maximum = series.max()
print(f"Maximum: {maximum}")
Calculates the standard deviation.
Example:
std_dev = series.std()
print(f"Standard deviation: {std_dev}")
Counts non-null values.
Example:
non_null_count = series.count()
print(f"Non-null values: {non_null_count}")
Returns a Series with unique values.
Example:
unique_values = series.unique()
print(f"Unique values: {unique_values.len()}")
Data Manipulation
Filters the Series by index positions.
Parameters:
row_indices: List[int] - List of indices to keep
Example:
filtered = series.filter([0, 2, 4]) # Keep indices 0, 2, 4
Fills null values with a specified value.
Parameters:
value: Any - Value to use for filling nulls
Example:
filled = series.fill_nulls(0)
PyExpr
Represents expressions for computed columns.
Static Methods
Creates a column reference expression.
Parameters:
name: str - Name of the column to reference
Example:
expr = vx.PyExpr.column("salary")
Creates a literal value expression.
Parameters:
value: Any - The literal value
Example:
expr = vx.PyExpr.literal(1000.0)
Arithmetic Operations
Creates an addition expression.
Example:
expr = vx.PyExpr.add(
    vx.PyExpr.column("base_salary"),
    vx.PyExpr.column("bonus")
)
Creates a subtraction expression.
Example:
expr = vx.PyExpr.subtract(
    vx.PyExpr.column("revenue"),
    vx.PyExpr.column("costs")
)
Creates a multiplication expression.
Example:
expr = vx.PyExpr.multiply(
    vx.PyExpr.column("quantity"),
    vx.PyExpr.column("price")
)
Creates a division expression.
Example:
expr = vx.PyExpr.divide(
    vx.PyExpr.column("total_sales"),
    vx.PyExpr.column("num_customers")
)
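These constructors nest freely, so arbitrary arithmetic trees can be built and handed to `with_column`. The stand-in below uses plain closures to show how such expression trees compose and evaluate per row; the names mirror `PyExpr` but this is an illustration, not the real implementation:

```python
# Hypothetical closures standing in for PyExpr nodes
def column(name):
    return lambda row: row[name]

def literal(value):
    return lambda row: value

def add(a, b):
    return lambda row: a(row) + b(row)

def multiply(a, b):
    return lambda row: a(row) * b(row)

# net = quantity * price + 5.0
net = add(multiply(column("quantity"), column("price")), literal(5.0))
print(net({"quantity": 3, "price": 2.0}))  # 11.0
```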
PyJoinType
Enumeration for join types.
class PyJoinType:
    Inner = "Inner"
    Left = "Left"
    Right = "Right"
Example:
joined = df1.join(df2, "user_id", vx.PyJoinType.Left)
Convenience Functions
Data Loading
Convenience function to load CSV files.
Example:
import veloxx as vx
df = vx.read_csv("data.csv")
Convenience function to load JSON files.
Example:
df = vx.read_json("data.json")
Usage Patterns
Basic Data Analysis
import veloxx as vx
# Load data
df = vx.read_csv("sales_data.csv")
# Basic info
print(f"Dataset: {df.row_count()} rows, {df.column_count()} columns")
print(f"Columns: {df.column_names()}")
# Filter high-value sales
high_value_indices = []
amount_series = df.get_column("amount")
for i, amount in enumerate(amount_series.to_list()):
    if amount is not None and amount > 1000:
        high_value_indices.append(i)
high_value_sales = df.filter(high_value_indices)
# Group by and aggregate
summary = high_value_sales.group_by(["region"]).agg([
    ("amount", "sum"),
    ("amount", "mean"),
    ("customer_id", "count")
])
print(summary)
Advanced Analytics
import veloxx as vx
def analyze_customer_data():
    # Load customer data
    customers = vx.read_csv("customers.csv")
    orders = vx.read_csv("orders.csv")

    # Join datasets
    customer_orders = customers.join(orders, "customer_id", vx.PyJoinType.Inner)

    # Calculate customer lifetime value
    clv_expr = vx.PyExpr.multiply(
        vx.PyExpr.column("order_value"),
        vx.PyExpr.column("order_frequency")
    )
    with_clv = customer_orders.with_column("lifetime_value", clv_expr)

    # Segment customers
    high_value_indices = []
    clv_series = with_clv.get_column("lifetime_value")
    for i, clv in enumerate(clv_series.to_list()):
        if clv is not None and clv > 5000:
            high_value_indices.append(i)
    high_value_customers = with_clv.filter(high_value_indices)

    # Analyze by segment
    segment_analysis = high_value_customers.group_by(["customer_segment"]).agg([
        ("lifetime_value", "mean"),
        ("order_frequency", "mean"),
        ("customer_id", "count")
    ])
    return segment_analysis
# Run analysis
results = analyze_customer_data()
print(results)
Data Cleaning Pipeline
import veloxx as vx
def clean_dataset(df):
    """Clean and prepare a dataset for analysis."""
    # Remove rows with missing critical data
    clean_df = df.drop_nulls()

    # Fill missing values in optional columns
    filled_df = clean_df.fill_nulls("Unknown")

    # Remove outliers: keep only plausible ages (0-100 inclusive)
    age_series = filled_df.get_column("age")
    valid_indices = []
    for i, age in enumerate(age_series.to_list()):
        if age is not None and 0 <= age <= 100:
            valid_indices.append(i)
    filtered_df = filled_df.filter(valid_indices)

    # Standardize column names
    standardized = filtered_df.rename_column("customer_name", "name")
    standardized = standardized.rename_column("customer_age", "age")
    return standardized
# Usage
raw_data = vx.read_csv("raw_customer_data.csv")
clean_data = clean_dataset(raw_data)
clean_data.to_csv("clean_customer_data.csv")
Performance Tips
- Use appropriate data types: Let Veloxx infer types automatically for best performance
- Filter early: Apply filters before expensive operations like joins
- Use vectorized operations: Leverage expressions instead of loops
- Process in chunks: For very large datasets, process in smaller chunks
- Minimize data copying: Chain operations when possible
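The "filter early" tip is easy to see with a plain-Python illustration: pushing the filter ahead of a join-like lookup means the expensive per-row work only touches the surviving rows. The data below is made up for the sketch:

```python
orders = [{"user_id": i % 10, "amount": float(i)} for i in range(1000)]
names = {i: f"user{i}" for i in range(10)}

# Filter first, then enrich: the lookup touches 99 rows instead of 1000
high_value = [o for o in orders if o["amount"] > 900.0]
enriched = [{**o, "name": names[o["user_id"]]} for o in high_value]
print(len(enriched))  # 99
```

The same ordering applies to real Veloxx pipelines: call `filter` before `join` or `group_by` whenever the predicate does not depend on the joined columns.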
Error Handling
import veloxx as vx
try:
    df = vx.read_csv("data.csv")
    result = df.group_by(["category"]).mean()
    result.to_csv("output.csv")
except FileNotFoundError:
    print("Input file not found")
except Exception as e:
    print(f"Error processing data: {e}")
Integration with Pandas
Convert between Veloxx and Pandas for interoperability:
import veloxx as vx
import pandas as pd
# Pandas to Veloxx
def pandas_to_veloxx(pandas_df):
    columns = {}
    for col in pandas_df.columns:
        data = pandas_df[col].tolist()
        # Convert NaN to None
        data = [None if pd.isna(x) else x for x in data]
        columns[col] = vx.PySeries(col, data)
    return vx.PyDataFrame(columns)
# Veloxx to Pandas
def veloxx_to_pandas(veloxx_df):
    data = {}
    for col_name in veloxx_df.column_names():
        series = veloxx_df.get_column(col_name)
        data[col_name] = series.to_list()
    return pd.DataFrame(data)
# Usage
pandas_df = pd.read_csv("data.csv")
veloxx_df = pandas_to_veloxx(pandas_df)
# Process with Veloxx (faster)
result = veloxx_df.group_by(["category"]).mean()
# Convert back to Pandas if needed
result_pandas = veloxx_to_pandas(result)