Is there a way to explode a column? #264
I'm working on a plugin (https://github.com/JaoMarcos/data_designer_lambda_column).

The Use Case

I need to support cases where a single input record produces multiple output records (1:N), essentially "exploding" the dataframe.

The main driver for this is cost and efficiency with LLMs. For complex prompts with large input contexts, if I need multiple variations (e.g., "Generate 5 variations of X"), it is significantly cheaper and faster to ask the model to generate all 5 in a single API call rather than making 5 separate calls with the same large input.

Generating them in a single pass also often improves quality/variance, as the model has "in-context" awareness of the other variations it is generating, preventing duplicates.

Question

What is the best way to handle this in DatasetBatchManager?
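To make the 1:N shape concrete, here is a minimal sketch in plain pandas (not Data Designer). The column names `context` and `variations` are made up for illustration; the point is only that one input record carrying several generated values becomes several output records.

```python
import pandas as pd

# One input record whose single model response already holds several variations.
# Column names here are illustrative assumptions, not part of any plugin API.
df = pd.DataFrame(
    {
        "context": ["large shared prompt"],
        "variations": [["v1", "v2", "v3"]],
    }
)

# DataFrame.explode turns each list element into its own row (1:N) and
# repeats the other columns; ignore_index=True renumbers the resulting rows.
exploded = df.explode("variations", ignore_index=True)
```

The result has three rows that share the same `context`, which is the output shape a 1:N column task would need to produce.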
Replies: 2 comments
Looks like @andreatgretel has started looking into this as part of issue #265! Please feel free to continue the discussion here or as part of the issue if you have any other questions / feedback 🙌
Hi @JaoMarcos, quick follow-up here.

We did add the underlying row-resize capability for column generators, so 1:N and N:1 behavior is now supported for custom columns and plugin column generators via allow_resize=True.

So the short version is: yes, the engine can now handle “explode/retract” style column tasks, but it is exposed today as a capability on custom/plugin generators rather than as a separate built-in explode column type.

For a plugin column generator, the config opts in with allow_resize=True:

    from typing import Literal

    from data_designer.config.base import SingleColumnConfig


    class VariationsColumnConfig(SingleColumnConfig):
        column_type: Literal["variations-column"] = "variations-column"
        # Opt in to generators that change the number of rows.
        allow_resize: bool = True

        @property
        def required_columns(self) -> list[str]:
            return ["prompt"]

        @property
        def side_effect_columns(self) -> list[str]:
            return ["variation_id"]

Then the implementation can expand one input row into many output rows:

    import pandas as pd

    from data_designer.engine.column_generators.generators.base import ColumnGeneratorFullColumn


    class VariationsColumnGenerator(ColumnGeneratorFullColumn[VariationsColumnConfig]):
        def generate(self, data: pd.DataFrame) -> pd.DataFrame:
            rows = []
            for _, row in data.iterrows():
                # Emit three output rows for every input row.
                for i in range(3):
                    rows.append({
                        "prompt": row["prompt"],
                        "completion": f"variation {i + 1} for {row['prompt']}",
                        "variation_id": i,
                    })
            return pd.DataFrame(rows)

For a custom column, same idea. Set allow_resize=True.

FULL_COLUMN example:

    import pandas as pd

    import data_designer.config as dd


    @dd.custom_column_generator(
        required_columns=["topic"],
        side_effect_columns=["variation_id"],
    )
    def expand_topics(df: pd.DataFrame) -> pd.DataFrame:
        rows = []
        for _, row in df.iterrows():
            # Emit three questions per input topic.
            for i in range(3):
                rows.append({
                    "topic": row["topic"],
                    "question": f"Question {i + 1} about {row['topic']}",
                    "variation_id": i,
                })
        return pd.DataFrame(rows)


    # config_builder is an existing Data Designer config builder created elsewhere.
    config_builder.add_column(
        dd.CustomColumnConfig(
            name="question",
            generator_function=expand_topics,
            generation_strategy=dd.GenerationStrategy.FULL_COLUMN,
            allow_resize=True,
        )
    )

CELL_BY_CELL example:

    import data_designer.config as dd


    @dd.custom_column_generator(required_columns=["id"])
    def expand_row(row: dict) -> list[dict]:
        # Returning a list of dicts produces one output row per dict.
        return [
            {**row, "variant": "a"},
            {**row, "variant": "b"},
        ]


    # config_builder is an existing Data Designer config builder created elsewhere.
    config_builder.add_column(
        dd.CustomColumnConfig(
            name="variant",
            generator_function=expand_row,
            generation_strategy=dd.GenerationStrategy.CELL_BY_CELL,
            allow_resize=True,
        )
    )

This path is well supported in the default synchronous engine today. One current caveat:

If helpful, we can also add a docs example that mirrors your original “generate N variations from one input row” use case more directly.
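As a rough illustration of the original "generate N variations from one input row" use case, the sketch below makes one call per input row and fans the response out into N output rows. It uses plain pandas only. `fake_llm_call` is a hypothetical stand-in for a real model request, and the JSON-list response format is an assumption, not something Data Designer prescribes.

```python
import json

import pandas as pd


def fake_llm_call(prompt: str, n: int) -> str:
    """Hypothetical stand-in for ONE model request that returns n variations as a JSON list."""
    return json.dumps([f"variation {i + 1} for {prompt}" for i in range(n)])


def expand_with_single_call(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Make one call per input row and emit one output row per returned variation."""
    rows = []
    for record in df.to_dict(orient="records"):
        # A single request covers all n variations for this input row.
        variations = json.loads(fake_llm_call(record["prompt"], n))
        for i, text in enumerate(variations):
            # Copy the input columns and add the generated value plus its index.
            rows.append({**record, "completion": text, "variation_id": i})
    return pd.DataFrame(rows)


result = expand_with_single_call(pd.DataFrame({"prompt": ["cats", "dogs"]}), n=5)
```

Because `expand_with_single_call` takes a DataFrame and returns a larger one, it has the same shape as the FULL_COLUMN generator functions above; wiring it into a config with allow_resize=True is left out here and is untested in this sketch.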